New release

A new release is available in the Download section. A few small improvements, detailed below, come with it. Feel free to download and use it. If you have any issue running it, please report it.

Version 0.0.6b improvements

Client side:

  • Ability to disable bandwidth speed limitation by setting it to 0;
  • Ability to disable daily and monthly bandwidth limits by setting it to 0;
  • Handling of more HTTP error codes;
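The "set it to 0 to disable" convention for the new bandwidth limits can be sketched like this (a hypothetical helper, not the actual client code):

```python
def within_limit(used_bytes, limit_bytes):
    """Return True if more bandwidth may be used.

    By convention, a limit of 0 means "no limit at all".
    """
    return limit_bytes == 0 or used_bytes < limit_bytes

print(within_limit(10_000_000, 0))   # True: a limit of 0 never blocks
print(within_limit(500, 1000))       # True: under the limit
print(within_limit(1500, 1000))      # False: over the limit
```

The same check works for the speed limit and the daily/monthly totals.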

Server side:

  • Modification of the data structure to improve data handling speed;
  • Improvement of the duplicates detection.

Many small improvements.

While working on loop and duplicate detection, I’m also still improving the crawler and the servers.

Here is a list of the latest updates:

  • Some updates to the duplicate detection. It’s working quite well but still needs more work.
  • Data sent by the crawlers is now stored separately so it can be checked and validated before being sent for parsing.
  • Link destinations and sources are now stored on the server, not just the URLs. Space has also been reserved to store the keywords related to the links.
  • On the client side, a 30-second delay has been added between two calls to the same server, to avoid being banned from those servers. I personally got banned from Amazon because the crawler was calling them too quickly (no delay).
  • Duplicate pages are reported so some analysis can be done.
  • URL filtering/parsing has been added on both the client and the server side: on the client side to reduce the number of links sent to the server, and on the server side to make sure new filters can be added quickly even if not all the clients are updated.
  • The application is now able to crawl and parse servers on different ports (not just 80) and protocols (not just HTTP).
  • Robots.txt retrieval and parsing have been improved.
  • Since the database can now hold multiple terabytes of data, the filter on .CA URLs has been removed.
  • Crawlers are now provided a much better randomly generated list of pages to parse.

And here are the next updates I’m working on:

  • On the client side, store the last 1000 failing URLs to have an idea of the network issues occurring with the crawler.
  • On the server side, analyze the duplicated pages to understand if there is any pattern (URL parameter, loop, etc) that can be detected and added to the filters.
  • Add parsing of other document formats (PDF, etc.)
  • Start the work on the page ranking.
  • And still all the required improvements on the client side (User interface to update the parameters, etc.)

All the latest modifications I made are compatible with the existing client, so there should not be any issue with the running applications. Regarding the work already done as of today, I had to reset the data: because of the lack of loop/duplicate detection, more than 60% of the content was duplicated. As a result, the data quality was not good enough, and cleaning it would have taken more time than reading it all again.

So the number of pages retrieved came down from 12M to only 3 000 entries, but it will grow again very quickly. A statistics page will be put in place soon to provide live information about the data size.

How to discard looping pages, part 1.

I don’t even know how long this will last, so I can already title it “part 1”, since I’m sure there will be more to come.

The idea here is to avoid page loops. What I mean by a page loop is a page which points to itself, but where the URL and/or content is slightly different. There can be multiple kinds of loops. Let’s look at some of them and see how (if possible) they can be detected and fixed.

Kinds of loops

Loop in parameters

The first kind of loop I figured out is the one based on the URL parameters.

Imagine a page which contains two links pointing to its own URL, one adding “p=true” and the other adding “p=false”. That gives you two links to the same page, each with a different query string.

Since both links point to the same page, that page will, again, produce 2 new links with 2 new URLs pointing to the same page.

Loop in the path

The same way you can have loops in parameters, you can have loops in the path. Let’s imagine someone configured /url as an application on their server. This application builds a page and puts links into it by adding /xxx and /yyy to the current URL. You end up with a /url page containing /url/xxx and /url/yyy links. If you follow the first link, you will land on a page which contains /url/xxx/xxx and /url/xxx/yyy, and so on.

Loop with different URL

I figured out this third kind of loop only yesterday. It happens when a page references itself, without adding any parameter or path, but simply by changing one parameter’s value. It can be a session ID if your cookies are disabled, but it can also be a timestamp. Imagine a page containing a link to /url. If the page automatically adds the current timestamp to its links, the link will be different every time the page is rendered. You will land on the same page, displaying the same content, but with a different URL.

Loop detection

Loop in parameters

This kind of loop is already detected and corrected by the crawler. Basically, all the parameters are retrieved and only one occurrence of each is put back into the URL. At the end, all the duplicates are removed, which reduces the list of possible pages to a minimum. Other web crawlers seem to “simply” remove the parameters altogether.
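The parameter deduplication described above can be sketched like this (an illustration of the idea, not the actual crawler code):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_parameters(url):
    """Keep only one occurrence of each query parameter.

    First occurrence wins; parameters are also sorted so that two URLs
    differing only in parameter order normalize to the same string.
    """
    parts = urlsplit(url)
    seen = {}
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        seen.setdefault(key, value)  # keep the first occurrence only
    query = urlencode(sorted(seen.items()))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

print(normalize_parameters("http://example.com/page?p=true&p=false&x=1"))
# the duplicated "p" keeps only its first value
```

Sorting the parameters is a bonus: “?a=1&b=2” and “?b=2&a=1” collapse to the same URL, which removes another class of duplicates.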

Loop in the path

Like many other web crawlers do, we can detect a loop in the path using a simple rule which checks whether the same sub-directory appears more than 3 times in the URL. For example, a URL containing “/b/c” three times will be discarded.
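A minimal sketch of that repeated-sub-directory check, here implemented with segment counting rather than a regular expression (the threshold and names are illustrative):

```python
from collections import Counter

def has_path_loop(path, max_repeats=3):
    """Flag a likely path loop: the same sub-directory name appearing
    `max_repeats` times or more in the URL path."""
    segments = [s for s in path.split("/") if s]
    return any(n >= max_repeats for n in Counter(segments).values())

print(has_path_loop("/url/xxx/xxx/xxx"))       # True: "xxx" repeats 3 times
print(has_path_loop("/a/b/c/b/c/b/c"))         # True: "b" and "c" repeat
print(has_path_loop("/blog/2024/05/article"))  # False: every segment unique
```

Note this can misfire on legitimate sites whose paths genuinely repeat a segment, so the threshold is a trade-off between losing pages and looping forever.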

Loop with different URL

Some existing websites have similar issues, which might cause trouble for the bots. A very good example is Amazon’s “About Amazon” page. If you open it and look at the left, you will see a link to the very page you are browsing, under the name “About Amazon”. Open this link in a new window and compare the two pages. First, you will notice that the content is different: not the same articles displayed at the bottom, so not the same page content. Now look at the URL. It is almost the same as the previous one, but with a different pf_rd_m parameter. So for the bot it’s new content with a new URL, hence a new page. Wrong. It’s the same page. But there is almost no way to detect it, and the crawlers will read them, follow the links, read the new pages, and so on, forever, until you find a way to detect that. SESSIONID and other jsession parameters cause similar issues by referencing the same page multiple times with different URLs.

I searched over the web and read a lot about this over the last few days, but it seems there is no really good solution. However, a few filters can be put in place to reduce the amount of duplicates. The first one is to remove known session parameters, like jsessionid, from the URLs. The second is to hash the page content and discard it on the server side if another page on the same website has the same hash code. Even with those two filters, the Amazon page described above will still be considered a new page. Removing all the page parameters, like some crawlers do, would solve this issue, but it would also discard too many pages, like PHP forum threads, etc. So I will have to think about some additional filters to prevent such pages from being retrieved.
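The two filters described above can be sketched like this (the session-parameter list and function names are assumptions, not the crawler’s actual ones):

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameter names assumed to encode a session rather than content;
# this list is illustrative and would grow as new ones are discovered.
SESSION_PARAMS = {"jsessionid", "sessionid", "phpsessid", "sid"}

def strip_session_parameters(url):
    """First filter: drop known session parameters from the URL."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), ""))

def content_hash(page_body):
    """Second filter: hash the page content, so the server can discard a
    page whose hash was already seen on the same website."""
    return hashlib.sha256(page_body.encode("utf-8")).hexdigest()

print(strip_session_parameters("http://example.com/x?jsessionid=abc&q=1"))
# the jsessionid parameter is removed, q is kept
```

Neither filter catches the pf_rd_m case, since that parameter also appears on genuinely different pages, which is exactly why extra filters are still needed.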

Other issues.

Even if it’s possible to detect many kinds of loops, there are some loops which will never be detected. Someone can build a web crawler trap (a.k.a. bot trap). It’s as simple as an application which handles all the requests to a server and serves each of them a page with random links and content in it, pointing to the same domain name. Since both the URL and the content are always different, you have no way to identify that your bot is trapped and keeps calling the same single page/application.


New server architecture in place.

It’s done!

There are now 6 servers serving workloads and receiving the results. The time to retrieve and submit data should now be constant, whatever the size of the database.

The database now contains XXXXX entries and is used at only 2.07% of its capacity. If more capacity is required, it will be easy to add some servers/disk space to increase it without impacting the running applications.

I’m very happy with the results. I will now most probably add some load-balancing in front of the application server to split it between 2 different servers.

The total response time for both getting workload and submitting results is now always below 2 seconds. Of that, about 50% is due to the client-to-server network transfer.

Now that all this new architecture is in place, I will be able to get back to work on the crawler algorithm.

Server architecture

The more lines we have on the server side, the slower the server is…

There are currently 13 038 474 URLs in the system. With this data size, it takes only 726 milliseconds to provide workload to a crawler asking for more work, but it takes 47 seconds to commit the results… It was taking one second when the table was 2B lines big, 10 seconds when it was 5B lines, etc. The bigger the table is, the slower the results are.

As a result, I worked more actively on the new architecture. The data retrieval still takes a few milliseconds and should get a bit faster, but the data commit now takes less than 2 seconds in a 40B table! It was taking 2s with a 1B table, it’s still taking 2s with a 40B table, and this time will stay constant. There will be no degradation of the response time even if the load increases. The final goal is to reduce the data commit to less than a second.

Given all that, there are not many modifications I have made on the client side. You can still download and run the current version available in the Download section.

Performances: done!

A lot happened these last few days on the DistParser application.

First, again, about performance. The server is now able to serve new workload and handle results in less than half a second (about 200 ms). This is WAY faster than the 40-second response time from a week ago. I don’t think there is much more that can be improved for now. When the size of the data grows, I will move to a clustered architecture. But for now, the regular database can still handle the load. So I might not have to talk about performance for some time.

Client side, I made many updates to the application.

Everything from the last todo has been done! And even more.

  • Robots.txt query and parsing;
  • Robots.txt caching;
  • Display daily and monthly total for the bandwidth;
  • Limit the instant bandwidth usage;
  • Limit the monthly threshold;
  • Limit of download/upload speed;
  • Save the bandwidth usage when the application is closed to have a daily total;
  • Validate URLs format;
  • Add logs.

First, regarding the robots.txt rules. The crawler now downloads the robots.txt file before querying any URL. The requested URL is validated against this file. Also, to reduce the bandwidth usage, this robots.txt file is stored locally in a disk cache. The cache parameters are not yet configurable but will be in the near future.
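As a quick illustration of the validation step, here is how a robots.txt check looks with Python’s built-in parser, fed offline with a rules string (the actual crawler implementation may differ):

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: everything under /private/ is off limits for all agents.
rules = """User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # in the crawler, the file would be fetched and cached

print(parser.can_fetch("DistParser", "http://example.com/index.html"))  # True
print(parser.can_fetch("DistParser", "http://example.com/private/x"))   # False
```

Caching the parsed rules per host is what saves bandwidth: one robots.txt download covers every URL the crawler visits on that host until the cache entry expires.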

Regarding the bandwidth usage, the application now keeps track of all the bandwidth used, upload and download included, and displays the daily and monthly usage in the title bar. The monthly usage billing period can be configured to start on a specific day of the month. Also, daily and monthly total usage limits are now in place. So if you configure the tool correctly, you can let it run as often as you want without risking going over your provider’s bandwidth allowance. The bandwidth usage is stored in a local file: each time you close the crawler, the information is stored, and each time you open it again, the information is restored. So you will always see accurate numbers.
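A minimal sketch of that save/restore of the bandwidth counters (the file name and fields are assumptions, not the actual format used by the client):

```python
import json
import os

STATE_FILE = "bandwidth.json"  # hypothetical state file name

def save_usage(daily_bytes, monthly_bytes, path=STATE_FILE):
    """Persist the counters when the application is closed."""
    with open(path, "w") as f:
        json.dump({"daily": daily_bytes, "monthly": monthly_bytes}, f)

def load_usage(path=STATE_FILE):
    """Restore the counters on startup; first run starts from zero."""
    if not os.path.exists(path):
        return {"daily": 0, "monthly": 0}
    with open(path) as f:
        return json.load(f)
```

The daily counter would additionally be reset when the date changes, and the monthly one on the configured billing day.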

On the database side, since there was no URL validation, many malformed URLs were stored and were generating crawler work which could be avoided. So a URL format validation has been added. A database cleanup has been made, which greatly reduced the number of URLs stored. Also, a defect on the storage side had corrupted some URL identifiers; I had to delete those entries too.
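The URL format validation could look like this minimal sketch (the actual rules used by the server may well be stricter):

```python
from urllib.parse import urlsplit

def is_valid_url(url):
    """Basic URL format check: an http(s) scheme and a non-empty host."""
    try:
        parts = urlsplit(url)
    except ValueError:  # e.g. malformed IPv6 brackets
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(is_valid_url("http://example.com/page"))  # True
print(is_valid_url("htp:/example"))             # False: bad scheme
print(is_valid_url("http://"))                  # False: empty host
```

Running the same check on both client and server keeps malformed URLs out of the database and out of the workload queues.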

As a result, the database now stores 2 212 550 URLs.

Many other small updates/improvements were made too, like the addition of logs to figure out what the application is doing.

If you wish to participate, don’t hesitate to download the Crawler and run it. See the “How to start?” section.

So for the next release, here are the new features I would like to add:

  • Allow robots.txt disk cache size configuration;
  • Build an interface to see/update parameters;
  • Build an interface to see crawler statistics (Cache info, bandwidth, etc.);
  • Add a graph with the number of pages retrieved/parsed and the domains found;
  • Improve Domain vs URL split rules (to handle ? split when ? not present);
  • Rename application packages.
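For the “Domain vs URL split” item, here is a sketch of a split that behaves the same whether or not a “?” is present (this is my reading of the ToDo item, not the actual rule):

```python
from urllib.parse import urlsplit

def split_domain(url):
    """Split a URL into (domain, rest), keeping any query string in `rest`.

    Works identically with or without a "?" in the URL.
    """
    parts = urlsplit(url)
    rest = parts.path or "/"
    if parts.query:
        rest += "?" + parts.query
    return parts.netloc, rest

print(split_domain("http://example.com/a/b?x=1"))  # ('example.com', '/a/b?x=1')
print(split_domain("http://example.com"))          # ('example.com', '/')
```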

Performances, again and again.

Retrieving data from the internet quickly is fine, but if it takes twice the time to store it, it’s useless…

Over the last few days I worked a lot on performance improvements. I was able to make good progress, and I have a good plan for the future.

I have identified 3 areas of possible improvement in the current architecture.

  1. Disk access is very slow;
  2. The database inserts/updates are slow;
  3. The network can be improved.

To address the disk speed issue, the database has just been moved to an SSD drive. Access time is about 100 times faster, and transfer rates 10 times faster. This will be a big improvement. Also, I have modified the way data is inserted into the database. And last, the server network connection will be changed from wireless to wired.
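The change to the insert path isn’t detailed here; a common approach for this kind of speedup is batching many rows into a single transaction, sketched below with SQLite purely for illustration (the real server uses a different engine):

```python
import sqlite3

# One transaction with executemany instead of one commit per row:
# committing per row forces a disk sync each time, which is what kills
# insert throughput on a spinning disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT PRIMARY KEY)")

rows = [("http://example.com/%d" % i,) for i in range(500)]
with conn:  # a single transaction wraps all 500 inserts
    conn.executemany("INSERT INTO urls VALUES (?)", rows)

count = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
print(count)  # 500
```

Batching the 500 page results of one client submission into one transaction matches the workload shape described in these posts.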

With the first two modifications, the time for a client to submit results went down from 40 seconds to less than 1 second. And of that second, 50% is used by network transfers, which means only half a second is required to submit a data load and query new workload.

For the future, I’m already working on a distributed server-side database to store and process the data even faster. But this is more for the long term. As long as the current database is able to serve all the crawlers’ requests, I will keep it in place.

So now I will most probably start to work back on the client from the last ToDo:

  • Robots.txt query and parsing;
  • Display daily and monthly total for the bandwidth;
  • Limit the instant bandwidth usage;
  • Limit the monthly threshold;
  • Add a graph with the number of pages retrieved/parsed and the domains found;
  • Save the bandwidth usage when the application is closed to have a daily total;

Tomorrow I will try to post an updated version. So feel free to download, install and run the application.

Regarding the statistics, there are now 4 205 860 links in the database, from 59 393 domains, and 897 928 of those pages have already been parsed.


Database performance issues part 2.

I spent the last few days working on the database performance, and it’s still not meeting my expectations. I was able to improve it enough to re-open the server, so you can start running the clients again.

However, this will still need a lot of improvement. I’m looking at a completely different database engine/schema to store the data, but that will take time to implement. At least one week. So in the meantime, I will continue to tweak the existing database a bit and let the clients run.

More to come.

On the other side, I’m now thinking about what can be done with the data retrieved. Maybe the crawler will need to retrieve a bit more information from the pages.

Database performance issues.

Yesterday the number of URLs stored in the database almost reached 4M. The issue is that each insert is now slower. And not just a bit slower, but WAY slower. I tried to improve things by changing the database structure or even the database engine, but it’s still the same.

Inserting on the server the results of 500 parsed pages now takes 40 seconds. This is not acceptable, since it takes almost the same time for one client to parse that number of URLs.

When inserting the load into the database, the CPU is used at less than 10%, but the hard drive is used at 100% for the whole 40 seconds. So there might be some room for improvement on that side.

So I now have a few options:

  • Rework the database structure again to find what’s wrong with it;
  • Change the database engine for another one (again);
  • Completely change the database system I’m using;
  • Improve the hardware.

In the meantime, I stopped the server so as not to overload it. I hope to be able to restart it in the next few days. I will try to keep one client running locally.

Stream parsing improvement.

First thing, you will notice that I have decided to be more creative with the titles 😉 I will try to use the main subject of the post as its title.

Over the week-end, I worked on the server performance. It was way too slow.

First, I moved the server and the database to a new server about 8 times faster than the previous one, with 8 times the memory. That will help increase the number of supported crawlers: I had estimated that the previous server could handle 60 participants, so I can now expect the new one to handle up to 480.

Also, I worked on the part of the code which handles the stream received from the crawler and prepares it for treatment. The initial performance on the dev machine was 94 315 pages handled per second. After optimization, it can now handle 366 709. I have looked at the code so many times that I think there is no other way to improve it. The size of the transferred stream has been reduced by only a few bytes, but in the end the handling is almost 4 times faster, which is a very good improvement.

On the other side, I built the migration procedure to migrate the existing database to the new schema. It takes about one hour to process the table, so there will be a one-hour downtime for the server sometime in the middle of next week. I’m also expecting a big improvement on that side. For now, it takes up to 3 seconds to submit the results of 500 parsed pages and retrieve more work. I will see what the improvement will be, but my goal is to go under one second for the total of the two operations.

Regarding the ToDo from the last post, here is what still needs to be done:

  • Display daily and monthly total for the bandwidth
  • Limit the instant bandwidth usage
  • Limit the monthly threshold.
  • Add a graph with the number of pages retrieved/parsed and the domains found.
  • Save the bandwidth usage when the application is closed to have a daily total.

All of that is on the client side. I will first need to complete the server-side improvements before I take care of it. So I’m not expecting any of those items to be done before next week-end.

So far, from what I can see in the logs, 2 clients are running: one here on the dev machine, and another one, which is also me 😉 So feel free to download the application and run it if you want to participate.

I’m not publishing any version today because the client no longer works with the server modifications. So please use the version available in the Download section.

And to conclude, here are the statistics after almost one week of parsing: the crawlers retrieved 2 944 614 different URLs and found 26 169 domains. 888 279 of the retrieved links have already been parsed.