How to discard looping pages part 1.

I don’t even know how long this will last, so I have already titled it “part 1″ since I’m sure there will be more to come.

The idea here is to avoid page loops. What I mean by a page loop is a page that points to itself, but where the URL and/or content is slightly different. There can be multiple kinds of loops. Let’s look at some of them and see how (if possible) they can be detected and fixed.

Kinds of loops

Loop in parameters

The first kind of loop I identified is the one based on URL parameters.

For example, http://domain.com/url?p=false is a page containing two links that point back to the page URL while adding “p=true” or “p=false”. That gives you a link to http://domain.com/url?p=false&p=false and a link to http://domain.com/url?p=false&p=true .

Since both point to the same page, that page will, again, build 2 new links with 2 new URLs pointing to the same page.

Loop in the path

The same way you can have loops in parameters, you can have loops in the path. Let’s imagine someone configured /url as an application on their server. This application builds a page and puts links into it by adding /xxx and /yyy to the current URL. You end up with a /url page containing /url/xxx and /url/yyy links. If you follow the first link, you reach a page which contains /url/xxx/xxx and /url/xxx/yyy, and so on.

Loop with different URL

I only identified this 3rd kind of loop yesterday. It happens when a page references itself, without adding any parameter or path, but simply by changing one parameter’s value. It can be a session id if your cookies are disabled, but it can also be a timestamp. For example, http://domain.com/url?timestamp=123456789 displays a page where you have a link to /url. If the page automatically adds the timestamp to its links, you will have another link to http://domain.com/url?timestamp=123468126 on this page. You will land on the same page, displaying the same content, but with a different URL.

Loop detection

Loop in parameters

This kind of loop is already detected and corrected by the crawler. Basically, all the parameters are retrieved and only one occurrence of each is put back into the URL. In the end, all the duplicates are removed, which reduces the list of possible pages to a minimum. So web crawlers seem to “simply” remove the duplicated parameters.
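As an illustration, here is a minimal sketch (in Java, not the actual crawler code) of this kind of normalization: split the query string, keep only the first occurrence of each name=value pair, and rebuild the URL.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class UrlNormalizer {

    // Minimal sketch: keep only the first occurrence of each "name=value" pair.
    // Assumes a well-formed query string using '&' as the only delimiter.
    public static String removeDuplicateParameters(String url) {
        int idx = url.indexOf('?');
        if (idx < 0) {
            return url; // no query string, nothing to do
        }
        String base = url.substring(0, idx);
        String query = url.substring(idx + 1);

        Set<String> seen = new LinkedHashSet<>(); // preserves insertion order
        for (String pair : query.split("&")) {
            if (!pair.isEmpty()) {
                seen.add(pair);
            }
        }
        return seen.isEmpty() ? base : base + "?" + String.join("&", seen);
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicateParameters(
                "http://domain.com/url?p=false&p=false&p=true"));
        // -> http://domain.com/url?p=false&p=true
    }
}
```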

Loop in the path

Like many other web crawlers do, we can detect a loop in the path using a simple regular expression that checks whether the same sub-directory sequence appears 3 times or more in the URL. For example, http://domain.com/a/b/c/b/c/b/c contains “/b/c” 3 times and so will be discarded.
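A minimal sketch of such a check, assuming the rule is “the same sub-directory sequence repeated 3 times or more in a row”, as in the example above:

```java
import java.util.regex.Pattern;

public class PathLoopDetector {

    // Minimal sketch: flag a URL whose path contains the same one-or-more-segment
    // sequence at least 3 times in a row. The threshold of 3 comes from the post.
    private static final Pattern REPEATED_SEGMENTS =
            Pattern.compile("(/[^/?#]+(?:/[^/?#]+)*?)\\1{2,}");

    public static boolean looksLikePathLoop(String url) {
        return REPEATED_SEGMENTS.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePathLoop("http://domain.com/a/b/c/b/c/b/c")); // true
        System.out.println(looksLikePathLoop("http://domain.com/a/b/c"));         // false
    }
}
```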

Loop with different URL

Some existing websites have similar issues which might cause trouble for the bots. A very good example is Amazon’s “About Amazon” page. The link to it (sorry, it’s a bit long) is: http://www.amazon.com/Careers-Homepage/b/ref=amb_link_5763692_2?ie=UTF8&node=239364011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=left-4&pf_rd_r=03WW7BQHKJ96NQ6YJE8M&pf_rd_t=101&pf_rd_p=1337714942&pf_rd_i=239364011. If you click on this link, you will land on the Amazon page. If you look at the left, you have a link to the current page you are browsing under the name “About Amazon”. Open this link in a new window and compare the 2 pages. First, you will notice that the content is different: not the same articles displayed at the bottom, so not the same page content. Now, look at the URL. It is almost the same as the previous one, but with a different pf_rd_m parameter. So for the bot, it’s new content with a new URL, so it’s a new page. Wrong. It’s the same page. But there is almost no way to detect it. And the crawlers will read them, follow the links, read the new pages, and so on, forever, until you find a way to detect that.

SESSIONID and other jsessionid parameters cause similar issues by referencing the same page multiple times with different URLs. I searched over the web and read a lot about this over the last few days, but it seems there is no really good solution. However, a few filters can be put in place to reduce the amount of duplicates. The first one is to remove known session parameters like jsessionid from the URLs. The second is to hash the page content and discard the page on the server side if another page on the same website has the same hash code. Even with those 2 filters, the Amazon URL above will still be considered a new page. Removing all the page parameters like some crawlers do would solve this issue, but that would also discard too many pages, like PHP forum threads, etc. So I will have to think about some additional filters to prevent such pages from being retrieved.
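To illustrate the two filters, here is a rough sketch; the list of session parameter names and the class itself are my own assumptions, not something defined by the project.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class DuplicatePageFilter {

    // Parameters commonly used to carry a session id; hypothetical list, to be extended.
    private static final Set<String> SESSION_PARAMS =
            new HashSet<>(Arrays.asList("jsessionid", "sessionid", "phpsessid", "sid"));

    // Filter 1: drop known session parameters from the query string.
    public static String stripSessionParameters(String url) {
        int idx = url.indexOf('?');
        if (idx < 0) {
            return url;
        }
        StringBuilder kept = new StringBuilder();
        for (String pair : url.substring(idx + 1).split("&")) {
            String name = pair.contains("=") ? pair.substring(0, pair.indexOf('=')) : pair;
            if (!SESSION_PARAMS.contains(name.toLowerCase(Locale.ROOT))) {
                kept.append(kept.length() == 0 ? "" : "&").append(pair);
            }
        }
        String base = url.substring(0, idx);
        return kept.length() == 0 ? base : base + "?" + kept;
    }

    // Filter 2: hash the page content; two pages of the same site with the same
    // hash are treated as duplicates and only one is kept.
    public static String contentHash(String pageContent) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(pageContent.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```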

Other issues.

Even if it’s possible to detect many kinds of loops, some loops will never be detected. Someone can build a web crawler trap (a.k.a. bot trap). It’s as simple as an application which handles all the requests to a server and serves each of them a page with some random links and content in it, pointing to the same domain name. Since both the URL and the content are always different, you have no way to identify that your bot is trapped and keeps calling the same single page/application.

 

New server architecture in place.

It’s done!

There are now 6 servers serving workloads and receiving the results. The time to retrieve and submit data should stay constant whatever the size of the database becomes.

The database now contains XXXXX entries and is used at only 2.07% of its capacity. If more capacity is required, it will be easy to add servers/disk space to increase it without impacting the running applications.

I’m very happy with the results. I will now most probably add some load balancing in front of the application server to split the load between 2 different servers.

The total response time for both getting a workload and submitting results is now always below 2 seconds. Of that, about 50% is due to the client-to-server network transfer.

Now that all this new architecture is in place, I will be able to get back to work on the crawler algorithm.

Server architecture

The more lines we have on the server side, the slower the server is…

There are currently 13 038 474 URLs in the system. With this data size, it takes only 726 milliseconds to provide a workload to a crawler asking for more work, but it takes 47 seconds to commit the results… It was taking one second when the table was 2B lines big, 10 seconds when it was 5B lines, etc. The bigger the table is, the slower the commits are.

As a result, I worked more actively on the new architecture. The data retrieval still takes a few milliseconds and should become a bit faster, but the data commit now takes less than 2 seconds in a 40B table! It was taking 2 seconds with a 1B table, it’s still taking 2 seconds with a 40B table, and this time will stay constant. There will be no degradation of the response time even if the load increases. The final goal is to reduce the data commit to less than a second.

That said, I have not made many modifications on the client side. You can still download and run the current version available in the download section.

Performances: done!

A lot happened these last few days on the DistParser application.

First, again, about performance. The server is now able to serve new workloads and handle results in less than half a second (about 200 ms). This is WAY faster than the 40-second response time from a week ago. I don’t think there is much more that can be improved for now. When the size of the data grows, I will move to a clustered architecture. But for now, the regular database can still handle the load. So I might not have to talk about performance for some time.

On the client side, I made many updates to the application.

Everything from the last todo has been done! And even more.

  • Robots.txt query and parsing;
  • Robots.txt caching;
  • Display daily and monthly total for the bandwidth;
  • Limit the instant bandwidth usage;
  • Limit the monthly threshold;
  • Limit the download/upload speed;
  • Save the bandwidth usage when the application is closed to have a daily total;
  • Validate URL format;
  • Add logs.

First, regarding the robots.txt rules. The crawler now downloads the robots.txt file before querying any URL. The requested URL is validated against this file. Also, to reduce bandwidth usage, this robots.txt file is stored locally in a disk cache. The cache parameters are not yet configurable, but will be in the near future.
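As an illustration only, here is a very reduced sketch of the idea; this is not the crawler’s actual code, and real robots.txt handling involves more rules (per-agent groups, Allow directives, wildcards, caching policy).

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class RobotsTxtChecker {

    // Fetch /robots.txt once per host, collect the "Disallow" prefixes of the
    // "User-agent: *" group, and check URLs against them.
    private final List<String> disallowed = new ArrayList<>();

    public RobotsTxtChecker(String host) throws IOException {
        URL robots = new URL("http://" + host + "/robots.txt");
        try (Scanner in = new Scanner(robots.openStream(), StandardCharsets.UTF_8.name())) {
            boolean inDefaultGroup = false;
            while (in.hasNextLine()) {
                String line = in.nextLine().trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    inDefaultGroup = line.substring(11).trim().equals("*");
                } else if (inDefaultGroup && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) {
                        disallowed.add(path);
                    }
                }
            }
        }
    }

    public boolean isAllowed(String url) {
        String path = URI.create(url).getPath();
        for (String prefix : disallowed) {
            if (path != null && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```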

Regarding bandwidth usage, the application now keeps track of all the bandwidth used, both upload and download, and displays the daily and monthly usage in the title bar. The monthly billing period can be configured to start on a specific day of the month. Also, daily and monthly total usage limits are now in place. So if you configure the tool correctly, you can let it run as often as you want without risking going over your provider’s allowed bandwidth. The bandwidth usage is stored in a local file: each time you close the crawler, the information is saved, and each time you open it again, the information is restored. So you will always see accurate numbers.
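A minimal sketch of this bookkeeping, with made-up class and file names, could look like this:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the idea (class and file names are made up):
// count every byte sent/received, persist the totals on shutdown, reload on start.
public class BandwidthTracker {

    private final AtomicLong uploaded = new AtomicLong();
    private final AtomicLong downloaded = new AtomicLong();
    private final Path store = Paths.get(System.getProperty("user.home"), ".distparser-bandwidth");

    public void addUpload(long bytes)   { uploaded.addAndGet(bytes); }
    public void addDownload(long bytes) { downloaded.addAndGet(bytes); }

    // Called when the application starts: restore the previous totals if present.
    public void load() throws IOException {
        if (Files.exists(store)) {
            List<String> lines = Files.readAllLines(store);
            if (!lines.isEmpty()) {
                String[] parts = lines.get(0).split(";");
                uploaded.set(Long.parseLong(parts[0]));
                downloaded.set(Long.parseLong(parts[1]));
            }
        }
    }

    // Called when the application closes: persist the totals for the next run.
    public void save() throws IOException {
        Files.write(store, (uploaded.get() + ";" + downloaded.get()).getBytes());
    }
}
```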

On the database side, since there was no URL validation, many malformed URLs were stored and were generating crawler work that could be avoided. So a URL format validation has been added. A database cleanup has been done, which reduced the number of stored URLs a lot. Also, a defect on the storage side had corrupted some URL identifiers; I had to delete those entries too.
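The exact validation rules are not documented in the post, but a simple format check along these lines would already reject most malformed URLs:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlValidator {

    // Minimal sketch of a format check (not the exact rules used by the project):
    // the URL must parse, be absolute, use http/https and have a host.
    public static boolean isWellFormed(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            return uri.isAbsolute()
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // malformed URL, discard it
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("http://domain.com/url?p=1")); // true
        System.out.println(isWellFormed("htp:/broken url"));           // false
    }
}
```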

As a result, the database now stores 2 212 550 URLs.

Many other small updates/improvements were made too, like the addition of logs to figure out what the application is doing.

If you wish to participate, don’t hesitate to download the crawler and run it. See the “How to start?” section.

So for the next release, here are the new features I would like to add:

  • Allow robots.txt disk cache size configuration;
  • Build an interface to see/update parameters;
  • Build an interface to see crawler statistics (Cache info, bandwidth, etc.);
  • Add a graph with the number of pages retrieved/parsed and the domains found;
  • Improve Domain vs URL split rules (to handle ? split when ? not present);
  • Rename application packages.

Performances, again and again.

Retrieving data from the internet quickly is fine, but if it takes twice the time to store it, it’s useless…

Over the last few days I worked a lot on performance improvements. I was able to make good progress, and I have a good plan for the future.

I have identified 3 areas of possible improvement in the current architecture.

  1. Disk access is very slow;
  2. The database inserts/updates are slow;
  3. The network can be improved.

To address the disk speed issue, the database has just been moved to an SSD drive. Access time is about 100 times faster, and transfer rates about 10 times. This will be a big improvement. Also, I have modified the way data is inserted into the database. And last, the server network connection will be changed from wireless to wired.
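The post does not detail how the inserts were changed; one common way to speed up this kind of bulk insert is to batch all the rows of a workload result into a single transaction, roughly like this (table, column and connection names are made up for illustration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ResultWriter {

    // Hypothetical example: batch the inserts of one 500-page workload result
    // inside a single transaction instead of issuing one commit per URL.
    public static void storeUrls(List<String> urls) throws SQLException {
        try (Connection cn = DriverManager.getConnection("jdbc:mysql://localhost/distparser")) {
            cn.setAutoCommit(false); // one transaction for the whole workload
            try (PreparedStatement ps =
                         cn.prepareStatement("INSERT INTO urls (url) VALUES (?)")) {
                for (String url : urls) {
                    ps.setString(1, url);
                    ps.addBatch();
                }
                ps.executeBatch(); // one round-trip instead of urls.size() of them
            }
            cn.commit();
        }
    }
}
```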

With the first 2 modifications, the time for a client to submit results went down from 40 seconds to less than 1 second. And of that second, 50% is used by network transfers, which means only half a second is required to submit the results and query a new workload.

For the future, I’m already working on a distributed server-side database to store and process the data even faster. But this is more for the long term. As long as the current database is able to serve all the crawler requests, I will keep it in place.

So now I will most probably get back to work on the client, starting with the last ToDo:

  • Robots.txt query and parsing;
  • Display daily and monthly total for the bandwidth;
  • Limit the instant bandwidth usage;
  • Limit the monthly threshold;
  • Add a graph with the number of pages retrieved/parsed and the domains found;
  • Save the bandwidth usage when the application is closed to have a daily total;

Tomorrow I will try to post an updated version. So feel free to download, install and run the application.

Regarding the statistics, there are now 4 205 860 links in the database from 59 393 domains, and 897 928 of those pages have already been parsed.


Database performance issues part 2.

I spent the last few days working on the database performance, and it’s still not meeting my expectations. I was able to improve it enough to re-open the server, so you can start running the clients again.

However, this will still need a lot of improvement. I’m looking at a completely different database engine/schema to store the data. But that will take time to implement: at least one week. So in the meantime, I will continue to tweak the existing database a bit and let the clients run.

More to come.

On the other side, I’m now thinking about what can be done with the data retrieved. Maybe the crawler will need to retrieve a bit more information from the pages.

Database performance issues.

Yesterday the number of URLs stored in the database almost reached 4M. The issue is that it now gets slower each time an insert is done. And not just a bit slower, but WAY slower. I tried to improve it by changing the database structure and even the database engine, but it’s still the same.

Inserting the results of 500 parsed pages on the server now takes 40 seconds. This is not acceptable, since it takes almost the same time for one client to parse that number of URLs.

When inserting the load into the database, the CPU is used at less than 10%, but the hard drive is at 100% for the whole 40 seconds. So there might be some room for improvement on that side.

So I now have few options:

  • I rework the database structure again to find what’s wrong with it;
  • I change the database engine for another one (again);
  • I completely change the database system I’m using;
  • I improve the hardware.

In the meantime, I stopped the server so as not to overload it. I hope to be able to restart it in the next few days. I will try to keep one client running locally.

Stream parsing improvement.

First thing, you will notice that I have decided to be more creative with the titles 😉 I will try to use the main subject of the post as the title.

Over the weekend, I worked on the server performance. It was way too slow.

First, I moved the server and the database to a new machine about 8 times faster than the previous one, with 8 times the memory. That will help increase the number of supported crawlers. I had estimated that the old server could handle 60 participants, which means I can now expect the new one to handle up to 480 participants.

Also, I worked on the part of the code which handles the stream received from the crawler and prepares it for processing. The initial performance on the dev machine was 94315 pages handled per second. After optimization, it can now handle 366709. I looked at the code so many times that I think there is no other way to improve it. The size of the transferred stream has been reduced by only a few bytes, but in the end the processing is almost 4 times faster, which is a very good improvement.

On the other side, I built the migration procedure to migrate the existing database to the new schema. It takes about one hour to process the table, so there will be a one-hour downtime for the server sometime in the middle of next week. I’m also expecting a big improvement on that side. For now, it takes up to 3 seconds to submit the results of 500 parsed pages and retrieve more work. I will see what the improvement is, but my goal is to go under one second for the two operations combined.

Regarding the ToDo from the last post, here is what still needs to be done:

  • Display daily and monthly total for the bandwidth
  • Limit the instant bandwidth usage
  • Limit the monthly threshold.
  • Add a graph with the number of pages retrieved/parsed and the domains found.
  • Save the bandwidth usage when the application is closed to have a daily total.

All of that is on the client side. I first need to complete the server-side improvements before I take care of it. So I’m not expecting any of those items to be done before next weekend.

So far, from what I can see in the logs, 2 clients are running: one here on the dev machine, and another one, which is also me 😉 So feel free to download the application and run it if you want to participate.

I’m not publishing any version today because the client is not working anymore because of the server modifications. So please use the version available in the Download section.

And to conclude, here are the statistics after almost one week of parsing: the crawlers retrieved 2944614 different URLs and found 26169 domains. 888279 of the retrieved links have already been parsed.

Today’s status

Today I did not get a chance to work on what I had decided yesterday. There are now 2697274 URLs in the database and 25107 domains; 893001 pages have already been parsed.

The problem now is the server performance. It takes 40 seconds for the server to process the results and almost 20 seconds to generate a workload. When it sometimes takes 30 seconds for a client to parse an entire workload, that means a single client will be faster than the server… Useless. So I had to migrate the application to another server. The previous server was an old 1.2GHz fanless computer with only 1GB of memory. The new server is way more efficient and might be able to serve at least 60 clients. It’s better, but still not enough. So in the meantime I will have to take a look at the database to see how I can improve the load. I worked on that a few hours today with no results. I will try to spend some more time on it over the weekend.

I also got some time to think about and work on cycle detection. The one I have put in place is quite nice: it does a nice job on many entries and has reduced the amount of work a lot. I figured that there might be many duplicates in the database; maybe I will have to reset it. Anyway, I have a few ideas in my head to optimize all of that, and some of them will require a server data structure update… So I’m not sure I will be able to keep the current data, but I will try very hard to! More to come…

You can still download the client. I have updated it to version 0.0.4b. It’s now very fast, so be careful with your bandwidth usage.

So from yesterday’s goals, here is what has been done:

  • Ensure the application title is always refreshed even when sending results or loading new workloads.
  • Continue to work on cycle detection.
  • Improve the server performance.

And here is what is still to be done:

  • Display daily and monthly total for the bandwidth
  • Limit the instant bandwidth usage
  • Limit the monthly threshold.
  • Add a graph with the number of pages retrieved/parsed and the domains found.
  • Improve the server performance. (Need more improvement)
  • Save the bandwidth usage when the application is closed to have a daily total.


Today’s status

As usual, some updates for today. But not that many.

First, the results. We are at 1 358 312 links found across 18 753 different domains. 629 909 pages have already been retrieved and parsed.

I tried to work on cycle detection for the URLs, without success. The perfect example of the need for cycle detection is the page http://www.abilities.ca/agc/disclaimer.php, where there is a link inside (view the source) which adds “&screenreader=on” to the URL. When you click on it, you are redirected to http://www.abilities.ca/agc/disclaimer.php?&screenreader=on, where there is again a link which adds “&screenreader=on” to the URL. When you click on it, you are redirected to http://www.abilities.ca/agc/disclaimer.php?&screenreader=on&screenreader=on. I think you can imagine the rest.

One way to detect it could be to start from the end, read the last parameter, and see if the one before is the same. But what if there are 2 parameters? Like http://URL/path?&p1=a&p2=b&p1=a&p2=b: the last 2 parameters are not the same. OK, I can search for part of the string, like “p1=a&p2=b”. But what if the order changes? http://URL/path?&p1=a&p2=b&p2=b&p1=a will not be detected. Another solution is to parse all the parameters, remove the duplicates, and put them back. OK, but what if the URL is not using standard delimiters, like http://URL/path?&p1=a;p2=b;p1=a;p2=b? It’s even worse when there are malformed URLs. And so on.

There are so many possibilities that I came to the conclusion that perfect cycle detection is, unfortunately, almost impossible.

So instead of wasting time on a very complicated cycle detection, I will implement a simple one (if you want to help and provide one, you are welcome!). I will just parse the parameters, if there are any and if they have a standard format, and remove the duplicates. If I find any “strange” character or format, I will simply skip the cycle detection and continue with the URL as-is. At some point the URL will be trimmed to its maximum length anyway. But if there is more than one link like that on a page, this can still end up in thousands of retrievals of the same page. A sketch of this simple approach is shown below.
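Here is a rough sketch of that simple approach, a variant of the normalization shown earlier in this series, with the “skip when the format looks non-standard” rule added; it is an illustration rather than the exact code.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class SimpleCycleReducer {

    // Only touch URLs whose query string looks "standard": name=value pairs
    // separated by '&'. Anything else is left untouched (cycle detection skipped).
    private static final Pattern STANDARD_QUERY =
            Pattern.compile("([^&=;]+=[^&=;]*)(&[^&=;]+=[^&=;]*)*");

    public static String reduce(String url) {
        int idx = url.indexOf('?');
        if (idx < 0) {
            return url;
        }
        String query = url.substring(idx + 1);
        if (!STANDARD_QUERY.matcher(query).matches()) {
            return url; // "strange" format: skip cycle detection, keep URL as-is
        }
        Set<String> unique = new LinkedHashSet<>();
        for (String pair : query.split("&")) {
            unique.add(pair); // duplicates of the same name=value pair collapse here
        }
        return url.substring(0, idx) + "?" + String.join("&", unique);
    }

    public static void main(String[] args) {
        System.out.println(reduce("http://URL/path?p1=a&p2=b&p1=a&p2=b"));
        // -> http://URL/path?p1=a&p2=b
        System.out.println(reduce("http://URL/path?p1=a;p2=b;p1=a;p2=b"));
        // -> unchanged, the ';' delimiter is treated as a non-standard format
    }
}
```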

I was also supposed to work on the bandwidth usage, and I have started that work. So far, the total bandwidth used for upload and download is displayed, and so is the instant bandwidth usage. The instant bandwidth usage is the average bandwidth used over the last 60 seconds, so it is only displayed after 60 seconds. Even though I tried to keep this information as accurate as possible by including header sizes, URL sizes, calls to the DistParser server, etc., there might be some communications that I’m missing and which are therefore missing from the total. So this can be used to get a good idea of the tool’s bandwidth usage, but you might still need to track your bandwidth usage with your provider just to avoid any bad surprise. Over the next few weeks I will try to compare what the tool reports with my real bandwidth usage and see how close I am.
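As an illustration of the 60-second average (again, not the actual client code), the bookkeeping could look like this:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of the "instant bandwidth" idea: keep per-second byte samples
// for the last 60 seconds and average them once the window is full.
public class InstantBandwidth {

    private static final int WINDOW_SECONDS = 60;
    private final Deque<long[]> samples = new ArrayDeque<>(); // [timestampSeconds, bytes]

    public synchronized void record(long bytes) {
        long now = System.currentTimeMillis() / 1000;
        samples.addLast(new long[]{now, bytes});
        evictOld(now);
    }

    // Average bytes per second over the last 60 seconds, or -1 while the window
    // is not full yet (the post only displays the value after 60 seconds).
    public synchronized long bytesPerSecond() {
        long now = System.currentTimeMillis() / 1000;
        evictOld(now);
        if (samples.isEmpty() || now - samples.peekFirst()[0] < WINDOW_SECONDS - 1) {
            return -1;
        }
        long total = 0;
        for (long[] s : samples) {
            total += s[1];
        }
        return total / WINDOW_SECONDS;
    }

    private void evictOld(long now) {
        while (!samples.isEmpty() && samples.peekFirst()[0] < now - WINDOW_SECONDS) {
            samples.removeFirst();
        }
    }
}
```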

I have updated the main loop of the tool to reduce CPU usage. The previous version (0.0.2b) was using 100% of the CPU to refresh the display. Now the display is refreshed only once per second, and the entire application uses less than 2%.

I have also put in place a configuration file which you can find in your user directory. This configuration file contains the number of concurrent crawlers you want to use and the number of workloads you want to prepare. There is no validation of those values, so using out-of-range values might cause the application to crash.

One last thing. The title of the application displays information about the work being done. From left to right, you will find, for each loaded workload, “number of URLs to retrieve”/“number of crawlers” followed by “|”. Then you have upload/download (in bytes) and then the instant bandwidth usage.
Again, version 0.0.3b is available for download. Feel free to download and run it.

Todo for the next days:

  • Ensure the application title is always refreshed even when sending results or loading new workloads.
  • Save the bandwidth usage when the application is closed to have a daily total.
  • Display daily and monthly total for the bandwidth
  • Limit the instant bandwidth usage
  • Limit the monthly threshold.
  • Add a graph with the number of pages retrieved/parsed and the domains found.
  • Continue to work on cycle detection.
  • Improve the server performance.