Crawling issues you should expect

When you think about crawling the entire web, you might first think it’s something pretty easy and straightforward. Open a page, read all the links, close the page. Store the results and the page information you want to keep. Then open all the links, and so on.

Basically, that’s the idea. Easy. But the further you go on this adventure, the more issues and corner cases you will find, leading to situations that can cause trouble for your application.
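To make that starting point concrete, here is a minimal sketch of the naive crawl loop in Python. It is only an illustration of the idea, not DistParser’s actual code, and it deliberately ignores everything discussed below: robots.txt, duplicates, malformed URLs, size limits, and so on.

```python
from collections import deque
from urllib.parse import urljoin
import re
import urllib.request

def naive_crawl(seed, max_pages=100):
    """Open a page, read all the links, queue them, repeat."""
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # network error, bad URL, non-HTML content...
        for href in re.findall(r'href="([^"]+)"', html, flags=re.IGNORECASE):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```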

I already discussed loops and duplicate pages, like two pages on the same domain with different URLs serving the same content. A good example is when you have a session ID in the URL. There are some options to detect and filter them, but only within the same domain. What if two domain names serve the exact same content, like www.domain.com serving the exact same content as www.niamod.com? You can’t compare against all the pages you already retrieved to confirm whether this is a duplicate or not. So in the end, you will still have some duplicates, and I don’t think there is any technical solution to avoid them.
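For the same-domain case, one common trick is to normalize the URL before using it as a duplicate key. Here is a small illustrative sketch; the list of session parameter names is just an example, not DistParser’s actual filter list.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameter names commonly used for session IDs (illustrative, not exhaustive).
SESSION_PARAMS = {"sessionid", "session_id", "sid", "phpsessid", "jsessionid"}

def strip_session_id(url):
    """Drop session-style query parameters so two URLs that differ only by
    session ID collapse to the same key (only helps within one domain)."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))

# strip_session_id("http://www.domain.com/page?PHPSESSID=abc123&x=1")
#   -> "http://www.domain.com/page?x=1"
```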

Before loading a page, you have to validate the robots.txt file to figure out whether you are allowed to parse this page. Of course, you can ignore it, but it’s good practice to read it first. Now, what if the robots.txt file format is not correct? Or what if the URL to retrieve the robots file is giving you an error? Are you going to allow the crawler to read the page? Or is it better to skip it? Once the crawlers were already running, I even found a robots.txt file bigger than 4 MB… which is bigger than the limit I have set for all pages in the crawler, so that robots file has been discarded. In the end, there are many robots.txt files you will never be able to find, retrieve, download or even parse.
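Here is a hedged sketch of how such a fetch could look in Python, reusing the same 4 MB cap mentioned above. The cap value and the “return None on error” behaviour are assumptions for illustration; what the crawler should do when robots.txt is unusable is exactly the open question above.

```python
import urllib.request
import urllib.robotparser

MAX_ROBOTS_SIZE = 4 * 1024 * 1024  # reuse the 4 MB page-size cap (assumption)

def load_robots(site_root, timeout=10):
    """Fetch and parse robots.txt; return None when it cannot be used."""
    rp = urllib.robotparser.RobotFileParser()
    try:
        with urllib.request.urlopen(site_root + "/robots.txt", timeout=timeout) as resp:
            raw = resp.read(MAX_ROBOTS_SIZE + 1)
    except Exception:
        return None  # missing, erroring or unreachable robots.txt
    if len(raw) > MAX_ROBOTS_SIZE:
        return None  # oversized file, discarded as described above
    rp.parse(raw.decode("utf-8", errors="replace").splitlines())
    return rp

# rp = load_robots("http://www.example.com")
# One possible policy: crawl when robots.txt is unusable, otherwise obey it.
# allowed = rp is None or rp.can_fetch("DistParser", "http://www.example.com/page")
```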

Now, you think this page is not a duplicate and the robots.txt file allows you to retrieve it, but when you load the page, the web server sends you an HTTP 419 error code. As you will figure out if you search for it, this is not a valid HTTP status code. Here is the list of error codes. So what should you do with the content of this page? Is the error code just a mistake and the page is correct? Or is this an error page? For DistParser, I have decided to discard pages whose error codes are not standard.
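One way to implement that “discard non-standard codes” rule (a sketch, not necessarily how DistParser does it internally) is to check the response code against the list of registered HTTP status codes:

```python
from http import HTTPStatus

# All status codes registered in Python's HTTPStatus enum.
STANDARD_CODES = {status.value for status in HTTPStatus}

def keep_page(status_code):
    """Keep only responses whose status code is a registered HTTP code."""
    return status_code in STANDARD_CODES

# keep_page(200) -> True
# keep_page(419) -> False, so the page would be discarded
```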

So you got a link, you looked at the robots file, you called the page and retrieved it correctly. You now need to parse it. You might face many issues on the parsing side. First, the HTML might not be strict and you might have some trouble parsing it. A page can also be sent as an HTML page and not actually be HTML. And when you find links in the page, they might contain characters that are invalid in URLs, like pipes (|), spaces, etc. You can do some cleanup, but how can you be sure you have handled all the possible format failures? It’s very difficult to be 100% sure. One option is to check against a specific format and discard everything else, but that way you might miss some URLs. So you will have to make a decision there. For DistParser, I have decided to clean the URLs as much as possible but keep them all, to be sure to retrieve as many links as possible.
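As an illustration of the “clean but keep” approach, here is a small sketch that percent-encodes the invalid characters instead of dropping the URL. The set of characters treated as safe is an assumption for the example, not DistParser’s exact rule.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def clean_url(raw):
    """Percent-encode characters that are invalid in URLs (spaces, pipes, ...)
    instead of throwing the link away."""
    parts = urlsplit(raw.strip())
    path = quote(parts.path, safe="/%:@+,;=~-._!$&'()*")
    query = quote(parts.query, safe="=&%/:@+,;~-._!$'()*?")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

# clean_url("http://example.com/a b|c?x=1 2")
#   -> "http://example.com/a%20b%7Cc?x=1%202"
```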

Last, you will need to think a bit about the data size of what you are going to retrieve and store. Let’s imagine there is an average of 32 links per page. Just go to amazon.com and count them: you will find more than 100. For each link, we need to store, at least, the link and the associated keywords. A link is on average 32 bytes long, and for keywords, let’s say we will store almost the same thing, 32 bytes. That means, for one average page, we will have 32*32*32 = 32 KB… Today, the database holds 3,154,294 URLs proposed to the crawler for parsing. This represents 96 GB of data to store. Add the replication level to secure your data, the database overhead, etc., and you will quickly end up with many TB even with only 3M URLs. In 2008, Google had stored more than 1,000,000,000,000 URLs, and that was 4 years ago. I let you imagine what that represents in GB. So be prepared and scale your system based on what you want to do with it.
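Putting the same back-of-envelope figures into code (the numbers are the ones above, rounded; this is only the rough estimate, not a measurement):

```python
URLS_IN_QUEUE = 3_154_294          # URLs currently proposed to the crawler
BYTES_PER_PAGE = 32 * 1024         # the ~32 KB per-page estimate from above

total_bytes = URLS_IN_QUEUE * BYTES_PER_PAGE
print(f"{total_bytes / 1024**3:.0f} GB")   # -> 96 GB, before replication and DB overhead
```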

As you can see, there are many issues you will face along the road, and there might even be issues I have not faced yet. I initially thought this would be an easy project to put in place, but I now realize it’s a bit more difficult than expected.

However, with the server side being more stable and robust, and with all the past improvements on the client side, I think I’m close to having the first release of DistParser. The next step will most probably be to work on the data processing and see what can be extracted from the database.

New release

A new release is available in the Download section. A few small improvements, detailed below, come with it. Feel free to download and use it. If you have any issue running it, please report it.

Version 0.0.6b improvements

Client side:

  • Ability to disable bandwidth speed limitation by setting it to 0;
  • Ability to disable daily and monthly bandwidth limits by setting it to 0;
  • Handling of more HTTP error codes;

Server side:

  • Modification of the data structure to improve data handling speed;
  • Improvement of the duplicates detection.

Many small improvements.

While I’m still working on loop and duplicate detection, I’m also improving the crawler and the servers.

Here is a list of the last updates done.

  • Some updates done on the duplicate detection. It’s working quite well but still needs some more work.
  • Data sent by the crawlers is now stored separately so it can be checked and validated before being sent for parsing.
  • Link destinations and sources are now stored on the server, not only the URLs. Also, space has been reserved to store the keywords related to the links.
  • On the client side, a 30-second delay has been added between two calls to the same server to avoid being banned from those servers (a small sketch of this throttling follows the list). I personally got banned from Amazon because the crawler was calling them too quickly (no delay).
  • Duplicate pages are reported so some analysis can be done.
  • URL filtering/parsing has been added in both the client and the server side. On the client side to reduce the number of links sent to the server. On the server side to ensure new filters can be added quickly even if all the clients are not updated.
  • The application is now able to crawl and parse servers on different ports (not just 80) and protocols (not just HTTP).
  • Robots.txt retrieval and parsing have been improved.
  • Since the database can now receive multiple terabytes of data, the filter on .CA URLs has been removed.
  • Crawlers are now provided with a much better randomly generated list of pages to parse.
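Here is a minimal sketch of how the 30-second per-host delay mentioned above could be enforced on a single-threaded client. It is only an illustration of the idea, not the actual DistParser implementation.

```python
import time

class HostThrottle:
    """Keep at least `delay` seconds between two calls to the same host."""
    def __init__(self, delay=30.0):
        self.delay = delay
        self.last_call = {}

    def wait(self, host):
        last = self.last_call.get(host)
        if last is not None:
            remaining = self.delay - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
        self.last_call[host] = time.monotonic()

# throttle = HostThrottle()
# throttle.wait("www.amazon.com")   # first call returns immediately
# throttle.wait("www.amazon.com")   # second call sleeps for ~30 seconds
```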

And here are the next updates I’m working on:

  • On the client side, store the last 1000 failing URLs to have an idea of the network issues occurring with the crawler.
  • On the server side, analyze the duplicated pages to understand if there is any pattern (URL parameter, loop, etc) that can be detected and added to the filters.
  • Add parsing of other document formats (PDF, etc.)
  • Start the work on the page ranking.
  • And still all the required improvements on the client side (User interface to update the parameters, etc.)

All the latest modifications I made are compatible with the existing client, so there should not be any issue with the running applications. Regarding the work already done as of today, I had to reset the data. Because of the lack of loop/duplicate detection, more than 60% of the content was duplicated. As a result, the data quality was not good enough, and cleaning it would have taken more time than reading it all again.

So the number of pages retrieved has come down from 12M to only 3,000 entries, but it will grow very quickly. A statistics page will be put in place soon to provide live information about the data size.