When you think about crawling the entire web, you might first think it’s something pretty easy and straightforward. Open a page, read all the links, close the page. Store the results and the page information you want to keep. Then open all the links, and so on.
Basically, that’s the idea. Easy. But the further you go in this adventure, the more issues and special cases you will find, leading to situations that can cause trouble for your application.
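The basic loop described above can be sketched in a few lines. This is only a minimal illustration, not DistParser's code: `crawl` and `fetch_links` are hypothetical names, and `fetch_links` stands in for whatever downloads a page and returns the links found on it.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(start_url: str, fetch_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Visit pages breadth-first, opening each URL only once."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)              # "store the results"
        for link in fetch_links(url):    # "read all the links"
            if link not in seen:         # skip URLs that are already queued
                seen.add(link)
                queue.append(link)
    return visited


# A tiny in-memory "web" with a loop between a and b.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": []}
order = crawl("a", lambda url: graph.get(url, []))
```

The `seen` set is what keeps the loop between `a` and `b` from running forever. Everything else in this post is about the cases this sketch ignores.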
I have already discussed loops and duplicate pages, like two pages in the same domain with different URLs serving the same content. A good example is when there is a session ID in the URL. There are some options to detect and filter them, but only if they are inside the same domain. What if two domain names are serving the exact same content, like www.domain.com serving the exact same content as www.niamod.com? You can’t check every page you have already retrieved to confirm whether a new one is a duplicate. So in the end you will still have some duplicates, and I don’t think there is any technical solution to avoid them.
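For the same-domain case with a session ID in the URL, one possible filter is to drop the session parameter before comparing URLs. The sketch below is an assumption, not what DistParser does: the parameter names in `SESSION_PARAMS` are just common examples, and `strip_session_id` is a hypothetical helper. It does nothing for the two-domains case.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that carry a session ID (illustrative only).
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL without query parameters that look like session IDs."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


first = strip_session_id("http://www.domain.com/page?id=7&PHPSESSID=abc123")
second = strip_session_id("http://www.domain.com/page?id=7&PHPSESSID=zzz999")
```

Both URLs end up identical, so the second one can be recognized as already seen.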
Before loading a page, you have to check the robots.txt file to figure out whether you are allowed to parse this page. Of course, you can ignore it, but it’s good practice to read it first. Now, what if the robots.txt file format is not correct? Or what if the URL used to retrieve the robots file gives you an error? Are you going to allow the crawler to read the page, or is it better to avoid it? Once the crawlers were running, I even found a robots.txt file bigger than 4MB… which is bigger than the limit I have set for all pages in the crawler. So that robots file was discarded. In the end, there are many robots.txt files you will never be able to find, retrieve, download or even parse.
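The questions above come down to choosing a policy for when the robots file is missing, broken or too big. Here is a minimal sketch using Python's standard `urllib.robotparser`. The size limit, the `DistParserBot` user agent name and the `allow_on_failure` switch are all assumptions made for the example; the post does not say which values or policy DistParser uses.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Assumed limit; the real crawler limit is not given in this post.
MAX_ROBOTS_BYTES = 1024 * 1024


def is_allowed(
    robots_body: Optional[bytes],
    page_url: str,
    user_agent: str = "DistParserBot",  # hypothetical user agent name
    allow_on_failure: bool = False,
) -> bool:
    """Decide whether page_url may be fetched, given the raw robots.txt body.

    robots_body is None when the robots file could not be retrieved.
    """
    if robots_body is None or len(robots_body) > MAX_ROBOTS_BYTES:
        # Missing or oversized robots file: fall back to the chosen policy.
        return allow_on_failure
    parser = RobotFileParser()
    # Undecodable bytes are replaced rather than raising; unknown lines are ignored.
    parser.parse(robots_body.decode("utf-8", errors="replace").splitlines())
    return parser.can_fetch(user_agent, page_url)


rules = b"User-agent: *\nDisallow: /private/\n"
blocked = is_allowed(rules, "http://www.domain.com/private/page.html")
open_page = is_allowed(rules, "http://www.domain.com/public/page.html")
```

Defaulting `allow_on_failure` to `False` is the cautious choice; setting it to `True` retrieves more pages at the risk of ignoring a site's wishes.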
Now, you think this page is not a duplicate and the robots.txt file allows you to retrieve it, but when you load the page, the web server sends you an HTTP 419 error code. As you will find out if you search for it, this is not a valid HTTP error code: it does not appear in the list of standard HTTP status codes. So what should you do with the content of this page? Is the error code just a mistake and the page is correct? Or is this an error page? For DistParser, I have decided to discard pages where the error code is not standard.
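A check like this can be sketched with Python's `http.HTTPStatus` enum. Treating that enum as the list of standard codes is an assumption of this example (it follows the registered codes, but DistParser may use a different list), and `is_standard_status` is a hypothetical helper name.

```python
from http import HTTPStatus


def is_standard_status(code: int) -> bool:
    """Return True when the code is one that http.HTTPStatus knows about."""
    try:
        HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define, such as 419.
        return False
    return True


# Pages answered with a non-standard code would be discarded.
keep_200 = is_standard_status(200)
keep_404 = is_standard_status(404)
keep_419 = is_standard_status(419)
```

Note that a standard error such as 404 passes this check; what to do with standard error pages is a separate decision.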
So you got a link, you looked at the robots file, you requested the page and retrieved it correctly. You now need to parse it. You might face many issues on the parsing side. First, the HTML might not be strict and you might have some trouble parsing it. A page can also be sent as an HTML page and not actually be one. And when you find links in the page, they might contain characters that are invalid in URLs, like pipes (|), spaces, etc. You can do some cleanup, but how can you be sure you have handled all the possible format failures? It’s very difficult to be 100% sure. One option is to check against a specific format and discard everything else, but that way you might miss some URLs. So you will have to make a decision there. For DistParser, I have decided to clean the URLs as much as possible but keep them all, to be sure to retrieve as many links as possible.
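As an illustration of the "clean but keep" approach, the sketch below percent-encodes characters such as pipes and spaces instead of rejecting the link. It is not DistParser's actual cleanup: `clean_url` and the sets of characters left untouched are assumptions, and it certainly does not cover every malformed URL.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left as-is; "%" is kept so already-encoded URLs are not encoded twice.
SAFE_IN_PATH = "/%:@!$&'()*+,;=-._~"
SAFE_IN_QUERY = SAFE_IN_PATH + "?"


def clean_url(raw: str) -> str:
    """Percent-encode characters that are not valid in a URL, keeping the link."""
    parts = urlsplit(raw.strip())
    return urlunsplit(
        (
            parts.scheme,
            parts.netloc,
            quote(parts.path, safe=SAFE_IN_PATH),
            quote(parts.query, safe=SAFE_IN_QUERY),
            quote(parts.fragment, safe=SAFE_IN_QUERY),
        )
    )


cleaned = clean_url(" http://www.domain.com/a b|c?q=x y ")
# cleaned is "http://www.domain.com/a%20b%7Cc?q=x%20y"
```

A stricter crawler would validate against a fixed pattern and drop anything else, at the cost of missing some URLs.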
Last, you will need to think a bit about the size of the data you are going to retrieve and store. Let’s imagine there is an average of 32 links per page. Just go to amazon.com and count them, you will find more than 100. For each link, we need to store, at least, the link and the associated keywords. A link is on average 32 bytes long. And for keywords, let’s say we will store almost the same thing, 32 bytes. That means, for one average page, we will have 32*32*32 = 32KB… Today, in the database, there are 3154294 URLs proposed to the crawler for parsing. This represents 96GB of data to store. Add the replication level to secure your data, the database overhead, etc. and you will quickly end up with many TB, even with only 3M URLs. In 2008, Google reported more than 1,000,000,000,000 URLs, and that was 4 years ago. I let you imagine what that represents in GB. So be prepared and scale your system based on what you want to do with it.
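The estimate above is easy to replay for other URL counts. This small sketch takes the per-page figure of 32*32*32 bytes as given and only does the multiplication; `estimate_bytes` is a hypothetical helper, and replication and database overhead are left out.

```python
BYTES_PER_PAGE = 32 * 32 * 32   # 32768 bytes, i.e. 32KB per page as estimated above
GIB = 2 ** 30


def estimate_bytes(url_count: int) -> int:
    """Raw storage for url_count pages, before replication and overhead."""
    return url_count * BYTES_PER_PAGE


today = estimate_bytes(3154294)
today_gib = today / GIB          # about 96.26, matching the 96GB above
```

Calling `estimate_bytes(1_000_000_000_000)` gives the raw figure for the Google-sized case.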
As you can see, there are many issues you will face all along the road. And there might even be issues I have not faced yet. I initially thought this would be an easy project to put in place, but I now realize it’s a bit more difficult than expected.
However, with the server side being more stable and robust, and with all the past improvements on the client side, I think I’m close to having the first release of DistParser. The next step will most probably be to work on the data processing and see what can be extracted from the database.