Today we can find many distributed applications: distributed computing like Folding@home, distributed search engines like Majestic-12, distributed file sharing like common peer-to-peer software, censorship-free browsing like Freenet, and so on.
A big advantage of a distributed application is that you get access to a lot of computing power, disk space, and bandwidth, as long as you can convince some people to join your project.
The first goal of the DistParser project is very simple: to put in place a distributed application that crawls the web and retrieves some statistics from it. Once the crawler framework is in place, many kinds of results can be extracted from the engine.
I have decided not to simply redo another search engine like those that exist today. If someone wants to search by keyword, they will simply go to Google… And to do what Google does, I would most probably need the same infrastructure and the same budget… which I don’t have. By not retrieving or storing the keywords yet, the server-side data load will be reduced. So in order to produce useful results, the application will have to retrieve many characteristics from each page and store them for reporting.
The goal of this project is to put in place a multi-platform distributed web crawler that retrieves statistics about the web overall, while keeping transfers between the clients and the server to a minimum. More features will be added later, once the framework is strong enough.
The number of pages on the web was estimated in 2011 at 1 000 000 000 000. If we say that each URL is about 64 characters long, that means that to store all the URL names alone, I will need almost 60 TB… and that is without considering all the other page characteristics stored on the server. The bandwidth needed to retrieve all those URLs will be even bigger.
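The back-of-the-envelope arithmetic behind that figure can be checked in a few lines (the page count and average URL length are the assumptions stated above, not measured values):

```python
# Storage estimate for URL names alone, under two assumptions:
#   - ~10^12 pages on the web (2011 estimate)
#   - ~64 bytes per URL on average
NUM_PAGES = 1_000_000_000_000
AVG_URL_BYTES = 64

total_bytes = NUM_PAGES * AVG_URL_BYTES
total_tib = total_bytes / 2**40  # convert bytes to tebibytes

print(f"{total_bytes:,} bytes ~= {total_tib:.1f} TiB")
# prints: 64,000,000,000,000 bytes ~= 58.2 TiB
```

So even the bare list of URL names lands just under 60 TiB, before storing a single page characteristic.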
So I have decided to start the crawler with a subset of the domains, to reduce the size of the test and ensure its success. If I am able to get some useful statistics from this test, I will most probably extend the list of domains to include a bigger subset of all internet websites.
Starting with all .net, .com, or .org domains would most probably represent too much data to retrieve. So I have arbitrarily decided to start with only the .ca domains, to reduce the size of the target.
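In practice, the crawler frontier could apply this restriction when deciding whether to enqueue a URL. A minimal sketch of such a filter (the helper name and example URLs are mine, for illustration):

```python
from urllib.parse import urlparse

def is_ca_domain(url: str) -> bool:
    """Return True if the URL's host falls under the .ca top-level domain."""
    host = urlparse(url).hostname or ""
    return host.endswith(".ca")

# Hypothetical URLs, just to show the behavior:
print(is_ca_domain("http://example.ca/page"))   # prints: True
print(is_ca_domain("http://example.com/page"))  # prints: False
```

Any discovered link that fails the check is simply dropped instead of being added to the crawl queue, which keeps both storage and bandwidth bounded to the chosen subset.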