Almost a year….

It has been a while since I last updated the website, and I apologize for that.

As you can see on the progress chart, I lost an important server a few weeks ago, which impacted the tool’s availability. It’s now resolved and parsing is back online. I have also made many small modifications on the client and server sides to fix small issues here and there.

We will soon have more than 100 million pages in the system, which is nice. The 1 billion mark might take a little while, but we are heading in the right direction.

On the hardware side, I will provision a new server before the end of April 2014, most probably with 8 cores, etc., to replace a 2-core server. Again, if you want to participate (hardware donation, running the crawler, etc.), feel free to contact me.

I will try to upload the updated client soon.

Some news…

It has been a long time since I gave you an update.

Just to show you that things are still in progress, here are some statistics.

We currently have more than 24 million pages retrieved, and more than 500 million links waiting to be retrieved. A chart has been prepared to display the page-retrieval progress. It’s not fully finished, but you can see a preview below.
Pages retrieved

Also, we are currently working on an export tool which will aggregate the data we currently have and list a tree of all the websites already parsed by the application. More to come soon…

New release 0.0.9b

I found the time to work a bit more on the client over the last few days, and therefore, here is an updated release. Just click here to go to the download page.

Here are all the improvements for this release:

  • Improvement of the URL format validation to reduce the number of calls to wrongly formatted URLs;
  • Introduction of the -ui parameter. Calling the application without this parameter will start it in command-line mode only (see 1) below);
  • The DistParser UI is now displayed dynamically, based on the number of threads configured;
  • Page language detection has been added on the client side (EN and FR so far);
  • Implementation of lateral caching for robots.txt files (see 2) below);
  • Implementation of commands (stats, exit, save) (see 3) below);
  • While waiting for the application to exit, the logs are more verbose to show the exit progress;
  • Improved timeout management, which reduces the delay between when you ask the application to exit and when it really closes.
  1. The distributed crawler now comes with an option to show or hide the user interface. If you still want the graphical user interface, simply add -ui on the command line; otherwise, it will be hidden. When the UI is hidden, you can exit the application simply by typing “exit” + enter. If you want to keep the application running in the background, you can start it using “screen”.
  2. Lateral caching has been added too. It’s useful when you are running more than one client on the same network. For each retrieved URL, the client needs to check against the robots.txt file that we are allowed to read the page. If we don’t have the robots.txt file, we need to download it from the web server. Lateral caching allows this file to be shared with the other clients on the same network, reducing bandwidth usage when another client needs the same file: the robots.txt file will then be taken directly from the cache. The client is configured to keep 128K different robots files, and robots files expire after 7 days to make sure we get recent data. If you stop one of your crawlers, you might see some exceptions in the logs because the cache will lose a node; just ignore them. A small sketch of the local side of this cache is shown right after this list.
  3. You can now use a few commands to control the crawler. You can use these commands when you start the crawler without the user interface, but you can also still use them with the graphical interface. So far only 3 commands are available, but more will come. You can use the “exit” command to close the application (it needs to complete its current work first). You can use the “save” command to force the application to save its data, and finally you can use the “cache” command to display the cache status.
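To make 2) a bit more concrete, here is a minimal sketch of the size limit and the 7-day expiry. It only covers the local side of the cache (the real lateral cache is shared between the clients on your network), and every class and method name here is illustrative, not taken from the actual client code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal, single-client sketch of the robots.txt cache limits described in 2):
// at most ~128K entries, each expiring after 7 days. Names are illustrative only.
class RobotsCache {
    private static final int MAX_ENTRIES = 128_000;
    private static final long TTL_MS = 7L * 24 * 60 * 60 * 1000;   // 7 days

    private static final class CachedRobots {
        final String robotsTxt;
        final long storedAt;
        CachedRobots(String robotsTxt, long storedAt) {
            this.robotsTxt = robotsTxt;
            this.storedAt = storedAt;
        }
    }

    // An access-ordered LinkedHashMap gives simple LRU eviction once the size limit is hit.
    private final Map<String, CachedRobots> cache =
            new LinkedHashMap<String, CachedRobots>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, CachedRobots> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    /** Returns the cached robots.txt for a host, or null if missing or older than 7 days. */
    synchronized String get(String host) {
        CachedRobots entry = cache.get(host);
        if (entry == null) return null;
        if (System.currentTimeMillis() - entry.storedAt > TTL_MS) {
            cache.remove(host);                       // expired: force a fresh download
            return null;
        }
        return entry.robotsTxt;
    }

    synchronized void put(String host, String robotsTxt) {
        cache.put(host, new CachedRobots(robotsTxt, System.currentTimeMillis()));
    }
}
```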

Also, on the server side, some cleanup has been done in the database to remove malformed retrieved URLs. Therefore, the number of links waiting to be parsed is now down to 150 402 200. Some improvements were made to the servers’ configuration too.

New release

It has been a while since I posted a new version. So here we go.

I just uploaded version 0.0.8b in the download section.

Before you try it, please make sure you set your queue size to 0 and wait for all the current work to be done and sent. The current file format and the new one are not compatible.

This new release introduces many improvements:

  • Fixed a memory leak issue;
  • Workloads are now stored only on disk and no longer in memory;
  • Results are stored on disk and no longer in memory;
  • Workload loading and saving have been greatly improved to reduce load and save times;
  • The JAR file size has been reduced by removing unused libraries;
  • A protocol version is now sent between the client and the DistParser server;
  • Overall memory usage has been reduced.

So feel free to download and test this updated version. In an upcoming release, user identification will be added, and previous versions might stop working.

Statistics

I promised some time ago to share some statistics about the data we currently have. Since the crawlers are (almost) always running, this is just a snapshot, and by the time I publish this article, it will already be different.

So as of today, there are 155 152 313 links proposed to the distributed crawlers! All those links were extracted from the 8 426 912 pages we have already received and processed.

The parsing process was running at 268 pages/minute when the project started some time ago. With some new improvements and new servers, it’s now processing more than 800 pages/minute. The goal is to reach 1000 pages/minute in the next few weeks.

Today, the number of pages waiting to be processed is 0, since we are processing the pages faster than they are submitted.

We still have in mind the project of adding some RRD graphs to display the size of the tables. We might have time to complete that in the next few weeks too. Today’s efforts are concentrated on the client side, where we found some memory leaks preventing the client from running for more than a few days in a row.

As soon as this memory leak is identified (we found and fixed one, but it seems there is another), we will post the new client. It will come with a LOT of improvements. We will talk about that in another post.

Data structure

Here are some details regarding the way the data is stored in the backend, and how it’s processed.

The goal of this model is to keep things simple, avoid unnecessary processing and facilitate data manipulation. As I have already said many times, the content retrieved from the web cannot be taken for granted. It can be corrupted, duplicated, malformed, etc. So before the results sent back by the crawlers are integrated into the main database, some processing is required to validate them.

Basically, there are 3 main tables in the system. One table, called work_proposed, simply contains all the links retrieved and approved, which are proposed to the client crawlers. Another one, called page, simply contains all the pages already parsed, received and approved, including their links, keywords, etc. And finally, page_proposed stores all the pages loaded by the crawlers and sent back to the server.

Now, let’s see how the data flows between all those tables. First, the crawler needs to get some workload. This is retrieved from the work_proposed table, which only contains URLs that need to be downloaded from the Internet, parsed and sent back to the server; nothing more than the URL is stored there, no extra fields are needed. When a page is sent back to the server, it’s stored in the page_proposed table.

A separate process then parses the page_proposed table to validate the entries. Each entry is first removed from the table, then checked for duplication against the page table. If it’s a new page never retrieved in the past, or an existing page whose content has changed, the page is added to the page table, and all the links it contains are added to the work_proposed table, if they are not already there and do not already exist in the page table.

The diagram below shows how the links and content flow through the different tables and processes. It also shows the table sizes as of today. Since there is an average of 140 links per page, if all the proposed pages are correct, more than 600 000 000 links might be added to the work_proposed table. This will be run soon and the results will be posted here. So, to estimate the number of pages already processed, we have to add the number of entries in the page table to the number of entries in the page_proposed table. And to estimate the number of pages still to be retrieved, we simply need to count the entries in the work_proposed table. A small code sketch of this validation flow follows the diagram.

Data structure and flow.

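To make the flow above a bit more concrete, here is a minimal sketch of the validation process, written against hypothetical in-memory stand-ins for the three tables. Every class and method name is illustrative, not taken from the actual DistParser code.

```java
import java.util.List;

// Hypothetical crawl result sent back by a client: the URL, a hash of the
// page content, and the links extracted from it.
class CrawledPage {
    final String url;
    final String contentHash;
    final List<String> links;
    CrawledPage(String url, String contentHash, List<String> links) {
        this.url = url; this.contentHash = contentHash; this.links = links;
    }
}

// Minimal stand-ins for the three tables described above.
interface PageProposedTable { CrawledPage pollAndRemove(); }   // pages sent back by crawlers
interface PageTable {
    String contentHashOf(String url);                          // null if the URL was never stored
    boolean contains(String url);
    void upsert(CrawledPage page);
}
interface WorkProposedTable {
    boolean contains(String url);
    void add(String url);
}

class PageValidator {
    private final PageProposedTable proposed;
    private final PageTable pages;
    private final WorkProposedTable work;

    PageValidator(PageProposedTable proposed, PageTable pages, WorkProposedTable work) {
        this.proposed = proposed;
        this.pages = pages;
        this.work = work;
    }

    /** Drains page_proposed: keeps new or changed pages and queues their links. */
    void drainOnce() {
        CrawledPage page;
        while ((page = proposed.pollAndRemove()) != null) {      // each entry is removed first
            String knownHash = pages.contentHashOf(page.url);
            boolean isNew = (knownHash == null);
            boolean hasChanged = (knownHash != null) && !knownHash.equals(page.contentHash);
            if (!isNew && !hasChanged) {
                continue;                                        // duplicate with unchanged content: drop it
            }
            pages.upsert(page);                                  // add or update the page table
            for (String link : page.links) {                     // queue only links we don't know about yet
                if (!work.contains(link) && !pages.contains(link)) {
                    work.add(link);
                }
            }
        }
    }
}
```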

Bandwidth…

Let’s talk a bit about the bandwidth usage and the related bandwidth management.

First, let’s try to describe what bandwidth is used by the tool.

Each time the client queries a web server, a few bytes are sent as the request. In return, the server sends back some data to the client.

Calculating how much bandwidth is used is easy: we just have to add the size of the packets sent to the size of the packets received, and we have the bandwidth used.
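As a rough sketch, this application-level accounting can be as simple as two counters, one for bytes written and one for bytes read. The class and method names below are illustrative, not the actual DistParser code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Application-level accounting as described above: bandwidth used =
// bytes sent for the requests + bytes received in the responses.
class BandwidthCounter {
    private final AtomicLong bytesSent = new AtomicLong();
    private final AtomicLong bytesReceived = new AtomicLong();

    void addSent(long bytes)     { bytesSent.addAndGet(bytes); }      // request line + headers + body
    void addReceived(long bytes) { bytesReceived.addAndGet(bytes); }  // response headers + body actually read

    long totalBytesUsed() { return bytesSent.get() + bytesReceived.get(); }
}
```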

Now, that’s in a perfect world. Again, the web is wide, and you will unfortunately find many other situations which make it a bit more difficult to figure out exactly what bandwidth has been used.

Let’s take a few examples.

We don’t want to retrieve non-text content yet. Let’s imagine a URL like http://domain.com/path_to_something. We don’t know yet what the content is, and the only way to know is to call it. So you send a few bytes as the request header and get a response. From the response headers, you learn that it’s a 600 MB binary file. Of course, you don’t want to retrieve it, so you send a cancellation. But by then, have you already started to receive some of the file content? Maybe. You can count it, but between the time you sent the cancellation and the time the transfer really stopped, how many packets did you receive from the server? You don’t know, and you don’t have any easy way to know.
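Here is a minimal sketch of that situation, using plain HttpURLConnection (not necessarily the HTTP stack the client uses); the size limit and names are illustrative.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Decide from the response headers whether a URL is worth downloading.
// By the time we can read those headers, the request bytes are already sent
// and some of the response body may already be on the wire.
class ContentCheck {
    private static final long MAX_BYTES = 4L * 1024 * 1024;   // illustrative page size limit

    static boolean shouldDownload(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.connect();                                        // request already counted as sent bytes
        String type = conn.getContentType();                   // e.g. "text/html" or "application/octet-stream"
        long length = conn.getContentLengthLong();             // -1 when the server does not say
        boolean isText = type != null && type.startsWith("text/");
        if (!isText || length > MAX_BYTES) {
            conn.disconnect();                                 // "cancel": bytes already in flight are hard to count
            return false;
        }
        return true;
    }
}
```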

Another example: the server you are calling is very slow. The timeout is set to 20 seconds. After that period, you send a cancellation request, but the server replies at exactly the same time. Too late, you have cancelled the request. So you will again receive some of the content, plus headers, that you will not take into account in your total.

So if you let the client run for a few hours, with all the issues it might face on the web, there is a high chance that some content was miscounted.

I tried to build the client to be as accurate as possible, but the only way to get really accurate bandwidth usage is to track it at the TCP level, and not at a high level like I’m doing.

Since this information is not critical for the application, I have decided to keep it the way it is. I will make some small adjustments to take cancelled calls into consideration, but you can expect the displayed bandwidth usage to be a bit different from what your service provider or your tcpdump shows.

The instant download speed limitation is based on this value too. So far, I have not seen any difference bigger than 10%, so the limitation is still pretty accurate and useful.
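For completeness, here is one way such an instant limit could sit on top of that kind of counter. This is only a sketch under the assumption of a simple sleep-based throttle; the names and the one-second window are not taken from the actual client.

```java
// A simple sleep-based throttle: before handling the next chunk, compare the
// observed rate over the current window with the configured limit and wait
// if we are ahead of it. A limit of 0 disables throttling. Illustrative only.
class DownloadThrottle {
    private final long maxBytesPerSecond;
    private long windowStart = System.currentTimeMillis();
    private long windowBytes = 0;

    DownloadThrottle(long maxBytesPerSecond) {
        this.maxBytesPerSecond = maxBytesPerSecond;
    }

    synchronized void onBytesReceived(long bytes) throws InterruptedException {
        if (maxBytesPerSecond <= 0) return;                        // 0 means "no limit"
        windowBytes += bytes;
        long elapsedMs = System.currentTimeMillis() - windowStart;
        long targetMs = (windowBytes * 1000) / maxBytesPerSecond;  // time the transfer should have taken
        if (targetMs > elapsedMs) {
            Thread.sleep(targetMs - elapsedMs);                    // ahead of the limit: slow down
        }
        if (System.currentTimeMillis() - windowStart >= 1000) {    // start a fresh one-second window
            windowStart = System.currentTimeMillis();
            windowBytes = 0;
        }
    }
}
```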

One million

It’s done! The system has reached its first million!

1,169,265 pages have already been parsed from the internet, with their links, keywords and other useful information. Of those, 769,636 have already been processed for duplication. From those pages, 15,521,162 links are still to be loaded by the clients.

Slowly but surely, the database is growing.

As for work in progress, I’m currently working on the server machines. I’m planning to add 2 new servers and remove one which is too slow.

I have also started working on the automated reporting, which will automatically display on this site the number of pages processed and the number of links pending retrieval.

Let’s see when we will reach 2 million…

Some new improvements.

I just uploaded a new version of the client. You will find it in the download section.

This new version introduces an “exit” button which allows you to close the application nicely. It stops all the threads and waits for them to finish parsing the links from the page they are working on. Parsing the links can take some time, since the crawler needs to retrieve the robots files to know whether those links can be added to the database or not. Some pages have more than 2000 links… As a result, this can take a while (sometimes up to 5 minutes).

So when you click on exit, the button becomes “Force” and allows you to force the application to exit by also stopping the link parsing for the pages already retrieved. This accelerates the application closure, but since the connection timeout is set to 45 seconds, it might still take up to 1m30 (45 seconds for the socket connection plus a few retries). The links currently open will be lost, but they will be re-assigned in the near future when they time out on the server side too.
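The graceful versus forced behaviour maps quite naturally onto a standard Java thread-pool shutdown. The snippet below is only a sketch of that pattern, with made-up names and timeouts; it is not the client’s actual code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the two-step exit described above: first ask the worker threads to
// finish the page they are on, then, on "Force", interrupt whatever is left.
class CrawlerShutdown {
    /** "Exit": stop taking new workload and let current pages finish (can take minutes). */
    static void exitGracefully(ExecutorService crawlerThreads) throws InterruptedException {
        crawlerThreads.shutdown();                               // no new workload is started
        crawlerThreads.awaitTermination(5, TimeUnit.MINUTES);    // wait for running pages to complete
    }

    /** "Force": also abandon link parsing; open connections still need their ~45 s timeout. */
    static void exitForced(ExecutorService crawlerThreads) throws InterruptedException {
        crawlerThreads.shutdownNow();                            // interrupt running tasks
        crawlerThreads.awaitTermination(90, TimeUnit.SECONDS);   // roughly the 1m30 worst case mentioned above
    }
}
```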

However, there is one situation where the closure can’t be accelerated: when the crawler is communicating with the server to send results or retrieve some workload. Depending on the server load, your internet bandwidth and the size of the data being sent or retrieved, this might take some time. A green light indicates that no communication is in progress. When the light turns red, it means the crawler is communicating with the server and can’t exit yet. You can still click on “exit” while the client and the server are communicating, but the crawler will not exit immediately; it will have to wait for the transfer to finish.

These few improvements are meant to give you a cleaner way to close the application, but also to make sure that no work is lost by mistake when the application is closed the wrong way.

A few other improvements are also embedded in this new release.

First, the application startup speed has been improved. Some more work will still be required there, but it’s better.
The application’s instant bandwidth limitation is more accurate than before.
You can now use 0 as the bandwidth limit to disable it.
The format used to save and restore the current workload has been improved, which will avoid some failures and lost workloads.

Here are the next features I will include in the next client:

  • User identification;
  • User interface to modify crawler options;
  • Allow user to select preferred domain extension (.com, .co.in, etc.);
  • Client version check against server version.

And the todo list on the server side:

  • Add a public form to check if a page or a domain name is in the queue.
  • Add a public form to inject a URL.

The next version will most probably move to the first minor beta release (0.1.0b), since it’s now pretty stable and working fine.

Regarding the statistics, the crawlers have retrieved 679,970 pages, from which there are still 14,177,432 links to be parsed.

Crawling issues you should expect

When you think about crawling the entire web, you might first think it’s something pretty easy and straightforward: open a page, read all the links, close the page, store the results and the page information you want to keep, then open all the links, and so on.

Basically, that’s the idea. Easy. But the further you go in this adventure, the more you will find issues and corner cases leading to situations that might cause trouble for your application.

I have already discussed loops and duplicate pages, like 2 pages in the same domain with different URLs serving the same content. A good example is when you have a session ID in the URL. There are some options to detect and filter them, but only within the same domain. What if 2 domain names are serving the exact same content, like www.domain.com serving the exact same content as www.niamod.com? You can’t verify all the pages you have already retrieved to confirm whether this is a duplicate or not. So in the end, you will still have some duplicates, and I don’t think there is any technical solution to avoid them.
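For the same-domain case, one common option (shown here purely as an illustration, not a description of what DistParser does) is to normalize URLs by dropping session-style query parameters before comparing them. The parameter list below is a guess.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: strip query parameters that usually carry session IDs,
// so two URLs for the same page compare as equal.
class UrlNormalizer {
    private static final List<String> SESSION_PARAMS =
            Arrays.asList("jsessionid", "phpsessid", "sid", "sessionid");

    static String normalize(String url) throws URISyntaxException {
        URI uri = new URI(url);
        String query = uri.getQuery();
        if (query == null) return url;
        String cleaned = Arrays.stream(query.split("&"))
                .filter(p -> !SESSION_PARAMS.contains(p.split("=", 2)[0].toLowerCase()))
                .collect(Collectors.joining("&"));
        return new URI(uri.getScheme(), uri.getAuthority(), uri.getPath(),
                cleaned.isEmpty() ? null : cleaned, uri.getFragment()).toString();
    }
}
```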

Before loading a page, you have to check the robots.txt file to figure out whether you are allowed to parse that page. Of course, you could ignore it, but it’s good practice to read it first. Now, what if the robots.txt file format is not correct? Or what if the URL for the robots file gives you an error? Are you going to allow the crawler to read the page, or is it better to skip it? Once the crawlers were already running, I even found a robots.txt file bigger than 4 MB, which is bigger than the limit I have set for all pages in the crawler, so that robots file was discarded. In the end, there are many robots.txt files you will never be able to find, retrieve, download or even parse.
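As an illustration of that decision, here is a sketch that fetches robots.txt with a size cap and falls back to a configurable default when anything goes wrong. The names, the timeouts and the “allow on failure” default are assumptions for the example, not DistParser’s actual policy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: fetch robots.txt with a hard size cap and decide what to do on failure.
// Whether to "fail open" (crawl anyway) or "fail closed" (skip the site) is a
// policy choice; the default used here is an assumption.
class RobotsFetcher {
    private static final int MAX_ROBOTS_BYTES = 4 * 1024 * 1024;   // 4 MB cap, as mentioned above
    private static final boolean ALLOW_ON_FAILURE = true;          // assumed policy

    /** Returns the robots.txt body, or null if it is missing, broken or too big. */
    static String fetch(String host) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://" + host + "/robots.txt").openConnection();
            conn.setConnectTimeout(20_000);
            conn.setReadTimeout(20_000);
            if (conn.getResponseCode() != 200) return null;
            try (InputStream in = conn.getInputStream()) {
                StringBuilder body = new StringBuilder();
                byte[] buf = new byte[8192];
                int total = 0, read;
                while ((read = in.read(buf)) != -1) {
                    total += read;
                    if (total > MAX_ROBOTS_BYTES) return null;      // oversized file: discard it
                    body.append(new String(buf, 0, read, StandardCharsets.UTF_8));
                }
                return body.toString();
            }
        } catch (IOException e) {
            return null;                                            // unreachable or malformed response
        }
    }

    static boolean mayCrawl(String host, String path) {
        String robots = fetch(host);
        if (robots == null) return ALLOW_ON_FAILURE;                // the policy decision discussed above
        return !isDisallowed(robots, path);
    }

    private static boolean isDisallowed(String robots, String path) {
        // Placeholder: a real implementation parses the User-agent / Disallow rules.
        return false;
    }
}
```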

Now, say you think the page is not a duplicate and the robots.txt file allows you to retrieve it, but when you load the page, the web server sends you an HTTP 419 error code. As you will find out if you look it up, this is not a valid HTTP status code. Here is the list of error codes. So what should you do with the content of this page? Is the error code just a mistake and the page correct? Or is this an error page? For DistParser, I have decided to discard pages whose status codes are not standard.
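A trivial sketch of that rule, using a hand-maintained set of standard status codes (the set below is abbreviated for illustration, and the names are made up):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Discard responses whose status code is not a standard HTTP code.
// The set below is abbreviated; a real list would cover all registered codes.
class StatusCodeFilter {
    private static final Set<Integer> STANDARD_CODES = new HashSet<>(Arrays.asList(
            200, 201, 202, 203, 204, 205, 206,
            301, 302, 303, 304, 307, 308,
            400, 401, 403, 404, 405, 406, 408, 410, 414, 429,
            500, 501, 502, 503, 504, 505));

    static boolean keepPage(int statusCode) {
        return STANDARD_CODES.contains(statusCode);   // e.g. 419 is rejected
    }
}
```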

So you get a link, you check the robots file, you call the page and retrieve it correctly. You now need to parse it, and you might face many issues on the parsing side. First, the HTML might not be strict and you might have some trouble parsing it. A page can also be served as an HTML page without actually being HTML. And when you find links in the page, they might contain characters that are invalid in URLs, like pipes (|), spaces, etc. You can do some cleanup, but how can you be sure you have handled all the possible format failures? It’s very difficult to be 100% sure. One option is to check against a specific format and discard everything else, but that way you might miss some URLs. So you will have to make a decision there. For DistParser, I have decided to clean the URLs as much as possible but keep them all, to be sure to retrieve as many links as possible.
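Here is a minimal sketch of that kind of cleanup, percent-encoding a few characters that commonly show up in scraped links. The character list and names are illustrative, not the client’s actual rules.

```java
// Illustration only: percent-encode a few characters that are invalid in URLs
// but frequently appear in links scraped from real pages. A production cleaner
// would handle many more cases (and still not all of them).
class UrlCleaner {
    static String clean(String rawUrl) {
        return rawUrl.trim()
                .replace(" ", "%20")
                .replace("|", "%7C")
                .replace("\"", "%22")
                .replace("<", "%3C")
                .replace(">", "%3E");
    }
}
```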

Last, you will need to think a bit about the size of the data you are going to retrieve and store. Let’s imagine there is an average of 32 links per page. Just go to amazon.com and count them; you will find more than 100. For each link, we need to store, at least, the link itself and the associated keywords. A link is, on average, 32 bytes long, and for the keywords, let’s say we store about the same again, 32 bytes. That means, for one average page, we will have 32 × 32 × 32 bytes = 32 KB… Today, the database holds 3,154,294 URLs proposed to the crawler for parsing. This represents 96 GB of data to store. Add the replication level needed to secure your data, the database overhead, etc., and you will quickly end up with many TB even with only 3M URLs. In 2008, Google announced more than 1,000,000,000,000 URLs, and that was 4 years ago. I let you imagine what that represents in GB. So be prepared, and scale your system based on what you want to do with it.
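Spelling the same back-of-the-envelope estimate out as a runnable snippet (the per-page figure and URL count are taken from the paragraph above; the class name is made up):

```java
// Back-of-the-envelope storage estimate, using the figures from the post above.
class StorageEstimate {
    public static void main(String[] args) {
        long bytesPerPage = 32L * 32 * 32;              // ~32 KB per average page, as estimated above
        long proposedUrls = 3_154_294L;                 // URLs currently proposed to the crawler
        long totalBytes = bytesPerPage * proposedUrls;  // ~103 billion bytes
        System.out.printf("~%.1f GB before replication and database overhead%n",
                totalBytes / (1024.0 * 1024 * 1024));   // prints roughly 96 GB
    }
}
```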

As you can see, there are many issues you will face along the road, and there might even be issues I have not faced yet. I initially thought this would be an easy project to put in place, but I now realize it’s a bit more difficult than expected.

However, with the server side being more stable and robust, and with all the past improvements on the client side, I think I’m close to having the first release of DistParser. The next step will most probably be to work on the data processing and see what can be extracted from the database.