New release

It has been a while since I posted a new version, so here we go.

I just uploaded version 0.0.8b to the download section.

Before you try it, please make sure you set your queue size to 0 and wait for all current work to be completed and sent: the current file format and the new one are not compatible.
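
For clarity, here is a minimal sketch of that upgrade procedure. The CrawlerClient interface and its method names are invented for illustration; they are not the actual DistParser client API.

    // Hypothetical drain-before-upgrade procedure; CrawlerClient is an
    // invented interface, not the real DistParser client API.
    interface CrawlerClient {
        void setQueueSize(int size);
        boolean hasPendingWork();
        void flushResults();
    }

    class UpgradeHelper {
        static void drainBeforeUpgrade(CrawlerClient client) throws InterruptedException {
            client.setQueueSize(0);           // stop fetching new workloads
            while (client.hasPendingWork()) { // let in-flight work finish
                Thread.sleep(5_000);
            }
            client.flushResults();            // send remaining results, then upgrade
        }
    }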

This new release introduces many improvements:

  • Fixed a memory leak issue;
  • Workloads are now kept on disk only and no longer in memory (see the sketch right after this list);
  • Results are stored on disk and no longer in memory;
  • Workload loading and saving have been greatly improved, reducing load and save times;
  • The JAR file size has been reduced by removing unused libraries;
  • A defined protocol is now used between the client and the DistParser server (see the handshake sketch at the end of this section);
  • Overall memory usage has been reduced.
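
To make the disk-backed workload change concrete, here is a minimal sketch of the idea, not the actual DistParser implementation: entries are appended to a file and streamed back lazily, so the heap never holds the full queue. The file name and the one-URL-per-line format are assumptions made for illustration.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.stream.Stream;

    // Sketch of a disk-backed workload queue: URLs are appended to a file
    // and read back as a lazy stream, so heap usage stays flat no matter
    // how large the queue grows. Names and format are illustrative.
    public class DiskWorkload {
        private static final Path FILE = Paths.get("workload.txt");

        static void append(String url) throws IOException {
            Files.write(FILE, (url + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        static void processAll() throws IOException {
            try (Stream<String> urls = Files.lines(FILE)) { // reads line by line
                urls.forEach(url -> System.out.println("parsing " + url));
            }
        }
    }

Streaming from disk instead of loading everything up front is what keeps memory usage flat, which also explains the reduced overall memory footprint mentioned above.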

So feel free to download and test this updated version. In an upcoming release, user identification will be added, and previous versions may stop working.
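
Regarding the new client/server protocol mentioned in the list above, the sketch below shows the general shape of a version handshake. The host name, port and message layout are assumptions made for illustration; the actual DistParser wire format is not documented here.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.net.Socket;

    // Illustrative version handshake; host, port and message layout are
    // assumptions, not the actual DistParser protocol.
    public class HandshakeSketch {
        public static void main(String[] args) throws IOException {
            try (Socket socket = new Socket("distparser.example.org", 4242);
                 DataOutputStream out = new DataOutputStream(socket.getOutputStream());
                 DataInputStream in = new DataInputStream(socket.getInputStream())) {
                out.writeUTF("DISTPARSER/0.0.8b");  // announce client version
                String server = in.readUTF();       // server replies with its own
                if (!"DISTPARSER/0.0.8b".equals(server)) {
                    throw new IOException("Incompatible server protocol: " + server);
                }
            }
        }
    }

Exchanging a version string up front is one common way to make mismatched client and server versions fail fast instead of silently exchanging incompatible data.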

Statistics

Some time ago I promised to share some statistics about the data we currently have. Since the crawlers are (almost) always running, this is just a snapshot; by the time I publish this article, the numbers will already be different.

So as of today, there are 155,152,313 links proposed to the distributed crawlers! All those links were extracted from the 8,426,912 pages we have already received and processed.

Page parsing was running at 268 pages/minute when the project started some time ago. With some new improvements and new servers, it's now processing more than 800 pages/minute. The goal is to reach 1,000 pages/minute in the next few weeks.

Today, the number of pages waiting to be processed is 0, since we are processing pages faster than they are submitted.

We still have in mind the project of adding some RRD graphs to display the size of the tables. We might have time to complete that in the next few weeks too. Today's efforts are concentrated on the client side, where we found some memory leaks preventing the client from running for more than a few days in a row.

As soon as this memory leak is identified (we found and fixed one; it seems there is another), we will post the new client. It will come with a LOT of improvements. We will talk about that in another post.