A while ago I promised to share some statistics about the data we currently have. Since the crawlers are (almost) always running, this is just a snapshot: by the time I publish this article, the numbers will already be different.
So as of today, there are 155 152 313 links proposed to the distributed crawlers! All of those links were extracted from the 8 426 912 pages we have already received and processed.
When the project started some time ago, pages were being parsed at 268 pages/minute. With some improvements and new servers, we are now processing more than 800 pages/minute. The goal is to reach 1000 pages/minute in the next few weeks.
Today, the number of pages waiting to be processed is 0, since we are processing pages faster than they are submitted.
We still have in mind the project of adding some RRD graphs to display the size of the tables. We might have time to complete that in the next few weeks too. Today's efforts are concentrated on the client side, where we found some memory leaks preventing the client from running for more than a few days in a row.
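We have not settled on the exact setup for those RRD graphs yet, but as a rough sketch, an rrdtool round-robin database tracking a table's row count could be defined like this (the step, heartbeat, retention and table name are all assumptions, not the final design):

```shell
# Create an RRD that samples a table's row count every 5 minutes (assumed step),
# keeping one day of raw samples and a year of daily averages.
rrdtool create links_table.rrd --step 300 \
    DS:rows:GAUGE:600:0:U \
    RRA:AVERAGE:0.5:1:288 \
    RRA:AVERAGE:0.5:288:365

# Feed it the current row count (the value would come from a COUNT query
# against the links table).
rrdtool update links_table.rrd N:155152313
```

A cron job running the update command, plus a periodic `rrdtool graph` call, would be enough to get the size graphs onto a status page.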
As soon as this memory leak is identified (we found and fixed one, but it seems there is another), we will post the new client. It will come with a LOT of improvements; we will talk about those in another post.
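The post does not say how we are hunting these leaks, and the client's language is not mentioned here, but for illustration, this is the kind of snapshot-diffing approach one can use in Python with the standard `tracemalloc` module. The `leaky_parse` function is a made-up example of a leak (a mutable default argument that silently retains every page):

```python
import tracemalloc

def leaky_parse(page, cache=[]):
    # Hypothetical bug: the default list is shared across calls,
    # so every parsed page is retained forever.
    cache.append(page * 100)
    return len(page)

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Simulate the client processing a batch of pages.
for i in range(1000):
    leaky_parse("page-%d " % i)

after = tracemalloc.take_snapshot()

# Allocations whose size keeps growing between snapshots point at the leak;
# the top entry here is the cache.append line.
stats = after.compare_to(before, "lineno")
for stat in stats[:3]:
    print(stat)
```

Taking two snapshots far apart while the client runs, then sorting the diff by `size_diff`, quickly narrows a leak down to a file and line.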