Stream parsing improvement.

First, you will notice that I have decided to be more creative with the titles 😉 From now on, I will try to use the main subject of each post as its title.

Over the weekend, I worked on server performance. It was way too slow.

First, I moved the server and the database to a new machine about 8 times faster than the previous one, with 8 times the memory. That will help increase the number of crawlers the server can support: I had estimated that the old server could handle 60 participants, so I now expect the new one to handle up to 480.

I also worked on the part of the code that receives the stream from the crawler and prepares it for processing. The initial throughput on the dev machine was 94,315 pages handled per second; after optimization, it can handle 366,709. I have looked at the code so many times that I think there is no other way to improve it. The size of the transferred stream has been reduced by only a few bytes, but in the end the parsing is almost 4 times faster, which is a very good improvement.
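The actual parsing code isn't shown here, so the following is only a minimal Go sketch (not necessarily the project's language) of the kind of change that typically produces this sort of speedup: scanning the received bytes in place with one reused buffer instead of splitting the stream into freshly allocated strings. The newline-separated record format and the `handlePage` function are assumptions for illustration, not the project's real protocol.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// handlePage stands in for whatever the server does with one entry of the
// stream; the real record format is an assumption here.
func handlePage(url []byte) { _ = url }

// parseStream scans newline-separated entries in place. sc.Bytes() returns a
// view into one reused buffer, so there is no per-entry string allocation,
// which is often the difference between ~100k and several 100k entries/second.
func parseStream(r io.Reader) (int, error) {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 64<<10), 64<<10) // one 64 KiB buffer for the whole stream
	n := 0
	for sc.Scan() {
		if line := sc.Bytes(); len(line) > 0 {
			handlePage(line)
			n++
		}
	}
	return n, sc.Err()
}

func main() {
	n, _ := parseStream(strings.NewReader("http://a.example/\nhttp://b.example/\n"))
	fmt.Println("entries:", n)
}
```

Cutting per-entry allocations also matches the observation above that the stream itself only shrank by a few bytes: the win comes from how the bytes are handled, not from sending less data.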

On another front, I built the migration procedure to move the existing database to the new schema. It takes about one hour to process the table, so there will be a one-hour downtime for the server sometime in the middle of next week. I'm also expecting a big improvement on that side: for now, it takes up to 3 seconds to submit the parsing results for 500 pages and retrieve more work. I will see what the improvement turns out to be, but my goal is to get under one second for the two operations combined.
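The post doesn't say where those 3 seconds go, but a classic fix for the submit half of that latency is to write the whole batch in one statement inside one transaction instead of one round trip per row. Below is a hedged Go sketch of that idea as a library-style snippet (it needs a real SQL driver to run); the `parsed_urls` table and its column are hypothetical, since the real schema isn't described.

```go
package crawler

import (
	"database/sql"
	"fmt"
	"strings"
)

// insertResultsBatched inserts all parsed URLs in a single multi-row INSERT
// inside one transaction, rather than one statement per row.
func insertResultsBatched(db *sql.DB, urls []string) error {
	if len(urls) == 0 {
		return nil
	}
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the commit below succeeds

	// Build "INSERT INTO parsed_urls (url) VALUES (?),(?),..." once for the batch.
	placeholders := strings.TrimRight(strings.Repeat("(?),", len(urls)), ",")
	args := make([]any, len(urls))
	for i, u := range urls {
		args[i] = u
	}
	query := fmt.Sprintf("INSERT INTO parsed_urls (url) VALUES %s", placeholders)
	if _, err := tx.Exec(query, args...); err != nil {
		return err
	}
	return tx.Commit()
}
```

For a 500-page batch this replaces 500 round trips to the database with one, which is usually enough on its own to move from seconds to well under one second.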

Regarding the ToDo list from the last post, here is what still needs to be done:

  • Display daily and monthly totals for the bandwidth
  • Limit the instantaneous bandwidth usage (see the sketch below)
  • Enforce a monthly bandwidth threshold
  • Add a graph with the number of pages retrieved/parsed and the domains found
  • Save the bandwidth usage when the application is closed, so the daily total survives restarts

All of that is on the client side. I will first need to complete the server-side improvements before I take care of it, so I'm not expecting any of these items to be done before next weekend.
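For the throttling item, a token bucket wrapped around the download stream is the standard approach. The sketch below is my own illustration in Go, assuming the crawler reads pages through an io.Reader; it is not the application's actual code.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"time"
)

// throttledReader caps read throughput at roughly `limit` bytes per second
// using a simple token bucket.
type throttledReader struct {
	r      io.Reader
	limit  float64   // bytes per second
	tokens float64   // currently available byte budget
	last   time.Time // last refill
}

func newThrottledReader(r io.Reader, bytesPerSec float64) *throttledReader {
	return &throttledReader{r: r, limit: bytesPerSec, tokens: bytesPerSec, last: time.Now()}
}

func (t *throttledReader) Read(p []byte) (int, error) {
	// Refill the bucket from elapsed time, banking at most one second of budget.
	now := time.Now()
	t.tokens += now.Sub(t.last).Seconds() * t.limit
	t.last = now
	if t.tokens > t.limit {
		t.tokens = t.limit
	}
	// Decide how much this read may consume, and sleep until it is available.
	want := float64(len(p))
	if want > t.limit {
		want = t.limit
	}
	if t.tokens < want {
		wait := time.Duration((want - t.tokens) / t.limit * float64(time.Second))
		time.Sleep(wait)
		t.tokens = want
	}
	if int(want) < len(p) {
		p = p[:int(want)]
	}
	n, err := t.r.Read(p)
	t.tokens -= float64(n)
	return n, err
}

func main() {
	// Drain ~3 KiB at ~1 KiB/s; this takes about two seconds.
	tr := newThrottledReader(strings.NewReader(strings.Repeat("x", 3072)), 1024)
	n, _ := io.Copy(io.Discard, tr)
	fmt.Println("read", n, "bytes")
}
```

The same bucket counter doubles as a running bandwidth total, which would also cover the daily/monthly accounting items in the list above.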

So far, from what I can see in the logs, 2 clients are running: one here on the dev machine, and another one, which is also me 😉 So feel free to download the application and run it if you want to participate.

I'm not publishing a new version today because the client is not working any more after the server modifications. So please use the version available in the Download section.

And to conclude, here are the statistics after almost one week of parsing: the crawlers retrieved 2,944,614 distinct URLs and found 26,169 domains. 888,279 of the retrieved links (about 30%) have already been parsed.
