Some new improvements.

I just uploaded a new version of the client. You will find it in the download section.

This new version introduces an “exit” button which lets you close the application cleanly. It stops all the threads and waits for them to finish parsing the links from the page they are currently working on. Parsing the links can take a while, since the crawler needs to retrieve the robots.txt files to know whether those links can be added to the database or not. Some pages have more than 2,000 links… As a result, this can take some time (sometimes up to 5 minutes).
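To give an idea of how this works, here is a rough Python sketch with made-up names (not the client’s actual code): the exit button just sets a stop flag, and each worker thread finishes its current page, robots.txt lookups included, before it checks that flag.

    import re
    import threading
    import urllib.robotparser
    from urllib.parse import urlparse

    stop_requested = threading.Event()     # set when the user clicks "exit"
    link_queue = []                        # stand-in for the database queue

    def allowed_by_robots(url):
        """Fetch the site's robots.txt to decide whether the link may be queued."""
        base = urlparse(url)
        parser = urllib.robotparser.RobotFileParser(
            "{}://{}/robots.txt".format(base.scheme, base.netloc))
        parser.read()                      # one extra request per host
        return parser.can_fetch("*", url)

    def worker(pages):
        for html in pages:
            # naive stand-in for the real link extraction
            links = re.findall(r'href="(https?://[^"]+)"', html)
            # Even after "exit" is clicked, the current page is parsed to the end,
            # robots.txt lookups included, which is why shutdown can take minutes
            # on pages with thousands of links.
            link_queue.extend(l for l in links if allowed_by_robots(l))
            if stop_requested.is_set():
                return                     # graceful exit between pages

    # On "exit": signal every worker, then wait for each to finish its page.
    # stop_requested.set(); [t.join() for t in worker_threads]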

So when you click on “exit”, the button becomes “Force” and lets you force the application to exit, stopping even the link parsing for the pages already retrieved. This speeds up the shutdown, but since the connection timeout is set at 45 seconds, it might still take up to 1m30 (45 seconds for the socket connection plus a retry or two). The links currently being processed will be lost, but they will be re-assigned in the near future, once they time out on the server side too.
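Roughly speaking (again a hypothetical sketch, reusing allowed_by_robots from above), the “Force” click sets a second flag that the parsing loop checks between links, while the 45-second socket timeout bounds how long any single stuck request can hold up the shutdown:

    import socket
    import threading

    socket.setdefaulttimeout(45)           # a stuck request gives up after 45 s, so even
                                           # a forced exit can take ~1m30 with a retry

    force_requested = threading.Event()    # set when the "Force" button is clicked

    def parse_links(links):
        accepted = []
        for link in links:
            if force_requested.is_set():
                break                      # drop the rest of the page; the server will
                                           # re-assign these links once they time out there
            if allowed_by_robots(link):    # from the sketch above
                accepted.append(link)
        return accepted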

However, there is one situation where the shutdown can’t be accelerated: when the crawler is communicating with the server to send results or retrieve more workload. Depending on the server load, your internet bandwidth and the size of the data being sent or retrieved, this might take some time. A green light indicates that no communication is in progress; when the light turns red, it means the crawler is talking to the server and can’t exit yet. You can still click on “exit” while the client and the server are communicating, but the crawler will not exit right away. It has to wait for the transfer to finish.
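One simple way to picture that indicator (a sketch only, with made-up function names) is a lock held around every exchange with the server; the exit path just waits until the lock is free again:

    import threading

    comm_lock = threading.Lock()           # held while the client talks to the server

    def exchange_with_server(send_results, fetch_workload):
        with comm_lock:                    # the light turns red for the duration
            send_results()                 # upload parsed results
            return fetch_workload()        # download the next batch of URLs

    def shutdown():
        # Clicking "exit" during a transfer just blocks here until the lock is
        # released (light back to green); then the rest of the shutdown proceeds.
        with comm_lock:
            pass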

These few improvements are meant to give you a cleaner way to close the application, but also to make sure that no work is lost by mistake when the application is closed the wrong way.

A few other improvements are also included in this new release:

  • The application startup speed has been improved. Some more work is still needed there, but it’s already better.
  • The instant bandwidth limitation is more accurate than before.
  • You can now use 0 as the bandwidth limit to disable it (see the sketch after this list).
  • The format used to save and restore the current workload has been improved. This will avoid some failures and workload loss.
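For the bandwidth limit, the “0 means unlimited” convention can be implemented by simply skipping the pacing sleep when the limit is zero. Here is a hypothetical sketch, not necessarily how the client does it:

    import time

    def throttled_recv_all(sock, limit_bytes_per_s, chunk=8192):
        """Receive everything from a socket, pacing reads to stay under the limit.
        A limit of 0 disables throttling entirely."""
        buf = bytearray()
        start = time.monotonic()
        while True:
            data = sock.recv(chunk)
            if not data:
                return bytes(buf)
            buf.extend(data)
            if limit_bytes_per_s > 0:                      # 0 means "no limit"
                expected = len(buf) / limit_bytes_per_s    # seconds the transfer should have taken
                behind = expected - (time.monotonic() - start)
                if behind > 0:
                    time.sleep(behind)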

Here are the features I plan to include in the next client:

  • User identification;
  • User interface to modify crawler options;
  • Allow user to select preferred domain extension (.com, .co.in, etc.);
  • Client version check against the server version.

And the todo list on the server side:

  • Add a public form to check whether a page or a domain name is in the queue.
  • Add a public form to inject a URL.

The next version will most probably move to the first minor beta release (0.1.0b), since the client is now pretty stable and works fine.

Regarding the statistics, the crawlers have retrieved 679,970 pages so far, and 14,177,432 links are still waiting to be parsed.
