I found the time to work a bit more on the client over the last few days, so here is an updated release. Just click here to go to the download page.
Here are all the improvements for this release:
- Improved URL format validation to reduce the number of calls to wrongly formatted URLs;
- Introduced the -ui parameter. Starting without it runs the application in command-line mode only (see 1) below);
- The DistParser UI is now displayed dynamically based on the number of configured threads;
- Added page language detection on the client side (EN and FR so far);
- Implemented lateral caching for robots.txt files (see 2) below);
- Implemented commands (stats, exit, save) (see 3) below);
- Logs are now more verbose while the application is shutting down, to show the exit progress;
- Improved timeout management, which reduces the delay between asking the application to exit and it actually closing.
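To give an idea of what the URL format validation could look like, here is a minimal sketch in Python (the client itself may well be written in another language, and the exact rules it applies are not described in the post — the checks below are assumptions):

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    """Reject obviously malformed URLs before queuing them for download."""
    try:
        parts = urlparse(url)
    except ValueError:
        return False
    # Only crawl http(s) URLs, and require a host that contains a dot.
    if parts.scheme not in ("http", "https"):
        return False
    if not parts.netloc or "." not in parts.netloc:
        return False
    return True
```

Filtering like this before a URL ever reaches the download queue is what cuts down the number of calls made to wrongly formatted addresses.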
1) The distributed crawler now comes with an option to show or hide the user interface. If you still want the graphical user interface, simply add -ui on the command line; otherwise it stays hidden. When the UI is hidden, you can exit the application simply by typing “exit” + enter. If you want to keep the application running in the background, you can start it using “screen”.
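A rough sketch of this behavior in Python (the flag name -ui comes from the post; the function names and the headless loop are illustrative assumptions):

```python
import argparse

def parse_args(argv):
    # "-ui" is the only documented flag; the rest of the setup is a sketch.
    parser = argparse.ArgumentParser(description="distributed crawler client (sketch)")
    parser.add_argument("-ui", action="store_true",
                        help="show the graphical interface; omit to run headless")
    return parser.parse_args(argv)

def headless_loop(read_line):
    # Without the UI, typing "exit" on stdin is the clean way to stop.
    while True:
        command = read_line().strip().lower()
        if command == "exit":
            return "shutting down"
```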
2) Lateral caching has been added too. It’s useful when you are running more than one client on the same network. For each retrieved URL, the client needs to check against the site’s robots.txt file that we are allowed to read that URL. If we don’t have the robots.txt file yet, we need to download it from the server. Lateral caching allows sharing this file with the other clients on the same network, reducing bandwidth usage when another client needs the same file: the robots.txt file will be taken directly from the cache. The client is configured to keep 128K different robots.txt files, and they expire after 7 days to make sure we work with recent data. If you stop one of your crawlers, you might see some exceptions in the logs because the cache loses a node. Just ignore them.
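The expiry and capacity policy described above can be sketched as a small LRU cache. The 128K-entry and 7-day figures come from the post; everything else (class and method names, and the fact that the network-sharing layer is left out) is an assumption for illustration:

```python
import time
from collections import OrderedDict

class RobotsCache:
    """LRU cache of robots.txt bodies with a freshness window (sketch only;
    the real client also shares entries laterally with other clients on the
    same network, which is not shown here)."""

    def __init__(self, max_entries=128 * 1024, ttl_seconds=7 * 24 * 3600,
                 clock=time.time):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = OrderedDict()  # host -> (fetched_at, robots_body)

    def get(self, host):
        entry = self._entries.get(host)
        if entry is None:
            return None
        fetched_at, body = entry
        if self.clock() - fetched_at > self.ttl:
            # Expired after 7 days: force a re-download to get recent data.
            del self._entries[host]
            return None
        self._entries.move_to_end(host)  # mark as recently used
        return body

    def put(self, host, body):
        self._entries[host] = (self.clock(), body)
        self._entries.move_to_end(host)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```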
3) You can now use a few commands to control the crawler. They are available when you start it without the user interface, but you can still use them with the graphical interface too. So far only 3 commands are available, but more will come. You can use the “exit” command to close the application (it needs to complete the current work first), the “save” command to force the application to save its data, and finally the “cache” command to display the cache status.
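A simple way to wire up such commands is a dispatch table. The command names come from the post; the handler names on the crawler object are assumptions:

```python
def make_dispatcher(crawler):
    # Maps command names to handlers; the handler names are hypothetical.
    commands = {
        "exit": crawler.request_shutdown,    # finish current work, then close
        "save": crawler.save_state,          # force a data save
        "cache": crawler.print_cache_status, # display the cache status
    }

    def dispatch(line):
        handler = commands.get(line.strip().lower())
        if handler is None:
            return False  # unknown command, ignore it
        handler()
        return True

    return dispatch
```

New commands can then be added by registering one more entry in the table, without touching the input loop.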
Also, on the server side, some cleanup has been done in the database to remove malformed URLs there too. As a result, the number of links waiting to be parsed is now down to 150,402,200. Some improvements were made to the server configuration as well.