A lot has happened over the last few days on the DistParser application.
First, again, about performance. The server is now able to serve new workloads and handle results in less than half a second (about 200 ms). This is WAY faster than the 40-second response time from a week ago. I don’t think there is much more that can be improved for now. When the size of the data grows, I will move to a clustered architecture. But for now, the regular database can still handle the load. So I might not have to talk about performance for some time.
On the client side, I made many updates to the application.
Everything from the last to-do list has been done, and even more:
- Robots.txt query and parsing;
- Robots.txt caching;
- Display daily and monthly totals for the bandwidth;
- Limit the instant bandwidth usage;
- Limit the monthly threshold;
- Limit the download/upload speed;
- Save the bandwidth usage when the application is closed to have a daily total;
- Validate URLs format;
- Add logs.
First, regarding the robots.txt rules. The crawler now downloads the robots.txt file before querying any URL, and each requested URL is validated against this file. Also, to reduce bandwidth usage, the robots.txt file is stored locally in a disk cache. The cache parameters are not yet configurable but will be in the near future.
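To make the idea concrete, here is a minimal sketch in Python of a robots.txt check backed by a disk cache. It is not the DistParser code: the cache directory, the 24-hour expiry, the `DistParser` user-agent string and the function names are all assumptions made for the illustration, and a failed download is simply treated as an empty robots.txt.

```python
import hashlib
import time
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

CACHE_DIR = Path("robots_cache")   # hypothetical cache location
CACHE_TTL_SECONDS = 24 * 3600      # hypothetical expiry, not taken from the post


def download_robots(robots_url):
    """Fetch robots.txt over the network; any failure yields an empty file (assumption)."""
    try:
        with urlopen(robots_url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        return ""


def load_robots(url, fetch=download_robots, cache_dir=CACHE_DIR, ttl=CACHE_TTL_SECONDS):
    """Return the robots.txt text for the host of `url`, reusing the disk cache when fresh."""
    parts = urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    # One cache file per robots.txt URL, named after a hash of that URL.
    name = hashlib.sha256(robots_url.encode("utf-8")).hexdigest() + ".txt"
    cache_file = Path(cache_dir) / name
    if cache_file.exists() and time.time() - cache_file.stat().st_mtime < ttl:
        return cache_file.read_text(encoding="utf-8")
    text = fetch(robots_url)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text


def is_allowed(url, user_agent="DistParser", **cache_options):
    """Check `url` against the (possibly cached) robots.txt rules of its host."""
    parser = RobotFileParser()
    parser.parse(load_robots(url, **cache_options).splitlines())
    return parser.can_fetch(user_agent, url)
```

With this layout, the crawler would call `is_allowed(url)` before each request, and only the first URL of a given host within the expiry window costs a robots.txt download.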
Regarding the bandwidth usage, the application now keeps track of all the bandwidth used, both upload and download, and displays the daily and monthly usage in the title bar. The monthly billing period can be configured to start on a specific day of the month. Daily and monthly usage limits are now in place too. So if you configure the tool correctly, you can let it run as often as you want without the risk of going over the bandwidth allowed by your provider. The bandwidth usage is stored in a local file: each time you close the crawler, the information is saved, and each time you open it again, the information is restored. So you will always see accurate numbers.
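The accounting described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the class and function names, the JSON file format and the rule that a limit is reached once the total is greater than or equal to it are all hypothetical, and the billing start day is assumed to be between 1 and 28 so that it exists in every month.

```python
import json
from datetime import date
from pathlib import Path


def billing_period_start(today, start_day):
    """First day of the billing period containing `today` (start_day assumed 1..28)."""
    if today.day >= start_day:
        return today.replace(day=start_day)
    # Before the start day: the current period began in the previous month.
    if today.month > 1:
        return date(today.year, today.month - 1, start_day)
    return date(today.year - 1, 12, start_day)


class BandwidthTracker:
    """Keeps per-day byte counts and persists them to a local JSON file."""

    def __init__(self, path, start_day=1, daily_limit=None, monthly_limit=None):
        self.path = Path(path)
        self.start_day = start_day
        self.daily_limit = daily_limit
        self.monthly_limit = monthly_limit
        # Restore the counts saved by a previous run, if the file exists.
        self.usage = json.loads(self.path.read_text()) if self.path.exists() else {}

    def record(self, byte_count, today=None):
        """Add uploaded or downloaded bytes to the total of the given day."""
        key = (today or date.today()).isoformat()
        self.usage[key] = self.usage.get(key, 0) + byte_count

    def daily_total(self, today=None):
        return self.usage.get((today or date.today()).isoformat(), 0)

    def monthly_total(self, today=None):
        today = today or date.today()
        start = billing_period_start(today, self.start_day).isoformat()
        # ISO dates (YYYY-MM-DD) sort in chronological order as plain strings.
        return sum(n for day, n in self.usage.items() if start <= day <= today.isoformat())

    def can_transfer(self, today=None):
        """False once the daily or the monthly limit has been reached."""
        if self.daily_limit is not None and self.daily_total(today) >= self.daily_limit:
            return False
        if self.monthly_limit is not None and self.monthly_total(today) >= self.monthly_limit:
            return False
        return True

    def save(self):
        """Write the counts to disk, e.g. when the application closes."""
        self.path.write_text(json.dumps(self.usage))
```

Storing one counter per day, rather than a single running total, is what makes a configurable billing start day cheap: the monthly figure is just the sum of the days since the period began.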
On the database side, since there was no URL validation, many malformed URLs were stored and were generating crawler work that could be avoided. So a URL format validation has been added. A database cleanup has been done, which greatly reduced the number of URLs stored. Also, a defect on the storage side had corrupted some URL identifiers. I had to delete those entries too.
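The post does not list the exact validation rules, so the following Python sketch only shows what a basic format check might look like. The function name and the rules themselves (http/https only, a non-empty host, no whitespace, a numeric port) are assumptions, and the check is deliberately strict.

```python
from urllib.parse import urlsplit


def is_valid_url(url):
    """Return True for well-formed http(s) URLs; a hypothetical, deliberately strict check."""
    if not url or any(ch.isspace() for ch in url):
        return False
    try:
        parts = urlsplit(url)
        parts.port  # accessing the port raises ValueError when it is not a valid number
    except ValueError:
        return False
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host:
        return False
    # Reject empty or oversized host labels such as "example..com".
    return all(label and len(label) <= 63 for label in host.split("."))
```

Running such a check before inserting a URL keeps entries like `http:///nohost` or `example.com/page` (no scheme) out of the database, so no crawler time is spent on them.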
As a result, the database now stores 2 212 550 URLs.
Many other small updates and improvements were made too, like the addition of logs to figure out what the application is doing.
So for the next release, here are the new features I would like to add:
- Allow robots.txt disk cache size configuration;
- Build an interface to see/update parameters;
- Build an interface to see crawler statistics (Cache info, bandwidth, etc.);
- Add a graph with the number of pages retrieved/parsed and the domains found;
- Improve Domain vs URL split rules (to handle the ? split when ? is not present);
- Rename application packages.