Retrieving data from the internet quickly is fine, but if storing it takes twice as long, it’s useless…
Over the last few days I worked a lot on performance improvements. I was able to make good progress, and I have a solid plan for the future.
I have identified three areas for improvement in the current architecture:
- Disk access is very slow;
- Database inserts/updates are slow;
- The network connection can be improved.
To address the disk speed issue, the database has just been moved to an SSD. Access time is about 100 times faster and transfer rates about 10 times faster, so this should be a big improvement. I have also changed the way data is inserted into the database. And last, the server’s network connection will be switched from wireless to wired.
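The post doesn’t say exactly how the inserts were changed, but a common fix for slow database writes is to batch many rows into a single transaction instead of committing one row at a time. Here is a minimal sketch of that idea using SQLite; the table and column names are made up for illustration:

```python
import sqlite3

# In-memory database and a hypothetical links table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT, domain TEXT)")

links = [(f"http://example.com/page{i}", "example.com") for i in range(1000)]

# Slow pattern: one implicit transaction (and one disk sync) per INSERT.
# for url, domain in links:
#     conn.execute("INSERT INTO links VALUES (?, ?)", (url, domain))
#     conn.commit()

# Faster pattern: all rows inserted inside a single transaction.
with conn:
    conn.executemany("INSERT INTO links VALUES (?, ?)", links)

count = conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]
```

On spinning disks especially, the per-commit sync dominates, which is why batching and the SSD move compound each other.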
With the first two modifications, the time for a client to submit results went down from 40 seconds to less than one second. Of that second, about 50% is spent on network transfers, which means only half a second is actually needed to submit the data and query a new workload.
For the future, I’m already working on a distributed server-side database to store and process the data even faster, but that is a longer-term effort. As long as the current database can serve all the crawlers’ requests, I will keep it in place.
So now I will most probably get back to work on the client, starting from the last ToDo:
- robots.txt querying and parsing;
- Display daily and monthly bandwidth totals;
- Limit the instantaneous bandwidth usage;
- Limit the monthly bandwidth threshold;
- Add a graph of the number of pages retrieved/parsed and the domains found;
- Save the bandwidth usage when the application is closed, to keep a daily total.
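For the robots.txt item above, Python’s standard library already provides a parser, which gives an idea of the logic the client will need. The rules below are a made-up example, not fetched from a real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this
# from http://<domain>/robots.txt before crawling the domain.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Check individual URLs against the parsed rules.
allowed = rp.can_fetch("*", "http://example.com/index.html")      # True
blocked = rp.can_fetch("*", "http://example.com/private/a.html")  # False
```

Whatever language the client is written in, the shape is the same: fetch the file once per domain, cache the parsed rules, and check each candidate URL before downloading it.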
Tomorrow I will try to post an updated version, so feel free to download, install, and run the application.
Regarding the statistics, there are now 4 205 860 links in the database from 59 393 domains, and 897 928 of those pages have already been parsed.