While I’m still working on loop and duplicate detection, I have also been improving the crawler and the servers.
Here is a list of the latest updates.
- Progress on duplicate detection: it is working quite well, but still needs more work.
- Data sent by the crawlers is now stored separately so it can be checked and validated before being parsed.
- Link destinations and sources are now stored on the server, not just the URLs. Space has also been reserved to store the keywords related to each link.
- On the client side, a 30-second delay has been added between two calls to the same server to avoid being banned. I personally got banned from Amazon because the crawler was calling them too quickly (no delay).
- Duplicate pages are now reported so they can be analyzed.
- URL filtering/parsing has been added on both the client and the server side: on the client side to reduce the number of links sent to the server, and on the server side so that new filters can be deployed quickly even if not all the clients are updated.
- The application can now crawl and parse servers on different ports (not just 80) and protocols (not just HTTP).
- Robots.txt retrieval and parsing have been improved.
- Since the database can now hold multiple terabytes of data, the filter on .CA URLs has been removed.
- Crawlers are now given a much better randomly generated list of pages to parse.
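To give an idea of how duplicate pages can be spotted, here is a minimal sketch of content fingerprinting: hash a normalized version of the page body and keep a set of hashes already seen. The class and function names are illustrative, not the actual code running on the server.

```python
import hashlib

def content_fingerprint(html_text: str) -> str:
    # Collapse whitespace and lowercase so trivially different
    # copies of the same page produce the same hash.
    normalized = " ".join(html_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class DuplicateDetector:
    def __init__(self):
        self._seen = set()  # fingerprints of pages already stored

    def is_duplicate(self, html_text: str) -> bool:
        fp = content_fingerprint(html_text)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```

Exact-hash matching only catches byte-identical (after normalization) copies; near-duplicates need fuzzier techniques, which is part of the remaining work.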
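The 30-second per-server delay can be sketched like this: remember the last time each host was hit and compute how long to wait before hitting it again. This is an assumed implementation for illustration (the names `PoliteScheduler`, `wait_time`, and `record_fetch` are mine), injecting a clock so it can be tested without sleeping.

```python
import time
from urllib.parse import urlparse

HOST_DELAY_SECONDS = 30  # minimum gap between two requests to the same host

class PoliteScheduler:
    def __init__(self, delay=HOST_DELAY_SECONDS, clock=time.monotonic):
        self._delay = delay
        self._clock = clock
        self._last_fetch = {}  # host -> timestamp of the last request

    def wait_time(self, url: str) -> float:
        """Seconds to wait before this URL's host may be contacted again."""
        host = urlparse(url).netloc
        last = self._last_fetch.get(host)
        if last is None:
            return 0.0
        return max(0.0, self._delay - (self._clock() - last))

    def record_fetch(self, url: str) -> None:
        # Call this right after a request has actually been made.
        self._last_fetch[urlparse(url).netloc] = self._clock()
```

In practice the crawler would `time.sleep(scheduler.wait_time(url))` before fetching, or better, pick another host's URL from the queue instead of sleeping.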
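For robots.txt parsing, Python's standard library already provides `urllib.robotparser`; a sketch of how a crawler might use it on an already-downloaded robots.txt body (fetched once per host and cached) could look like this:

```python
from urllib.robotparser import RobotFileParser

def make_robot_rules(robots_txt: str) -> RobotFileParser:
    # Parse the raw robots.txt text; the resulting object answers
    # can_fetch(user_agent, url) queries for that host.
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp
```

Caching the parsed rules per host avoids re-downloading robots.txt on every request, which matters once the crawler handles millions of URLs.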
And here are the next updates I’m working on:
- On the client side, store the last 1,000 failing URLs to get an idea of the network issues the crawler is running into.
- On the server side, analyze the duplicated pages to see whether there is any pattern (URL parameter, loop, etc.) that can be detected and added to the filters.
- Add parsing of other document formats (PDF, etc.).
- Start work on page ranking.
- And all the remaining improvements on the client side (a user interface to update the parameters, etc.).
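Keeping only the last 1,000 failing URLs maps naturally onto a bounded ring buffer; in Python, `collections.deque` with `maxlen` does exactly that. A minimal sketch (the `FailureLog` name and its methods are illustrative):

```python
from collections import deque

MAX_FAILURES_KEPT = 1000  # only the most recent failures are retained

class FailureLog:
    def __init__(self, maxlen=MAX_FAILURES_KEPT):
        # Once full, appending silently drops the oldest entry.
        self._entries = deque(maxlen=maxlen)

    def record(self, url: str, reason: str) -> None:
        self._entries.append((url, reason))

    def recent(self):
        # Oldest-first list of (url, reason) pairs still retained.
        return list(self._entries)
```

Grouping the retained failures by reason (timeout, DNS error, connection refused) would then give a quick picture of what is going wrong on the network side.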
All the latest modifications are compatible with the existing client, so the running applications should not be affected. Regarding the data collected so far, I had to reset it: because of the lack of loop/duplicate detection, more than 60% of the content was duplicated. The data quality was not good enough, and cleaning it would have taken more time than crawling it again.
So the number of pages retrieved has dropped from 12M to only 3,000 entries, but it will grow back very quickly. A statistics page will be put in place soon to provide live information about the data size.