As usual, some updates for today, but not that many.
First, the results. We are at 1 358 312 links found across 18 753 different domains, and 629 909 pages have already been retrieved and parsed.
I tried to work on cycle detection for the URLs, without success. A perfect example of the need for cycle detection is the page http://www.abilities.ca/agc/disclaimer.php, which contains a link (view the source) that adds “&screenreader=on” to the URL. When you click on it, you are redirected to http://www.abilities.ca/agc/disclaimer.php?&screenreader=on, which again contains a link that adds “&screenreader=on” to the URL. When you click on that one, you are redirected to http://www.abilities.ca/agc/disclaimer.php?&screenreader=on&screenreader=on. I think you can imagine the rest.
One way to detect this could be to start from the end of the URL, read the last parameter, and check whether the one before it is the same. But what if there are two parameters, like http://URL/path?&p1=a&p2=b&p1=a&p2=b? The last two parameters are not the same. OK, I can search for part of the string, like “p1=a&p2=b”. But what if the order changes? http://URL/path?&p1=a&p2=b&p2=b&p1=a would not be detected. Another solution is to parse all the parameters, remove the duplicates, and put them back. Fine, but what if the URL does not use standard delimiters, like http://URL/path?&p1=a;p2=b;p1=a;p2=b? It’s even worse when the URL is malformed. And so on.
There are so many possibilities that I came to the conclusion that perfect cycle detection is, unfortunately, almost impossible.
So instead of wasting time on a very complicated cycle detection, I will implement a simple one (if you want to help and provide a better one, you are welcome!). I will parse the parameters, if there are any and if they have a standard format, and remove any duplicates. If I find any “strange” character or format, I will simply skip the cycle detection and continue with the URL as-is. At some point the URL will be trimmed to its maximum length anyway, but if there is more than one such link on a page, this can still end up retrieving the same page thousands of times.
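To make that a bit more concrete, here is a minimal Python sketch of the deduplication idea (not the tool’s actual code, which may well be in another language): it only touches query strings that look standard and leaves anything with unusual delimiters untouched.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def dedupe_query(url: str) -> str:
    """Drop duplicated key=value pairs from the query string, keeping
    the first occurrence and the original parameter order."""
    parts = urlsplit(url)
    query = parts.query
    # Only attempt deduplication on "standard" queries; unusual
    # delimiters such as ";" are left alone, as described above.
    if not query or ";" in query:
        return url
    seen = set()
    kept = []
    for key, value in parse_qsl(query, keep_blank_values=True):
        if (key, value) not in seen:
            seen.add((key, value))
            kept.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(dedupe_query("http://www.abilities.ca/agc/disclaimer.php?&screenreader=on&screenreader=on"))
# -> http://www.abilities.ca/agc/disclaimer.php?screenreader=on
```

Since duplicates are removed regardless of their position, this also handles the reordered case like http://URL/path?&p1=a&p2=b&p2=b&p1=a.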
I was also supposed to work on the bandwidth usage, and I have started. So far, the total bandwidth used for upload and download is displayed, as well as the instant bandwidth usage. The instant bandwidth usage is the average bandwidth used over the last 60 seconds, so it is only displayed after 60 seconds. Even though I tried to keep this information as accurate as possible by including header sizes, URL sizes, calls to the DistPaser server, etc., there might be some communications that I’m missing and which, in the end, are missing from the total. So this can be used to get a good idea of the tool’s bandwidth usage, but you might still want to track your bandwidth usage with your provider just to avoid any bad surprise. Over the next few weeks I will try to compare what the tool reports with my real bandwidth usage and see how close it is.
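The instant usage is essentially a rolling average; a sketch of that idea (my own illustration, not the tool’s code) could look like this:

```python
import time
from collections import deque

class BandwidthMeter:
    """Tracks total up/down bytes and a rolling 60-second average."""

    WINDOW = 60  # seconds

    def __init__(self):
        self.total_up = 0
        self.total_down = 0
        self.samples = deque()          # (timestamp, bytes) in the window
        self.started = time.monotonic()

    def record(self, up: int, down: int) -> None:
        """Call this after every exchange, with header and URL sizes
        included in the byte counts."""
        now = time.monotonic()
        self.total_up += up
        self.total_down += down
        self.samples.append((now, up + down))
        # Drop samples that fell out of the 60-second window
        while self.samples and now - self.samples[0][0] > self.WINDOW:
            self.samples.popleft()

    def instant_rate(self):
        """Average bytes per second over the last 60 seconds, or None
        until the tool has been running for at least that long."""
        if time.monotonic() - self.started < self.WINDOW:
            return None
        return sum(b for _, b in self.samples) / self.WINDOW
```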
I have updated the main loop of the tool to reduce CPU usage. The previous version (0.0.2b) was using 100% of the CPU to refresh the display. Now the display is refreshed only once per second, and the entire application uses less than 2%.
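The change is basically throttling the display refresh instead of redrawing on every iteration; a rough sketch of that pattern (illustrative only, not the actual main loop):

```python
import time

REFRESH_INTERVAL = 1.0  # refresh the display at most once per second

def main_loop(do_work, refresh_display):
    """Run the crawl work continuously but redraw only once per second,
    so the loop no longer pegs a CPU core just for the display."""
    last_refresh = 0.0
    while True:
        do_work()                        # retrieve / parse / send results
        now = time.monotonic()
        if now - last_refresh >= REFRESH_INTERVAL:
            refresh_display()
            last_refresh = now
        time.sleep(0.05)                 # yield the CPU between iterations
```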
I have also put in place a configuration file, which you can find in your user directory. This configuration file contains the number of concurrent crawlers you want to use and the number of workloads you want to prepare. There is no validation made on those values, so using out-of-range values might cause the application to crash.
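Since the file format is not documented here, treat the following as a hypothetical example only: the real file name, section, and key names in your user directory will likely differ, but reading (and clamping) the two values could look something like this:

```python
import configparser
from pathlib import Path

config = configparser.ConfigParser()
# Hypothetical file name and keys, for illustration only
config.read(Path.home() / "crawler.ini")

crawlers  = config.getint("crawler", "concurrent_crawlers", fallback=4)
workloads = config.getint("crawler", "prepared_workloads", fallback=2)

# The tool itself does no validation, so keeping the values in a sane
# range yourself avoids crashes from out-of-range settings.
crawlers  = max(1, min(crawlers, 64))
workloads = max(1, min(workloads, 16))
```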
One last thing: the title of the application displays information about the work being done. From left to right you will find, for each workload loaded, “number of URLs to retrieve”/“number of crawlers” followed by “|”. Then you have upload/download (in bytes), and then the instant bandwidth usage.
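Put together, the title line described above could be produced by something like this (the names and the “B/s” unit are my own assumptions, not necessarily what the tool prints):

```python
def format_title(workloads, upload, download, rate):
    """Build the title from per-workload (urls_to_retrieve, crawlers)
    pairs, total upload/download in bytes, and the instant bandwidth."""
    parts = [f"{urls}/{crawlers}" for urls, crawlers in workloads]
    parts.append(f"{upload}/{download}")
    if rate is not None:
        parts.append(f"{rate:.0f} B/s")
    return " | ".join(parts)

# Example: one workload with 250 URLs left to retrieve and 4 crawlers
print(format_title([(250, 4)], 1048576, 5242880, 3500))
# -> 250/4 | 1048576/5242880 | 3500 B/s
```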
Again, version 0.0.3b is available for download. Feel free to download and run it.
To do for the next few days:
- Ensure the application title is always refreshed even when sending results or loading new workloads.
- Save the bandwidth usage when the application is closed to have a daily total.
- Display daily and monthly totals for the bandwidth.
- Limit the instant bandwidth usage.
- Enforce a monthly bandwidth threshold.
- Add a graph with the number of pages retrieved/parsed and the number of domains found.
- Continue to work on cycle detection.
- Improve the server performance.