If you want to participate in this project, simply download the application and run it. No installation is required: it runs from wherever it is stored. It is written in Java, so the only requirement is a recent Java Virtual Machine; most recent computers already have one installed.
You can run the application with its frontend to see what it is doing, or run it server-side without a graphical interface (not available yet). You can stop the application whenever you want, but please allow it a few seconds to save the work in progress.
When the application is launched for the first time, it creates several files in the directory from which it is run.
First, you will see a “robotCache” directory. It stores the robots.txt files already retrieved, to limit the bandwidth used fetching them again.
The bandwidth.log file stores your bandwidth usage, so the tool can display what has been used and stop when it reaches the configured limit.
The crawler.dat file stores the crawler workloads.
DistParser.log stores the tool’s logs. This is the file we will ask you to share if you are facing any issue with the application.
english.txt and francais.txt are dictionary files downloaded by the application for its own use.
And last, crawler.properties contains the crawler configuration.
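Assuming crawler.properties is a standard Java properties file (its name suggests so, but this is an assumption), a configuration loader could look like the sketch below. The key names used here are hypothetical, chosen only for illustration; check the generated crawler.properties for the real ones.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class CrawlerConfig {
    // Parse an integer option, falling back to a default when the key is absent.
    static int intOption(Properties props, String key, int defaultValue) {
        return Integer.parseInt(props.getProperty(key, Integer.toString(defaultValue)));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // On first launch crawler.properties may not exist yet; defaults then apply.
        try (FileInputStream in = new FileInputStream("crawler.properties")) {
            props.load(in);
        } catch (IOException e) {
            // File missing or unreadable: keep the empty Properties, i.e. all defaults.
        }
        // Hypothetical key names -- not taken from the real application.
        int maxDownload = intOption(props, "maxDownloadSpeed", 204800); // 200 kB/s in bytes
        int walkers = intOption(props, "walkerCount", 10);
        System.out.println(walkers + " walkers, " + maxDownload + " B/s download cap");
    }
}
```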
If you want to set up the application with your own parameters, here is the list of options the application accepts and what they are used for.
Configure the maximum download speed, in bytes per second. Default value is 200 kB/s.
Configure the number of walkers to run at the same time. The more you have, the faster the download. Default value is 10.
Configure the maximum upload speed, in bytes per second. Default value is 40 kB/s.
Configure the daily usage limit. Default value is 1 GB.
Configure the monthly usage limit. Default value is 40 GB.
Configure the first day of the month for your monthly usage. Useful when your billing cycle does not start on the 1st. Default value is 1.
Configure the number of workloads the crawler should always keep so that walkers have work available. No more than 3 walkers can work on the same domain (based on the domain name), so it is recommended to keep more than one workload at a time in case there are many URLs to parse from the same domain; that way, walkers can pick up other workloads.
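The per-domain limit described above can be sketched with a map of semaphores: a walker must acquire a permit for its domain before starting, so at most three walkers ever work on the same domain at once. This is only an illustrative sketch of the rule, not the project’s actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class DomainLimiter {
    private static final int MAX_WALKERS_PER_DOMAIN = 3;
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    // Returns true if a walker may start on this domain,
    // false if 3 walkers are already active on it.
    public boolean tryAcquire(String domain) {
        return permits
                .computeIfAbsent(domain, d -> new Semaphore(MAX_WALKERS_PER_DOMAIN))
                .tryAcquire();
    }

    // A walker releases its permit when it finishes its workload,
    // freeing the slot for another walker on the same domain.
    public void release(String domain) {
        Semaphore s = permits.get(domain);
        if (s != null) {
            s.release();
        }
    }
}
```

With such a cap, a fourth workload from the same domain would leave a walker idle, which is why keeping several workloads from different domains queued keeps all walkers busy.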