Today’s status

Today I did not get a chance to work on what I had planned yesterday. There are now 2 697 274 URLs in the database across 25 107 domains, and 893 001 pages have already been parsed.

The problem now is server performance. It takes 40 seconds for the server to process the results and almost 20 seconds to generate a workload. When a client sometimes parses an entire workload in 30 seconds, that means a single client is faster than the server… Useless. So I had to migrate the application to another server. The previous server was an old 1.2 GHz fanless computer with only 1 GB of RAM. The new server is much more capable and might be able to serve at least 60 clients. It's better, but still not enough. So in the meantime I will have to take a look at the database to see how I can reduce the load. I worked on that for a few hours today with no results. I will try to spend some more time on it over the weekend.

I also got some time to think about and work on cycle detection. The one I have put in place is quite nice: it's doing a good job on many entries and has reduced a lot of work. I figured there might still be many duplicates in the database, so maybe I will have to reset it. Anyway, I have a few ideas in my head to optimize all of that, and some of them will require a server data structure update… So I'm not sure I will be able to keep the current data, but I will try very hard to! More to come…

You can still download the client. I have updated it to version 0.0.4b. It's now very fast, so be careful with your bandwidth usage.

So from yesterday’s goals, here is what has been done:

  • Ensure the application title is always refreshed even when sending results or loading new workloads.
  • Continue to work on cycle detection.
  • Improve the server performance.

And here is what is still to be done:

  • Display daily and monthly totals for the bandwidth.
  • Limit the instant bandwidth usage.
  • Limit the monthly threshold.
  • Add a graph with the number of pages retrieved/parsed and the domains found.
  • Improve the server performance (needs further improvement).
  • Save the bandwidth usage when the application is closed to have a daily total.


Today’s status

As usual, some updates for today. But not that many.

First, the results. We are at 1 358 312 links found across 18 753 different domains. 629 909 pages have already been retrieved and parsed.

I tried to work on cycle detection for the URLs, with no success. The perfect example of the need for cycle detection is the page http://www.abilities.ca/agc/disclaimer.php, where there is a link inside (view source) which adds “&screenreader=on” to the URL. When you click on it you are redirected to http://www.abilities.ca/agc/disclaimer.php?&screenreader=on, where there is a link inside (view source) which adds “&screenreader=on” to the URL. When you click on it you are redirected to http://www.abilities.ca/agc/disclaimer.php?&screenreader=on&screenreader=on. I think you can imagine the rest.

One way to detect this could be to start from the end, read the last parameter, and check whether the one before it is the same. But what if there are 2 parameters? Like http://URL/path?&p1=a&p2=b&p1=a&p2=b: the last 2 parameters are not the same. OK, I can search for part of the string, like “p1=a&p2=b”. But what if the order changes? http://URL/path?&p1=a&p2=b&p2=b&p1=a will not be detected. Another solution is to parse all the parameters, remove the duplicates, and put them back. OK, but what if the URL is not using standard delimiters? Like http://URL/path?&p1=a;p2=b;p1=a;p2=b… It’s even worse when there are malformed URLs. And so on.

There are so many possibilities that I came to the conclusion that perfect cycle detection is, unfortunately, almost impossible.

So instead of wasting time on a very complicated cycle detection, I will implement a simple one. If you want to help and provide one, you are welcome! I will just parse the parameters, if there are any and if they follow a standard format, and remove any duplicates. If I find any “strange” character or format, I will simply skip the cycle detection and continue with the URL as-is. At some point the URL will be trimmed to its maximum length anyway, but if there is more than one link like that on a page, this can end up in thousands of retrievals of the same page. A rough sketch of the idea follows.
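
Here is a minimal Python sketch of that simple de-duplication, for illustration only; it is not the tool’s actual code, and the checks deciding what counts as a “strange” format are my own assumptions:

    from urllib.parse import urlsplit, urlunsplit

    def dedupe_query_params(url):
        """Remove duplicated query parameters, keeping the first occurrence.

        If the query looks non-standard (unusual delimiters, malformed pairs),
        return the URL untouched and skip cycle detection entirely.
        """
        parts = urlsplit(url)
        query = parts.query
        if not query:
            return url
        if ";" in query or " " in query:     # non-standard delimiter: skip detection
            return url
        seen = []
        for pair in query.split("&"):
            if not pair:
                continue                     # ignore the empty piece produced by "?&..."
            if "=" not in pair:
                return url                   # malformed pair: skip detection
            if pair not in seen:
                seen.append(pair)
        return urlunsplit(parts._replace(query="&".join(seen)))

    print(dedupe_query_params(
        "http://www.abilities.ca/agc/disclaimer.php?&screenreader=on&screenreader=on"))
    # -> http://www.abilities.ca/agc/disclaimer.php?screenreader=on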

I was also supposed to work on the bandwidth usage, and I have started that work. So far, the total bandwidth used for upload and download is displayed, and so is the instant bandwidth usage. The instant bandwidth usage is the average bandwidth used over the last 60 seconds, so it is only displayed after 60 seconds. Even though I tried to keep this information as accurate as possible by including header sizes, URL sizes, calls to the DistParser server, etc., there might be some communications that I’m missing and which therefore do not show up in the total. So this can be used to get a good idea of the tool’s bandwidth usage, but you might still need to track your bandwidth usage with your provider just to avoid any bad surprise. Over the next few weeks I will try to compare what the tool is giving me with my real bandwidth usage and see how close I am.
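
For the curious, this is roughly how a 60-second rolling average can be computed. It is a simple Python sketch, not the tool’s actual implementation, and the class and method names are invented for the example:

    import time
    from collections import deque

    class BandwidthMeter:
        """Rolling 60-second average of bytes transferred (one instance per direction)."""

        def __init__(self, window_seconds=60):
            self.window = window_seconds
            self.started = time.monotonic()
            self.samples = deque()   # (timestamp, byte_count) pairs
            self.total = 0           # lifetime total, for the upload/download counters

        def record(self, byte_count):
            """Call for every request/response, ideally including header and URL sizes."""
            self.samples.append((time.monotonic(), byte_count))
            self.total += byte_count

        def instant_rate(self):
            """Average bytes/second over the last window, or None for the first 60 seconds."""
            now = time.monotonic()
            if now - self.started < self.window:
                return None          # nothing displayed until a full window has elapsed
            while self.samples and now - self.samples[0][0] > self.window:
                self.samples.popleft()
            return sum(size for _, size in self.samples) / self.window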

I have updated the main loop of the tool to reduce the CPU usage. The previous version (0.0.2b) was using 100% of the CPU to refresh the display. Now the display is refreshed only once per second and the entire application uses less than 2%.
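
The change boils down to redrawing on a timer instead of on every loop iteration. A hedged sketch of the idea in Python (the crawler and display objects here are placeholders, not the tool’s real API):

    import time

    REFRESH_INTERVAL = 1.0   # redraw the display at most once per second

    def main_loop(crawler, display):
        last_refresh = 0.0
        while crawler.running:
            crawler.step()                        # do one small unit of crawl work
            now = time.monotonic()
            if now - last_refresh >= REFRESH_INTERVAL:
                display.refresh(crawler.stats())  # cheap compared to redrawing every pass
                last_refresh = now
            time.sleep(0.01)                      # yield the CPU instead of busy-waiting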

I have also put in place a configuration file, which you can find in your user directory. This configuration file contains the number of concurrent crawlers you want to use and the number of workloads you want to prepare. There is no validation done on those values, so using out-of-range values might cause the application to crash.
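
As an illustration only: the actual file name, format and key names are not documented here, so everything in this sketch is assumed. It just shows the two settings being read:

    from configparser import ConfigParser
    from pathlib import Path

    # Hypothetical file name, section and keys, purely for illustration.
    config_path = Path.home() / "distparser.cfg"

    config = ConfigParser()
    config.read(config_path)
    concurrent_crawlers = config.getint("crawler", "concurrent_crawlers", fallback=4)
    prepared_workloads = config.getint("crawler", "prepared_workloads", fallback=1)

    # The tool itself does no validation, so out-of-range values can crash it.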

One last thing: the title of the application displays information about the work being done. From left to right you will find, for each workload loaded, “Number of URLs to retrieve”/”Number of crawlers” followed by |. Then you have upload/download (in bytes) and then the instant bandwidth usage (see the sketch below).
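
To make the layout concrete, here is a rough Python sketch that assembles a title string in the format described above. The function name, the sample numbers and the exact separators are assumptions for illustration, not the tool’s real code:

    def build_title(workloads, uploaded, downloaded, instant_rate):
        # workloads: list of (urls_left_to_retrieve, crawler_count) pairs, one per loaded workload
        parts = ["{}/{}".format(urls, crawlers) for urls, crawlers in workloads]
        parts.append("{}/{}".format(uploaded, downloaded))   # total upload/download in bytes
        if instant_rate is not None:                         # only available after the first 60 seconds
            parts.append("{:.0f} B/s".format(instant_rate))
        return " | ".join(parts)

    print(build_title([(153, 4), (98, 4)], 1234567, 7654321, 2048.0))
    # -> 153/4 | 98/4 | 1234567/7654321 | 2048 B/s
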
Again, version 0.0.3b is available for download. Feel free to download and run it.

Todo for the next days:

  • Ensure the application title is always refreshed even when sending results or loading new workloads.
  • Save the bandwidth usage when the application is closed to have a daily total.
  • Display daily and monthly totals for the bandwidth.
  • Limit the instant bandwidth usage.
  • Limit the monthly threshold.
  • Add a graph with the number of pages retrieved/parsed and the domains found.
  • Continue to work on cycle detection.
  • Improve the server performance.

Today’s status

I got the chance to work a bit on the tool today, based on yesterday’s todo list.

Everything on that plan has been completed, except that there are still no participants 😉 But that’s normal, because I have not yet advertised this site anywhere.

As a result, the tool is much more stable. It reduces the load on any single server while still being fast, and it is more useful since we can stop it and restart it whenever we want.

The crawler worked almost all day long and there are now more than 1 million pages in the database, from more than 17 000 domains. As of today, 467 212 of those pages have been retrieved and parsed, which is a good number!

By looking at the results, I figured that some URLs are looping.

I mean, let’s consider http://distparser.com/thispage. If there is a link on this page that points to “current URL of the page” + “param=on”, I will get http://distparser.com/thispage?param=on, and on the next retrieval http://distparser.com/thispage?param=on&param=on, and so on. At some point it stops because of the maximum URL length, but I will have thousands of URLs in the database before that. Which means I need to find a way to avoid that. A toy simulation of the problem is shown below.
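
Just to illustrate how this snowballs, here is a tiny Python simulation. The maximum URL length is an assumption (the real limit the tool uses isn’t stated); the point is only that one self-appending link keeps producing new, ever-longer URLs until some cap is hit:

    MAX_URL_LENGTH = 2048                    # assumed cap; the real limit may differ
    base = "http://distparser.com/thispage"

    url = base
    duplicates = 0
    while len(url) <= MAX_URL_LENGTH:
        # Each retrieval of the page yields the same link with the parameter appended once more.
        url += "?param=on" if "?" not in url else "&param=on"
        duplicates += 1

    print(duplicates, "variants of the same page end up in the database before the cap is reached")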

I will put the new version for download in the next 10 minutes. Feel free to try it.

This is most probably what I will work on tomorrow.

So here is the todo for tomorrow:

  • Filter URL loops.
  • Start to work on the bandwidth limitation.


Today’s status

Today, after about 2 days of slow crawling and parsing, the bot has found about 15 000 domain names and 700 000 page names from the 177 000 pages retrieved, which is good for a start. The bot will be running longer today and tomorrow.

I’m currently updating the crawler architecture to improve its speed.

Todo for next days/releases:

  • Allow the crawler to be closed at any time by saving the current workload;
  • Allow the crawler to start again from a previously saved workload;
  • Allow the crawler to request another workload even if the previous one is not completely finished;
  • Allow the crawler to start working on the next workload when the previous one is close to completion (i.e. some workers are idle);
  • Limit the number of domains called by the crawler so as not to overwhelm the servers.
  • Find some participants 😉