How to discard looping pages part 1.

I don’t even know how long this will last, so I already titled it “part 1” since I’m sure there will be more to come.

The idea here is to avoid page loops. What I mean by a page loop is a page that points to itself, but where the URL and/or content is slightly different. There can be multiple kinds of loops. Let’s look at a few of them and see how (if possible) they can be detected and fixed.

Kinds of loops

Loop in parameters

The first kind of loop I figured is the one based on the URL parameters.

For example, http://domain.com/url?p=false is a page containing two links which point back to the page’s own URL with “p=true” or “p=false” appended. That gives you a link to http://domain.com/url?p=false&p=false and a link to http://domain.com/url?p=false&p=true .

Since both links point to the same page, that page will, again, build 2 new links with 2 new URLs pointing to the same page, and so on.

Loop in the path

The same way you can have loops in parameters, you can have loops in the path. Let’s imagine someone configured /url as an application on his server. This application builds a page and puts links into it by adding /xxx and /yyy to the current URL. You will end up with a /url page containing /url/xxx and /url/yyy links. If you follow the first link, you will land on a page which contains /url/xxx/xxx and /url/xxx/yyy, and so on.

Loop with different URL

I figured out this 3rd kind of loop only yesterday. It happens when a page references itself, without adding any parameter or path, but simply by changing one parameter’s value. It can be a session id if your cookies are disabled, but it can also be a timestamp. For example, http://domain.com/url?timestamp=123456789 displays a page where you have a link to /url. If the page automatically adds the timestamp to its links, you will have another link to http://domain.com/url?timestamp=123468126 in this page. You will land on the same page, displaying the same content, but with a different URL.

Loop detection

Loop in parameters

This kind of loop is already detected and corrected by the crawler. Basically, all the parameters are retrieved and only one occurrence of each is put back into the URL. In the end, all the duplicates are removed, which reduces the list of possible pages to a minimum. So web crawlers seem to “simply” remove the duplicated parameters.
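
To make that concrete, here is a minimal sketch in Python of what such a normalization could look like; the use of urllib.parse and keeping only the first occurrence of each parameter are my own illustration, not necessarily what a given crawler does internally:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def dedupe_query_params(url):
    # Keep only the first occurrence of each query parameter name.
    parts = urlsplit(url)
    seen = {}
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name not in seen:
            seen[name] = value
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(seen), ""))

# http://domain.com/url?p=false&p=false&p=true collapses back to http://domain.com/url?p=false
print(dedupe_query_params("http://domain.com/url?p=false&p=false&p=true"))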

Loop in the path

Like many other web crawlers do, we can detect a loop in the path using a simple regular expression which checks whether the same sub-directory appears 3 times or more in the URL. For example, http://domain.com/a/b/c/b/c/b/c contains “/b/c” 3 times and will therefore be discarded.
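
As an illustration, here is a sketch of that check in Python, assuming the “3 repetitions” threshold above; the exact pattern a real crawler uses may differ:

import re

# A path is considered a loop if some sequence of full path segments
# appears 3 times or more in a row, e.g. "/b/c" in /a/b/c/b/c/b/c.
LOOP_PATTERN = re.compile(r"((?:/[^/]+)+?)\1{2,}")

def has_path_loop(path):
    return LOOP_PATTERN.search(path) is not None

print(has_path_loop("/a/b/c/b/c/b/c"))  # True  -> discard the URL
print(has_path_loop("/a/b/c/d"))        # False -> keep it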

Loop with different URL

Some existing websites have similar issues which might cause some trouble for bots. A very good example is Amazon’s “About Amazon” page. The link to it (sorry, it’s a bit long) is: http://www.amazon.com/Careers-Homepage/b/ref=amb_link_5763692_2?ie=UTF8&node=239364011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=left-4&pf_rd_r=03WW7BQHKJ96NQ6YJE8M&pf_rd_t=101&pf_rd_p=1337714942&pf_rd_i=239364011. If you click on this link, you will land on the Amazon page. If you look at the left, you have a link to the current page you are browsing, under the name “About Amazon”. Open this link in a new window and compare the 2 pages. First, you will notice that the content is different: the articles displayed at the bottom are not the same, so it’s not the same page content. Now look at the URL. It is almost the same as the previous one, but with a different pf_rd_m parameter.

So for the bot, it’s new content with a new URL, so it’s a new page. Wrong. It’s the same page. But there is almost no way to detect it, and the crawler will read both, follow the links, read the new pages, and so on, forever, until you find a way to detect that. SESSIONID and other jsessionid-like parameters cause similar issues by referencing the same page multiple times with different URLs.

I searched over the web and read a lot about this during the last few days, but it seems there is no really good solution. However, a few filters can be put in place to reduce the amount of duplicates. The first one is to remove known session parameters like jsessionid and others from the URLs. The second is to hash the page content and discard it on the server side if another page on the same website has the same hash code.

Even with those 2 filters, the Amazon URL above will still be considered as a new page. Removing all the page parameters, as some crawlers do, would solve this issue, but that would also discard too many pages, like PHP forum threads, etc. So I will have to think about some additional filters to prevent such pages from being retrieved.
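
Here is a rough sketch of those two filters in Python; the list of session parameter names and the SHA-1 fingerprint are only assumptions to illustrate the idea, not a complete solution:

import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list of parameters assumed to carry only session state.
SESSION_PARAMS = {"jsessionid", "phpsessid", "sessionid", "sid"}

def strip_session_params(url):
    # Filter 1: drop known session parameters so the same page maps to a single URL.
    parts = urlsplit(url)
    kept = [(name, value) for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

seen_hashes = set()

def is_duplicate_content(html):
    # Filter 2: hash the page body and discard the page if another page
    # on the same website already produced the same hash.
    digest = hashlib.sha1(html.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

As noted above, this still would not catch the Amazon case, since both the URL and the rendered content differ slightly on every request.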

Other issues

Even if it’s possible to detect many kinds of loops, there are some loops which will never be detected. Someone can build a web crawler trap (a.k.a. bot trap). It’s as simple as an application which handles all the requests to a server and serves each of them with some random links and content, pointing to the same domain name. Since both the URL and the content are always different, you have no way to identify that your bot is trapped and keeps calling the same single page/application.

 
