Thursday, January 20, 2011

Robots in action

What is robots.txt?

When a search engine crawler comes to your site, it looks for a special file called robots.txt. This file tells the search engine spider which Web pages of your site should be indexed and which should be ignored. It is only a set of instructions that well-behaved crawlers choose to follow; it is not a firewall or a kind of password protection. The robots.txt file is a simple text file (no HTML) that must be placed in your root directory, so that it is served at the path /robots.txt directly under your domain name.

The location of robots.txt is very important. It must be in the main directory because otherwise user agents (search engines) will not be able to find it: they do not search the whole site for a file named robots.txt. Instead, they look only in the main directory, and if they don't find it there, they simply assume that the site does not have a robots.txt file and therefore index everything they find along the way. So, if you don't put robots.txt in the right place, do not be surprised when search engines index your whole site.
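The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code any particular search engine runs; the function name robots_txt_url and the example.com address are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the only location a crawler checks for robots.txt.

    Hypothetical helper: whatever page the crawler is visiting, it keeps
    the scheme and host and replaces the rest with /robots.txt.
    """
    parts = urlsplit(page_url)
    # Drop the original path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # A page deep inside the site still maps to the root-level file.
    print(robots_txt_url("http://www.example.com/articles/2011/robots.html"))
    # prints http://www.example.com/robots.txt
```

A robots.txt file stored anywhere else, for instance under /articles/, never matches this address, which is why crawlers behave as if it did not exist.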

Where is it in action?

A robots.txt file is particularly useful in the following cases:
  1. When we have two versions of a page (one for viewing in the browser and one for printing), we'd rather have the printing version excluded from crawling; otherwise we risk a duplicate content penalty.
  2. When a site happens to contain sensitive data that we do not want the world to see, we will also prefer that search engines do not index those pages.
  3. Additionally, when we want to save some bandwidth by excluding images, stylesheets and JavaScript from indexing, we need a way to tell spiders to keep away from these items.
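The three cases above might translate into rules like the ones in the sketch below. The directory names (/print/, /private/, /images/, /css/, /js/) and the crawler name ExampleBot are assumptions chosen for illustration; your own site layout will differ. Python's standard urllib.robotparser module is used here only to show how a well-behaved crawler would read the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; every path below is an assumed example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /private/
Disallow: /images/
Disallow: /css/
Disallow: /js/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for path in ("/articles/robots.html",   # normal page
                 "/print/robots.html",      # printer-friendly duplicate
                 "/private/report.html",    # data not meant for the public
                 "/images/logo.png"):       # bandwidth-heavy asset
        verdict = "allowed" if parser.can_fetch("ExampleBot", path) else "blocked"
        print(path, verdict)
```

Keep in mind that these rules are only a request to crawlers; truly sensitive pages still need real access control.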

Why Does My Site Need One?

All search engines, or at least all the important ones, now look for a robots.txt file as soon as their spiders or bots arrive on your site. So, even if you currently do not need to exclude the spiders from any part of your site, having a robots.txt file is still a good idea: it can act as a sort of invitation into your site.
There are a number of situations where you may wish to exclude spiders from some or all of your site:
  1. You are still building the site, or certain pages, and do not want the unfinished work to appear in search engines.
  2. You have information that, while not sensitive enough to bother password protecting, is of no interest to anyone but those it is intended for, and you would prefer it did not appear in search engines.
  3. Most people will have some directories they would prefer were not crawled. For example, do you really need to have your cgi-bin indexed, or a directory that simply contains thank-you or error pages?
  4. If you are using doorway pages (similar pages, each optimized for an individual search engine) you may wish to ensure that individual robots do not have access to all of them. This is important in order to avoid being penalized for spamming a search engine with a series of overly similar pages.
  5. You would like to exclude some bots or spiders altogether, for example those from search engines you do not want to appear in or those whose chief purpose is collecting email addresses.
The very fact that search engines are looking for a robots.txt file is reason enough to put one on your site. Have you looked at your site statistics recently? If your stats include a section on 'files not found', you are sure to see many entries where search engine spiders looked for, and failed to find, a robots.txt file on your site.
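As a rough sketch of points 3 and 5 in the list above, the rules below shut one robot out completely while keeping every other robot away from a couple of directories. The robot names BadBot and GoodBot and the /drafts/ directory are invented for the example; the check again uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: "BadBot" is a made-up name for an unwanted robot.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Report whether the parsed rules let this robot fetch this path."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("BadBot", "GoodBot"):
        for path in ("/index.html", "/cgi-bin/form.pl"):
            print(agent, path, allowed(agent, path))
```

A robot that ignores robots.txt, as many email harvesters do, will not be stopped by these lines; they only work for robots that choose to obey them.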

