About Robots.txt file
It is great when search engines frequently visit your site and index your content, but there are often cases when indexing parts of your online content is not what you want. You can control this by creating a simple text file in the root directory of your server and naming it exactly robots.txt. In other words, robots.txt is a file placed on your server that tells search engine spiders not to crawl or index certain sections or pages of your site. You can use it to prevent indexing entirely, keep certain areas of your site from being indexed, or issue individual indexing instructions to specific search engines.
One thing to note here is that robots.txt is by no means mandatory for search engines; it is a voluntary convention, but well-behaved search engines generally obey what they are asked not to do.
How and where to create it?
The file itself is a simple text file, which can be created in Notepad or whatever your favorite text editor is. It needs to be saved to the root directory of your site, which is the directory where your home page or index page lives. Misspelling the name is a common mistake, so be sure to name it exactly "robots.txt" (all lowercase) and not "robot.txt".
Structure of Robots.txt file
The structure of a robots.txt file is pretty simple (and not very flexible): it is a list of user agents and the files and directories disallowed for each. Basically, the syntax is as follows:
User-agent: [Spider or Bot Name]
Disallow: [Directory or Specific File Name]
User-agent: names the search engine crawler or bot the rules apply to, and Disallow: lists the files and directories to be excluded from indexing. If you want to include comment lines, just put the # sign at the beginning of the line:
# All user agents are disallowed to index the secure directory.
User-agent: *
Disallow: /secure/
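As a quick sketch, you can check how such rules behave with Python's standard-library urllib.robotparser module (the www.example.com host name and the file names below are just placeholders):

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, fed to the parser line by line.
rules = """\
# All user agents are disallowed to index the secure directory.
User-agent: *
Disallow: /secure/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# /secure/ is blocked for every agent; everything else is allowed.
print(rp.can_fetch("AnyBot", "http://www.example.com/secure/data.htm"))  # False
print(rp.can_fetch("AnyBot", "http://www.example.com/index.htm"))        # True
```

Note that can_fetch answers the question from the crawler's point of view: "may this user agent fetch this URL under these rules?"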
Examples of Usage
A few examples will make it clearer how to properly write the contents of a robots.txt file.
Exclude all robots from the entire web site
User-agent: *
Disallow: /
Allow all robots to index the entire web site
The only difference from the example above is that you omit the trailing '/' in the Disallow line. Alternatively, you can create an empty robots.txt file, or not create one at all.
User-agent: *
Disallow:
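A small sanity check with Python's urllib.robotparser (the host name is a placeholder) confirms that an empty Disallow value allows everything:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# An empty Disallow value means "nothing is disallowed".
rp.parse(["User-agent: *", "Disallow:"])

print(rp.can_fetch("AnyBot", "http://www.example.com/anything.htm"))  # True
```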
Exclude a part of your web site
e.g. if you wish to exclude some directories (but not all), you can use the following syntax.
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /secure/
Disallow: /temp/
Exclude a single bot from the entire web site
(Slurp is the name of Yahoo's crawler.)
User-agent: Slurp
Disallow: /
Allow a single bot
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
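To illustrate how the two groups interact, here is a check with Python's urllib.robotparser (the host name is a placeholder): a bot follows the group that names it and ignores the catch-all * group.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot matches its own group (nothing disallowed);
# every other bot falls through to the catch-all group.
print(rp.can_fetch("Googlebot", "http://www.example.com/page.htm"))  # True
print(rp.can_fetch("Slurp", "http://www.example.com/page.htm"))      # False
```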
Exclude a file from an Individual Search Engine
e.g. you want to exclude your mydata.htm file, which is placed under the 'secure' directory, from Google. (The name of the Google bot that indexes pages is Googlebot.)
User-agent: Googlebot
Disallow: /secure/mydata.htm
Exclude a file from all Search Engines
User-agent: *
Disallow: /secure/mydata.htm
Handling Complex Situations
You can also combine multiple rules one after another to handle complex situations. Let's walk through a slightly more complicated example step by step.
(1) First you would ban all search engines from the directories you do not want indexed at all:
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /secure/
Disallow: /temp/
(2) Next, suppose you want to exclude Yahoo from the entire web site:
User-agent: Slurp
Disallow: /
(3) Further, if you want to exclude Google from indexing the images from your web site:
User-agent: Googlebot-Image
Disallow: /Images/
Disallow: /Public/Images/
(4) Again, if you want to exclude certain files from all spiders:
User-agent: *
Disallow: /private/mybio.htm
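One caveat when assembling the four steps into a single file: parsers treat the rules for a given user agent as one group, and some only honor the first group they find for that agent, so it is safest to merge the Disallow lines for User-agent: * from steps (1) and (4) into a single block. A sketch of the combined file, checked with Python's urllib.robotparser (host and file names are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Combined robots.txt from steps (1)-(4); the two "User-agent: *"
# groups are merged into one block, since some parsers only honor
# the first matching group for a given agent.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /secure/
Disallow: /temp/
Disallow: /private/mybio.htm

User-agent: Slurp
Disallow: /

User-agent: Googlebot-Image
Disallow: /Images/
Disallow: /Public/Images/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A generic bot is blocked from the listed paths but nothing else;
# Slurp is blocked everywhere; Googlebot-Image only from the images.
print(rp.can_fetch("SomeBot", "http://www.example.com/private/mybio.htm"))     # False
print(rp.can_fetch("SomeBot", "http://www.example.com/index.htm"))             # True
print(rp.can_fetch("Slurp", "http://www.example.com/index.htm"))               # False
print(rp.can_fetch("Googlebot-Image", "http://www.example.com/Images/a.gif"))  # False
print(rp.can_fetch("Googlebot-Image", "http://www.example.com/index.htm"))     # True
```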