
The core purpose of following effective and genuine internet marketing strategies is to attract the attention of the search engines so that they visit your website frequently. It is good that search engines visit and index your website content, but situations arise when you do not want some parts of that content to be indexed or accessed. For example, suppose you have two versions of a web page: one to be viewed in the browser and another meant only for printing. If you do not exclude the printing version and allow both pages to be crawled and indexed, a duplicate content penalty will very likely be imposed on your website. Apart from that, excluding some images, style sheets and JavaScript files from being crawled helps save bandwidth as well. In both cases, you need to separate the content you want indexed from the content you do not, and this is exactly where robots.txt files play an important role.
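As a minimal sketch of the duplicate content scenario above, a robots.txt file could keep crawlers out of a hypothetical /print/ directory (the path is purely illustrative) while the browser version of each page stays crawlable:

    User-agent: *
    Disallow: /print/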

You can go for other alternatives as well, such as password-protected files or the robots meta tag, but these are not as effective as robots.txt. Not every crawler reads meta tags, and under such circumstances a robots meta tag may go unnoticed by some search engines. Hence, a more reliable way to instruct the search engines about the files that are not for indexing is robots.txt.

What is Robots.txt?

Robots.txt implements the Robots Exclusion Protocol (REP), which regulates the behavior of search engine crawling and indexing. It is not an HTML file; it is a plain text file that tells the search engines which pages you would not like them to visit and index. It is important to mention here that honoring robots.txt is not mandatory for search engines, but most of them generally obey its instructions. It does not physically prevent access to the files you want kept out; rather, it defines how a search engine spider should interact with the files and pages that your website contains. It is a very simple text file, placed on the server, articulating something like "Please do not enter here".

Robots.txt to block unwanted bots

The functionality of the robots.txt file is closely tied to where it is placed. The file must be located in the main (root) directory; otherwise, the user agents (search engines) will not find it. Search engines do not search the entire website for a file named robots.txt. They look in the main directory, and if there is no robots.txt file there, they assume the site does not have one. This simply leads to all files being indexed, which is why the file must be placed in the main directory.
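For instance, for a hypothetical site at www.example.com, crawlers only request the file from the root of the host:

    https://www.example.com/robots.txt           (found and obeyed)
    https://www.example.com/folder/robots.txt    (not consulted by crawlers)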

The Robots Exclusion Protocol consists of the following:

  • The original REP from 1994 defines the crawler directives used in robots.txt. Some search engines additionally support URI pattern extensions on top of it.
  • The extension of the protocol from 1996 defines indexer directives for use in robots meta elements, known as the "robots meta tag". Since search engines also support REP tags sent in an HTTP header, webmasters have the flexibility to apply them to non-HTML resources such as PDF files, as shown in the sketch below.
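As a brief sketch of the indexer directives mentioned above: an HTML page can carry the directive inside its head element,

    <meta name="robots" content="noindex, nofollow">

while a non-HTML resource such as a PDF can send the equivalent instruction as an HTTP response header (the X-Robots-Tag header is supported by major engines such as Google and Bing):

    X-Robots-Tag: noindex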

Different effective ways to structure robots.txt files

This text file allows various parts of a website to be blocked, and the best part is that one can define exactly which file or files should be blocked. Below are some of the common and useful structures of the file. Before we proceed, remember that the file name is always written in lower case: robots.txt.

A robots.txt file may contain more than one record, but a single-record file looks like the following:

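A minimal single-record sketch, with three illustrative directories excluded for every crawler, looks like this:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /~joe/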

In the example above, three directories are excluded. Remember that you need a separate "Disallow" line for every URL prefix you want to exclude; merging the URLs into a single line will not work. You also cannot have blank lines within a record, since a blank line separates multiple records. The "*" in the User-agent field has a special meaning: it stands for "any robot".

Below are some more specific structures showing how to define instructions for the user agent.

Excluding all the robots from the server

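A minimal sketch that blocks every crawler from the entire server:

    User-agent: *
    Disallow: /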

Allowing complete access to all the robots

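A sketch that grants every crawler complete access; an empty Disallow value blocks nothing (having no robots.txt file at all has the same effect):

    User-agent: *
    Disallow: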

Excluding the robots from the different server parts

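A sketch that keeps all crawlers out of several parts of the server; the directory names are illustrative:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /junk/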

Excluding a single robot

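A sketch that shuts out one particular crawler while leaving all others unaffected; "BadBot" is a placeholder for the user agent name you want to exclude:

    User-agent: BadBot
    Disallow: /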

Allowing a particular robot

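A sketch that allows only one particular crawler and excludes all the rest; the blank line separates the two records, and "Googlebot" is used here only as an example of a user agent name:

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /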

Excluding all the files except a single

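The original 1994 syntax has no "Allow" directive, so the common workaround is to move everything that should be blocked into a separate directory and disallow only that directory. Major search engines such as Google also support an "Allow" extension, so a sketch using it (with purely illustrative names) could look like this:

    User-agent: *
    Allow: /public/index.html
    Disallow: /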

Priorities of robots.txt

There are a few priorities that one needs to follow when using robots.txt files. The very first concern is whether your website needs a robots.txt file at all, and if you are using one, you need to make sure it is not preventing the search engines from indexing the files you do want indexed.

There are free tools that can tell you whether your robots.txt is blocking something important that Google and other search engines need to see. So it is very important to first determine whether your site needs a robots.txt file at all.

Reasons to have a robots.txt file for your website

  • The site contains content that should not be indexed by the search engines.
  • Paid links or advertisements on the site require special instructions for robots.
  • You want to give reputable robots convenient access to your site.
  • You are building a site on a live server but do not want the search engines to index it yet.
  • In some situations, robots.txt helps you follow important Google guidelines.

All of these needs can be addressed through other alternatives, but using robots.txt is an assured way to take care of them.