How To Stop Search Engines From Indexing Certain Pages

By Bobby Martinez, Thursday, February 26, 2009

Did you know that you can stop Google from indexing certain pages on your website? You can, with a simple text file called “robots.txt”.

Robots.txt, as you might have already guessed, is a text file that you place on your website to give directions to the programs that crawl the web (web crawling bots, or “robots”). If you have a page that you don’t want any web crawler to access, you can tell the crawlers not to visit it.
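To make that concrete, here is a minimal sketch of how a well-behaved crawler consults robots.txt before fetching a page, using Python’s standard urllib.robotparser module. The example.com URLs and the “MyCrawler” agent name are placeholders, not part of the article’s example.

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt file.
robots = RobotFileParser()
robots.set_url("http://www.example.com/robots.txt")
robots.read()

# Before crawling any page, ask whether the rules permit it.
page = "http://www.example.com/private.html"
if robots.can_fetch("MyCrawler", page):
    print("Allowed to crawl", page)
else:
    print("robots.txt disallows", page)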

Let’s say you were running a personal website for MC Hammer, and you had a webpage about his finances that you didn’t want any search engine to crawl. Here’s what you would do:

1) Create a file called robots.txt in the root directory of your website. If your site were www.mchammer.com, the file would live at http://www.mchammer.com/robots.txt.

2) If you were trying to block web crawler access to “seenBetterDays.html”, the contents of the file would look like this:

User-agent: *
Disallow: /seenBetterDays.html
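You can sanity-check a rule like this without a live site by feeding the two lines above to Python’s standard urllib.robotparser module; this sketch asks about the blocked page and an ordinary one (index.html is just an illustrative filename):

from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /seenBetterDays.html
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The disallowed page is refused; everything else stays crawlable.
print(parser.can_fetch("*", "http://www.mchammer.com/seenBetterDays.html"))  # False
print(parser.can_fetch("*", "http://www.mchammer.com/index.html"))           # True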

If the file were in a subdirectory, the rule would look like this:

User-agent: *
Disallow: /subdirectoryname/seenBetterDays.html

The asterisk after User-agent denotes that this rule applies to all robots, not just Google, Yahoo, or any one robot specifically.
If you wanted to exclude the entire subdirectory, it would look like this:

User-agent: *
Disallow: /subdirectoryname/

This blocks web crawler access to all files within /subdirectoryname/. If you wanted to disallow the entire subdirectory except for the file “exception.html”, you would put this:

User-agent: *
Disallow: /subdirectoryname/
Allow: /subdirectoryname/exception.html
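One caveat worth knowing: Allow is not part of the original 1994 robots.txt standard, though major crawlers such as Googlebot support it, and parsers differ on how they resolve overlapping rules. Google picks the most specific (longest) matching rule regardless of order, while Python’s urllib.robotparser applies rules in file order, so the sketch below lists the more specific Allow line first (secrets.html is just an illustrative filename):

from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Allow: /subdirectoryname/exception.html
Disallow: /subdirectoryname/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The exception stays reachable; the rest of the subdirectory is blocked.
print(parser.can_fetch("*", "http://www.mchammer.com/subdirectoryname/exception.html"))  # True
print(parser.can_fetch("*", "http://www.mchammer.com/subdirectoryname/secrets.html"))    # False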

Finally, if you decided that you’ve had enough of the internet and all its pervasive indexing and searching, you would put in this content:

User-agent: *
Disallow: /

This basically means that no well-behaved robot will visit www.mchammer.com again until you remove the file. Keep in mind that robots.txt is purely advisory: reputable crawlers honor it, but nothing technically prevents a misbehaving bot from ignoring it.
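As a final check, this sketch (again with Python’s urllib.robotparser; the paths are placeholders) confirms that under a blanket Disallow, every URL on the site reports as off-limits:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

# Every path on the site is now disallowed.
for path in ("/", "/index.html", "/subdirectoryname/seenBetterDays.html"):
    print(path, parser.can_fetch("*", "http://www.mchammer.com" + path))  # all False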

There are many, many more uses for robots.txt, but we’ve covered the basics in this article: if you have a webpage or group of webpages that you don’t want crawled, you can add this one file and keep your information a little more private. Just remember that robots.txt is itself publicly readable, so anyone can fetch it and see exactly which pages you’d rather hide.
