What you need to know about Robots.txt for Google
OK, who is ready to talk about Robots.txt files? Is everyone excited?! I’m sure there is nothing else you would rather be discussing: sports, celebrities, anything at all. But frankly, Robots.txt files are important to your online business and search engine optimization. So let’s take a few moments to review the most important Robots.txt points, as specified by Google.
You are always going to want to place your robots.txt file in the top-level directory of your site, immediately after the home page address. Here is an example:
Now, if you like, you can also place a robots.txt file at the top level of a subdomain, such as:
Or on a non-standard port, like:
But you cannot place a robots.txt file in a subdirectory, such as:
Why can’t you do this? Well, if “because Google says so” isn’t enough for you, you can visit this page for specifics. Here is a statement from Google on the subject.
“The robots.txt file must be in the top-level directory of the host, accessible though the appropriate protocol and port number. Generally accepted protocols for robots.txt (and crawling of websites) are “http” and “https”. On http and https, the robots.txt file is fetched using a HTTP non-conditional GET request.”
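To make the placement rule concrete, here is a minimal Python sketch that works out which robots.txt URL applies to a given page: same protocol, same host, same port, and always at the top level. The `example.com` addresses and the function name `robots_txt_url` are made up for illustration and are not taken from Google's documentation.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    The scheme and the host (including any explicit port) are kept.
    The path is always replaced by /robots.txt, and the query string
    and fragment are dropped.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Hypothetical pages: a subdirectory, a subdomain, a non-standard port.
    for page in (
        "http://example.com/folder/page.html",
        "http://blog.example.com/post",
        "https://example.com:8181/a/b?x=1",
    ):
        print(page, "->", robots_txt_url(page))
```

Notice that a page inside `/folder/` still maps to the robots.txt at the root, which is why a copy placed inside the subdirectory would never be consulted.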
What should you place in your Robots.txt file?
The main purpose of the robots.txt file is to keep search engine crawlers away from content that you do not want them to find. If you do not want to hide anything, your robots.txt file would look like this.
The sitemap is listed in the robots.txt file so that the search engine can find it easily. If you have something that you would like to hide from the search engines, like a calendar or junk section, that would look like this.
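As a rough illustration of the two cases just described, the Python sketch below feeds two robots.txt files to the standard library's `urllib.robotparser` and asks whether a crawler may fetch a URL. The file contents, the `/calendar/` and `/junk/` paths, and the `www.example.com` sitemap address are assumptions made up for this example. Python's parser is also not Google's own, so treat it as a sanity check rather than the final word.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file that hides nothing and only lists a sitemap.
ALLOW_ALL = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""

# Hypothetical file that keeps crawlers out of two sections.
HIDE_SECTIONS = """\
User-agent: *
Disallow: /calendar/
Disallow: /junk/

Sitemap: http://www.example.com/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if user_agent may crawl url under the given rules."""
    return load_rules(robots_txt).can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.com/calendar/2024.html"
    print("allow-all file:", is_allowed(ALLOW_ALL, page))
    print("hide-sections file:", is_allowed(HIDE_SECTIONS, page))
    # site_maps() requires Python 3.8 or newer.
    print("sitemaps:", load_rules(HIDE_SECTIONS).site_maps())
```

An empty `Disallow:` line means nothing is blocked, while each `Disallow: /path/` line blocks every URL whose path starts with that prefix.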
Robots Meta Tags
<meta name="robots" content="noindex" />
This particular meta tag (noindex) tells the Google search engine not to index the page. Google has to crawl the page to see the tag, though, so if the page is also blocked in robots.txt and many links are pointing into it, a rare situation for a poor piece of content, the page may still get indexed.
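If you want to check whether a page really carries the tag, a small script can do it. The sketch below uses Python's built-in `html.parser` to collect the directives from any robots meta tag. The class and function names (`RobotsMetaParser`, `has_noindex`) and the sample HTML are hypothetical, and the check only looks at the generic `robots` name, not crawler-specific names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex" /></head></html>'
    print(has_noindex(sample))
```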
If you would like to block non-HTML content, you cannot use a simple noindex meta tag, because there is no HTML to put it in. Instead, you’re going to need to use an X-Robots-Tag HTTP header. The X-Robots-Tag is sent along with the other HTTP response headers, which would look something like this.
$ curl -I "http://www.google.com/support/forum/p/Webmasters/search?hl=en&q=test"
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
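To see how you might check a response for this header, here is a small Python sketch that reads raw header text, like the output of `curl -I`, and reports whether an `X-Robots-Tag: noindex` is present. The sample PDF response and the function names (`parse_headers`, `blocks_indexing`) are invented for illustration. The sketch also ignores the crawler-specific form of the header (for example `googlebot: noindex`).

```python
# Hypothetical response for a PDF that should stay out of the index.
SAMPLE_RESPONSE = """\
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex
"""


def parse_headers(raw: str) -> dict:
    """Turn raw response text into a dict of lower-cased header names."""
    headers = {}
    for line in raw.splitlines()[1:]:  # skip the status line
        name, separator, value = line.partition(":")
        if separator:
            headers[name.strip().lower()] = value.strip()
    return headers


def blocks_indexing(headers: dict) -> bool:
    """Return True if the X-Robots-Tag header carries a noindex directive."""
    value = headers.get("x-robots-tag", "")
    directives = {part.strip().lower() for part in value.split(",")}
    return "noindex" in directives


if __name__ == "__main__":
    print(blocks_indexing(parse_headers(SAMPLE_RESPONSE)))
```

A response with no X-Robots-Tag line, such as one that only sends a Content-Type, would come back as not blocked.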
These are just some basic ways to get started with robots.txt. In most cases, websites do not need to hide anything from search engines. In fact, if you have content, why not let it get indexed, right? Generally, you need plenty of content for search engine marketing. Even so, sometimes certain things should be kept under wraps. If you have questions or comments about robots.txt, robots meta tags, the X-Robots-Tag, or Google crawlers, ask below!