Internet Marketing Tips, Suggestions, & Ramblings

Introduction to Robots.txt for Google

What you need to know about Robots.txt for Google

OK, who is ready to talk about Robots.txt files? Is everyone excited?! I’m sure there are plenty of things you would rather be discussing: sports, celebrities, just about anything else. But frankly, Robots.txt files are important to your online business and search engine optimization. So let’s take a few moments to review the most important Robots.txt points, as specified by the Google search engine.

Google Code Search can be your best friend in times like this, especially if you need to create an advanced robots.txt file for a large site. But even then, it still helps to have experience on your side.

Robots.txt Location

You are always going to want to place your robots.txt file in the top-level (root) directory of your site, directly under the home page. Here is an example:

http://www.example.com/robots.txt

Now, if you like, you can also place a robots.txt file at the top level of a subdomain, or of a host served on a non-standard port. Each host and port combination is treated separately and needs its own file.

But you cannot place a robots.txt file in a subdirectory such as:

http://example.com/pages/robots.txt

Why can’t you do this? Well, if “because Google says so” isn’t enough for you, Google’s robots.txt documentation has the specifics. Here is a statement from Google on the subject.

“The robots.txt file must be in the top-level directory of the host, accessible though the appropriate protocol and port number. Generally accepted protocols for robots.txt (and crawling of websites) are “http” and “https”. On http and https, the robots.txt file is fetched using a HTTP non-conditional GET request.”
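To make the location rule concrete, here is a minimal Python sketch that works out where a crawler would look for robots.txt given any page URL. The function name robots_txt_url and the example hosts (blog.example.com, port 8181) are made up for illustration; the only thing taken from Google's statement is that the file lives at the top level of the host, on the same protocol and port.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    Keeps the scheme, host and port of the page and replaces the
    path with /robots.txt (hypothetical helper, not a Google API).
    """
    parts = urlsplit(page_url)
    # netloc already contains the port when one is given
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page in a subdirectory is still governed by the root file
print(robots_txt_url("http://example.com/pages/about.html"))
# http://example.com/robots.txt

# Assumed subdomain and non-standard port, each with its own file
print(robots_txt_url("https://blog.example.com/2012/05/post.html"))
# https://blog.example.com/robots.txt
print(robots_txt_url("http://example.com:8181/shop/item?id=3"))
# http://example.com:8181/robots.txt
```

Notice that nothing ever resolves to /pages/robots.txt, which is why a file placed in a subdirectory is simply never read.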

What should you place in a Robots.txt file?

The main purpose of the robots.txt file is to keep search engine crawlers away from content that you do not want them to find. If you do not want to hide anything, your robots.txt file would look like this.

User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml

The Sitemap line is placed in the robots.txt file so the search engine can easily find your sitemap. If you have something that you would like to hide from the search engines, like a calendar or junk section, that would look like this.

User-agent: *
Disallow: /calendar/
Disallow: /junk/
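
If you want to double-check rules like these before uploading them, Python's standard library ships a robots.txt parser. The sketch below feeds it the two rule sets from this section and asks which URLs a crawler may fetch. The www.example.com URLs and the is_allowed helper are assumptions for illustration, and this parser is not Google's own, so treat the result as a sanity check rather than a guarantee of how Googlebot behaves.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets shown above, as plain text
ALLOW_EVERYTHING = """\
User-agent: *
Disallow:
"""

HIDE_SECTIONS = """\
User-agent: *
Disallow: /calendar/
Disallow: /junk/
"""


def is_allowed(rules: str, url: str, agent: str = "Googlebot") -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# An empty Disallow line blocks nothing
print(is_allowed(ALLOW_EVERYTHING, "http://www.example.com/calendar/may"))  # True

# With the second file, the two sections are off limits
print(is_allowed(HIDE_SECTIONS, "http://www.example.com/calendar/may"))  # False
print(is_allowed(HIDE_SECTIONS, "http://www.example.com/junk/old.html"))  # False
print(is_allowed(HIDE_SECTIONS, "http://www.example.com/about.html"))  # True
```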

Robots Meta Tags

Now, if you would like to use a robots meta tag to keep a page out of the index, such as a privacy policy that there is no reason to index, you can go ahead and add this to the <head> section at the top of the HTML page.

<!DOCTYPE html>
<html><head>
<meta name="robots" content="noindex" />

This particular meta tag (noindex) will block the Google search engine from indexing the page. Keep in mind that Google has to be able to crawl the page in order to see the tag, so do not also disallow the page in robots.txt. If the tag cannot be read and many links point to the page, a rare situation for a poor piece of content, the page may still get indexed.
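
A quick way to confirm that a page really carries the directive is to parse its HTML and look for a robots meta tag. This is only a rough sketch using Python's built-in HTML parser; the class and function names (RobotsMetaParser, has_noindex) are invented here, and it assumes the directive sits in the tag's content attribute.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # The content attribute may hold a comma-separated list
            for directive in content.split(","):
                if directive.strip():
                    self.directives.append(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True when the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<!DOCTYPE html><html><head><meta name="robots" content="noindex" /></head></html>'
print(has_noindex(page))  # True
print(has_noindex("<html><head><title>Hi</title></head></html>"))  # False
```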

Non-HTML Content

If you would like to block non-HTML content, such as a PDF or an image, you cannot use a noindex meta tag, because there is no HTML <head> section to put it in. Instead, you’re going to need to use an X-Robots-Tag HTTP header. The X-Robots-Tag is sent along with the other HTTP response headers, which would look something like this.

$ curl -I "http://www.google.com/support/forum/p/Webmasters/search?hl=en&q=test"
HTTP/1.1 200 OK
X-Robots-Tag: noindex
Content-Type: text/html; charset=UTF-8
(…)
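
How you actually send that header depends on your web server, and most sites would set it in the server configuration. Purely as an illustration, here is a small Python sketch built on the standard library's file server that adds X-Robots-Tag: noindex to responses for certain file types. The extension list, the wants_noindex helper and the NoIndexHandler class are all assumptions made up for this example.

```python
from http.server import SimpleHTTPRequestHandler
from urllib.parse import urlparse

# Hypothetical choice of file types we do not want indexed
NOINDEX_EXTENSIONS = (".pdf", ".doc", ".xls")


def wants_noindex(request_path: str) -> bool:
    """Return True when the requested file should carry X-Robots-Tag."""
    path = urlparse(request_path).path.lower()  # drop any query string
    return path.endswith(NOINDEX_EXTENSIONS)


class NoIndexHandler(SimpleHTTPRequestHandler):
    """Static file handler that marks selected file types as noindex."""

    def end_headers(self):
        # Add the header just before the header block is finished
        if wants_noindex(self.path):
            self.send_header("X-Robots-Tag", "noindex")
        super().end_headers()


print(wants_noindex("/downloads/report.pdf"))  # True
print(wants_noindex("/index.html"))  # False
```

To try it locally you could pass NoIndexHandler to http.server.HTTPServer and then run curl -I against a PDF in the served folder to see the extra header.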

Summary
These are just some basic ways to get started with robots.txt. In most cases, websites do not need to hide anything from search engines. In fact, if you have content, why not let it get indexed, right? Generally, you need plenty of content for search engine marketing. Even so, sometimes certain things should be kept under wraps. If you have questions or comments about robots.txt, robots meta tags, the X-Robots-Tag header, or Google crawlers, ask below!

About Garry Grant

Garry Grant is a veteran expert in search engine optimization and the digital marketing industry. With nearly 20 years of experience, Garry has successfully built a multi-service operation at SEO, Inc., developing proprietary technologies through complex strategic solutions. He has extensive experience in key initiatives and operational responsibilities grounded in information technology and performance management.

Garry’s expertise and esteemed reputation, coupled with SEO Inc.’s impressive client success record, have earned him such accolades as Entrepreneur Magazine's 2005 Hot List for the Hottest Internet Property, Inc. 500 2007 Honorary award for Fastest Growing Private Company in America, an Inc. 500 top 50 Company in San Diego, and interviews with The New York Times, The Wall Street Journal, WIRED, Entrepreneur and The Huffington Post.

Garry Grant began his online career in 1993 creating strategic Web and e-business solutions for Homepage.com, The Rush Limbaugh Show, Premiere Radio Networks, Clear Channel Communications, EarthLink and Artisan Motion Pictures. Today, Garry and SEO Inc.’s highly skilled digital strategists develop proprietary technology and strategic digital marketing direction for Fortune 500 companies including SC Johnson, McAfee, Entrepreneur.com, Inc. Magazine, IGN, Tacorri, LPL Financial, National Kidney Foundation, G4 TV, Fuel TV and Sony, just to name a few.