Robots.txt files are one of the ways an SEO company may attempt to prevent certain pages from being indexed. For example, someone who would like to disallow the pages directory would insert the following into their robots.txt file.
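Here is a minimal sketch of such a rule, assuming the directory lives at /pages/ and that the rule should apply to every crawler (both are assumptions for illustration, as is the example.com domain). The rule is embedded as a string and checked with Python's standard urllib.robotparser module, so you can see which URLs it would block.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the /pages/ path and the wildcard
# user agent are illustrative choices, not taken from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pages/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A URL inside the disallowed directory, and one outside it.
blocked = is_crawlable(ROBOTS_TXT, "https://example.com/pages/about.html")
allowed = is_crawlable(ROBOTS_TXT, "https://example.com/index.html")
print("pages/about.html crawlable:", blocked)
print("index.html crawlable:", allowed)
```

With this rule in place, the URL under /pages/ should be reported as not crawlable, while the home page stays open to crawlers.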
While this is one of the correct uses of the file, there are some things you should be aware of regarding robots.txt from an SEO perspective.
Losing Link Juice
One of the most important ranking factors in SEO, if not the most important, is link popularity. Essentially, when you implement a Disallow through robots.txt on a section or page of your website, you hurt the potential for that portion of your website to transfer link authority to other areas of your site.
For example, say you disallow your login page. Maybe you figure it does not target specific keywords for SEO, so there is no reason to have it in the index. I am not saying I agree with this, but it could be your train of thought. The weight from external links pointing at that page will not be funneled through the text links on the page, because Google does not crawl the text on the disallowed page. Here is a quote from Google Webmaster Central on the subject.
“While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.”
Here we see that Google will not crawl the content on the page but may still index the URL. We will talk about the URL and offsite content regarding the page being indexed in a moment. But first, let’s quickly wrap up this point on losing link juice. The main point is that when robots.txt is used to block a page on your site, external links pointing at the disallowed page cannot easily transfer their authority to other areas of your website, because robots.txt stops Google from crawling the internal links on that page.
Pages May Still Appear
Now back to our point on pages staying in the index although they have been blocked. According to the Google Code FAQ page on robots.txt, pages may still appear even though they are disallowed in a robots.txt file.
“Blocking Google from crawling a page is likely to decrease that page’s ranking or cause it to drop out altogether over time. It may also reduce the amount of detail provided to users in the text below the search result. This is because without the page’s content, the search engine has much less information to work with.”
Google goes on to expand on this idea, stating:
“However, robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant. If you wish to explicitly block a page from being indexed, you should instead use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, you should not disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed.”
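The interaction Google describes can be sketched as a small check: a noindex only counts if the crawler is allowed to fetch the page in the first place. The sketch below is a rough illustration using Python's standard library, not a model of how Google actually decides. The names RobotsMetaFinder and indexing_verdict are hypothetical, and it ignores crawler-specific variants of the header and meta tag.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def indexing_verdict(robots_txt, url, html, headers, user_agent="Googlebot"):
    """Hypothetical helper: describe what a crawler could learn about url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        # The page is never fetched, so a noindex on it is never seen.
        return "not crawled: noindex is never seen, URL may still be indexed"

    finder = RobotsMetaFinder()
    finder.feed(html)

    # Header names are case-insensitive, so normalize the keys first.
    lowered = {key.lower(): value for key, value in headers.items()}
    header_value = lowered.get("x-robots-tag", "")
    header_directives = {
        d.strip().lower() for d in header_value.split(",") if d.strip()
    }

    if "noindex" in finder.directives | header_directives:
        return "crawled: noindex seen, page kept out of the index"
    return "crawled: indexable"
```

For a login page that carries a noindex meta tag but is also disallowed in robots.txt, the helper reports the "not crawled" case, which is exactly the conflict the quote warns about.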
Using the noindex Meta Header
As Google states in the quote above, in some cases there is a better option for keeping pages out of the index than the robots.txt file, namely one of the following:
NoIndex Meta Tag
X-Robots-Tag HTTP Header
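To make the two options concrete, here is a minimal sketch of a page served with both of them at once, written as a small WSGI application using Python's standard wsgiref module. The page content, the app and serve names, and port 8000 are all assumptions for illustration. In practice either signal on its own should be enough, and both are shown together only for comparison.

```python
from wsgiref.simple_server import make_server

# Hypothetical login page carrying the noindex robots meta tag.
PAGE = b"""<!doctype html>
<html>
<head>
<meta name="robots" content="noindex">
<title>Login</title>
</head>
<body>Login form goes here.</body>
</html>
"""


def app(environ, start_response):
    """WSGI app that marks its response as noindex in both ways."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(PAGE))),
        # The HTTP header equivalent of the meta tag in PAGE.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [PAGE]


def serve(port=8000):
    """Serve the page locally; the port is an arbitrary choice."""
    with make_server("127.0.0.1", port, app) as server:
        server.serve_forever()
```

Calling serve() and requesting http://127.0.0.1:8000/ should return the page with the X-Robots-Tag header set. The header is the only choice for files that cannot hold a meta tag, such as PDFs or images.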
Let’s Talk About the NoIndex Meta Tag First
According to Google, “By default, Googlebot will index a page and follow links to it. So there’s no need to tag pages with content values of INDEX or FOLLOW.”
Many SEO companies will recommend the following Meta Tag:
<meta name="robots" content="noindex, follow">
While it won’t hurt you to implement this, the follow value is not needed, as Google will follow links if the following Meta Tag is implemented.