If you work in SEO, this type of message will be familiar to you. For everyone else, here is an explanation: robots.txt, the file behind the Robots Exclusion Protocol, is saved at the root of your domain and is used to control web robots. Search engines and web-based software send these robots (also known as spiders, bots or crawlers) across the web to crawl websites and gather information about them. Some are good and useful, some are bad and annoying. The robots.txt file can be used to “disallow” access to spiders you don’t want visiting your site. More interestingly, it can also be used to forbid crawling of specific parts of your site. However, this is a public file, so under no circumstances should you put anything sensitive in it that could affect the security of your site. As an example, this is how you can access mine and this is what you will find in it:
User-agent: *
Disallow: /wp-admin/
Sitemap: https://www.abandonguild.com/sitemap_index.xml
As you can see, I’ve disallowed the WordPress admin back-end area. Some will say it’s not a great thing to do, but I don’t think it’s a bad call. The same goes for the Sitemap…
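If you want to check programmatically what a robots.txt file allows, Python’s standard library ships a parser. Here is a minimal sketch using the rules shown above (inlined rather than fetched, so the example is self-contained):

```python
from urllib.robotparser import RobotFileParser

# Load the same rules as the robots.txt above.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
])

# Any rule-abiding crawler may fetch the homepage...
print(rp.can_fetch("*", "https://www.abandonguild.com/"))           # True
# ...but not the admin back-end area.
print(rp.can_fetch("*", "https://www.abandonguild.com/wp-admin/"))  # False
```

Note that this only tells you what well-behaved crawlers *should* do; nothing in the protocol physically stops a bad bot from ignoring the file.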
How is robots.txt affecting me?
In my previous blog post about “a tale of over-optimisation“, I mentioned I was not even ranking in Google for a nonsensical word, rhinalnesc. It later turned out that the page was not even crawled by the bots, and the following message appeared in the SERP:
A description for this result is not available because of this site’s robots.txt – Learn more.
Initially, my theory was that I had inadvertently stuffed my page with the keyword, and that was why Google’s bots were no longer looking at it. But why would the message above appear and remain there for a whole week?
Well, I have another theory: I was sending conflicting signals to crawlers. Here is my logic:
- When I published the rhinalnesc challenge page, I set a noindex, follow meta robots tag while the robots.txt still contained a “disallow all” directive. I just wanted to check the page and test a few things such as speed.
- The meta robots noindex, follow doesn’t forbid robots from crawling a page; it just tells them not to index it. Unfortunately, I was sending conflicting messages: on one hand I was telling the bots “don’t come to my site” (robots.txt), but on the other hand it was “check this new content without indexing it please” (meta robots noindex, follow).
- So I was playing yoyo with the bots. Worse, the page was left published with these conflicting signals for almost a week prior to 29th February.
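To make the conflict concrete, here is roughly what the two signals looked like at the time (reconstructed for illustration, not copied from the live files):

```
# robots.txt: "stay out of the whole site"
User-agent: *
Disallow: /

<!-- meta robots tag on the page: "crawl me, just don't index me" -->
<meta name="robots" content="noindex, follow">
```

The catch is that a crawler blocked by robots.txt never fetches the page at all, so it never even sees the noindex instruction in the page’s HTML.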
Then, on 29th February, I opened up my site to search engine crawlers by removing the robots.txt disallow directive and the meta robots noindex, follow tag. I also submitted the page to Google’s Fetch and Render. It worked and my page was indexed… only for a day! Even though robots were finally allowed to roam freely on my page, the keyword stuffing must have kicked in, sending red alerts. The page was taken off the index on the second day. On the third day, I finished the de-optimisation and resubmitted the page, and even the sitemap. That’s when the robots.txt message started to appear.
I think Google didn’t like my over-optimised page in the first place, so it didn’t index it. But when I resubmitted it, I guess it needed to reassess the page before drawing any conclusion. So instead of serving the correct metadata in the SERP, it served an old cached version of it, found prior to 29th February. Because the site is still in its infancy and the crawl budget is very low, nothing happened for a whole week. And then finally, the page was indexed and shot almost straight to the top!
So here are my takeaways:

- don’t publish any content, even under a robots.txt “disallow all” directive, when you don’t want that content to be indexed
- don’t use noindex, follow and the robots.txt “disallow all” directive together as a temporary fix. In my case, I should have kept my page as a “private” post in WordPress instead.
Finally, take a look at the SlideShare showing the Google SERP day by day for the query rhinalnesc, from 29th February to 12th March: