According to Google, each protocol (here we’re talking about HTTP and HTTPS) should have its own robots.txt file. This makes sense, especially since Google treats https://www.yoursite.com/ and http://www.yoursite.com/ as different websites - which they’re probably not. Duplication is a bad thing.

I’ve recently noticed quite a few forum posts and blogs telling you how to redirect all HTTPS pages to HTTP using fancy regular expressions.
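The redirect approach they describe usually looks something like this - just a rough sketch, assuming Apache with mod_rewrite enabled, so adjust to suit your own setup:


# catch any request that arrives over HTTPS...
RewriteEngine On
RewriteCond %{HTTPS} =on
# ...and 301 it back to the same path over plain HTTP
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1 [R=301,L]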

Now don’t get me wrong, about 60% of my day is spent writing regex to do the other 40% of my work for me. But there’s a far simpler way: why not just disallow Google from spidering everything on HTTPS?

Sounds simple, but most of the time you won’t have separate file structures for the two protocols - both are usually served from the same document root. If that’s your problem too, then use this handy snippet to serve a different robots.txt when the request is made over HTTPS.


# serve a different robots.txt for https
RewriteEngine On
# match requests on the SSL port, or where mod_ssl reports HTTPS as on
RewriteCond %{SERVER_PORT} ^443$ [OR]
RewriteCond %{HTTPS} =on
# internally rewrite robots.txt to the SSL-specific version
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

Now just create robots_ssl.txt and pop what Google suggests into it:


User-agent: *
Disallow: /
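
To check it’s working, request robots.txt over both protocols (swapping in your own domain, of course) and make sure you get different files back:


# should return your normal robots.txt
curl http://www.yoursite.com/robots.txt

# should return the contents of robots_ssl.txt
curl https://www.yoursite.com/robots.txt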

No more duplication. Hoorah!

[Edit: thanks to Aahan for pointing out a typo in original post :)]
