Robots.txt WordPress Plugin

This is another one of those handy plugins designed for people like myself, who just want to be able to set something up and then not worry about it again.

What the plugin does

You probably know that when a search engine spider visits your site, one of the first things it does is look for a file called robots.txt which tells it which files and folders it can go and look at. By default, WordPress lets every robot go everywhere. That might be ok for some people, but I prefer to exercise a bit more control over things.

For example, if the robot identifies itself as a bad bot – yes, some of them do – then I don’t want it to go anywhere. All it’s probably going to do is trawl for email addresses to add to a spam list somewhere. And I don’t really want any robots poking their noses into such places as the WordPress admin folder. Control freak? Me? Don’t know what you mean…

The solution is to add the names of bad bots to your robots.txt file and disallow them from going anywhere, and add the names of common search engine spiders and specify which locations or files they are allowed to visit.

This plugin will do that in a completely hands-free way by setting up a virtual robots.txt file for your blog as soon as it's activated. Whenever a request for a robots.txt file comes in, WordPress will display the contents of your virtual robots.txt file. No physical file is created on your site, but one is shown to the search engine bot.

By default, your virtual robots.txt file will have Google’s Mediabot allowed, a bunch of spam-bots disallowed, and a few of the standard WordPress folders and files disallowed. The default collection of bad bots is borrowed from http://www.clickability.co.uk/robotstxt.html.
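
If you're curious what that looks like, the rules follow the standard robots.txt pattern – a User-agent line naming a bot, followed by one or more Disallow lines. The snippet below is only an illustration of the pattern, built from entries mentioned elsewhere on this page; your own copy may differ:

    User-agent: Googlebot
    Disallow:

    User-agent: Telesoft
    Disallow: /

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /wp-includes/

An empty Disallow means that bot may crawl everything, while a bare "Disallow: /" shuts a bot out of the whole site.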

OK, even though it's completely automated and hands-free, I admit there are times when I want to tweak what's contained in the virtual robots.txt file. There's now a handy options page that lets you edit the contents.

Oh yeah, and if you mess up your robots.txt file, you can just deactivate and reactivate the plugin and it will revert to the default list of rules.

Also, if the plugin detects an existing sitemap.xml file (or if you are using my XML Sitemap plugin) it will add a reference to your sitemap.xml to the end of the robots.txt file. I’m told this helps with the discovery of your sitemap.xml and indexing of your pages. That’s got to be a good idea.
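
The reference is simply the standard Sitemap line tacked onto the end of the file, along the lines of the following (with your own blog's address, of course):

    Sitemap: http://example.com/sitemap.xml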

How to use the plugin

With the plugin now hosted on WordPress.org, the easiest way to install this baby is to visit your blog admin pages, click the Plugins menu, and then click Add New. In the search box, type something like "robots.txt", and with a bit of luck you should see PC Robots.txt in the list that appears. To the right of it you'll see a link to install the plugin. Click that.

If you happen to be using the version that was hosted on this site, please delete it and install a new version using the instructions above. That way you'll always have the latest version and you'll get notified of updates and such by WordPress.

The official download page is at http://wordpress.org/extend/plugins/pc-robotstxt/
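
If you prefer the command line and have WP-CLI on your server, something like this should do the same job (assuming the plugin slug matches the download URL above):

    wp plugin install pc-robotstxt --activate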

And please do give me a shout to let me know if it works for you or not :-)

Comments

  1. This is a truly elegant solution to a total PITA.

    Top job!

  2. Does a robots.txt file help a blog get re-indexed in Google, if you’ve been kicked out?

    My site was delisted from Google, I suspect for syndicating headlines of everyone else's content with a link back to the full text article. Do you know how I can fix this?

    Thanks kindly,
    Shawn

  3. Can this be used in conjunction with the google xml sitemaps plugin?

  4. Hi Mark. I can’t think of any reason why the robots.txt plugin shouldn’t be used with the google xml sitemaps plugin.

  5. Hey Peter,

    GREAT PLUGIN! It's the only robots.txt plugin that can properly be automated in WordPress. I'm impressed. Thanks very much. Works as advertised.

    I’ll be reviewing your plugin on my blog, if you like, http://www.usingwp.com.

  6. I’m getting a 404 File Not Found error when I preview my robots.txt file. It’s pointing to http://www.pat-phillips-homes.com/robots.txt. Is it because robots.txt is a virtual file and doesn’t actually exist in the root directory, and so I’ll get the error?

    I am also using the XML Sitemap Generator for WordPress 3.1.3 plugin and have UNCHECKED the setting: "Add sitemap URL to the virtual robots.txt file. The virtual robots.txt generated by WordPress is used. A real robots.txt file must NOT exist in the blog directory!" Should it be checked? I thought it might conflict with your plugin.

    • @Pat – It looks like there’s something odd going on somewhere – your 404 message has a strange path for the robots.txt file it couldn’t find. I’m not sure what would cause that, but maybe you could try deactivating your sitemap generator plugin and previewing the robots.txt file again?

  7. Well I tried deactivating Google XML Sitemap and changed my permalinks setting back to default instead of structured, and neither action made any difference. Still getting the 404 error: /static/pat-phillips-homes.com//robots.txt' was not found on this server. Should I go ahead and create a robots.txt file, copy the code that your plugin generated, paste it into that file, and upload it to my server? Then delete the two plugins, since I'd have a physical robots.txt file which I can point Google Webmaster Tools to?

  8. Hey, what does it mean by "Disallow: /" on a bunch of things? Do I need to edit the file?

    User-agent: Telesoft
    Disallow: /

    User-agent: The Intraformant
    Disallow: /

    ..etc.

  9. I tried a validator and it says I have a lot of errors and warnings. I noticed some names appeared more than once – webextractor, for instance, was in there twice. Any idea what else I need to change, or can I leave it?

    • Hi Amanda. Yes, there were a couple of duplicate entries in there – don't know how I missed those.. I have removed them and updated the plugin files on wordpress.org, so if you re-install the plugin you should be good to go. I also ran it through a validator and it comes up clean. Thanks for letting me know about the errors.

      The “Disallow: /” line means disallow access to the complete domain for whichever “User-agent” is listed above it. Unless you want to make specific exceptions you shouldn’t need to edit anything.

  10. I have the same problem as Pat. When I check the page http://www.pawel-trzepiota.pl/robots.txt there is a 404 message. But if I check http://www.pawel-trzepiota.pl/index.php/robots.txt the plugin generates the file.

    Now I wonder if I should add a rewrite clause for robots.txt… but the way you describe it, it should work without one.

  11. @mac13 – that’s interesting, I notice all your links begin with /index.php/ – do you have that as part of your permalink structure?

  12. Self-response – I've added an ErrorDocument for index.php and it works…

  13. Well, it was. Now I've changed the permalinks to work without index.php, and with the ErrorDocument 404 pointed to index.php it works.

  14. Hello, I just want to be sure of something before I use this plugin. Since SEO is very important for my site and I know very little about the WP folder/file structure, can you just reassure me that your list in PC Robots.txt won't block any pages/posts that I create?

    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /wp-content/plugins/
    Disallow: /wp-content/cache/
    Disallow: /wp-content/themes/
    Disallow: /wp-login.php
    Disallow: /wp-register.php

  15. @Marko – good for you for asking! Everything in that list can safely be blocked without it affecting your posts and pages.

  16. Thank you Peter :)

    I suppose since my posts and pages are dynamic those can't be blocked using robots.txt? The only way would be to use the robots meta element on them?

    And thank you for the plugin very much!

  17. @Marko – You’re welcome! You can block your posts and pages by adding the post or page URL to the robots.txt file with a Disallow instruction, but as you mention I prefer to add a meta tag..

  18. Hello Peter, me again :)

    I just saw in the WordPress Version 2.8 release notes (http://codex.wordpress.org/Version_2.8) that the login and registration pages are made noindex by default. Maybe now you would like to remove those locations from your PC Robots.txt. I know it can't hurt to leave them there, but I just wanted to give you a heads-up.

    Cheers!

  19. @Marko – thanks for that, but I will leave those lines in there for the people not yet using 2.8..

  20. One question – it's not particularly related to your plugin, but here it is. I have 2 sites: one in the server root and the other in a subfolder:
    1. /
    2. /subfolder/
    Both have their own URLs which can be accessed over HTTP. I know that when a robot comes to index the site in the root, it will index the other site in the subfolder too, as part of it. But I want to prevent the site in the subfolder from being indexed as part of the first site, because they are not related at all.

    My idea:
    I want to put a robots.txt in the root to forbid that subfolder and thus prevent the second site from being indexed with the first. But I also want to put a robots.txt in the subfolder and allow indexing there. I think this is possible since the second site has its own HTTP URL.

    Will this work? Thank you and sorry for boring you. :)

  21. Thank you, thank you, thank you… This is a plugin that everyone who cares about SEO on their WP blogs should install as a default.

    In fact, why doesn’t the privacy page of WP already offer this vitally-important functionality? Do WP admins/coders not care about SEO at all?

    Thanks again,
    Saul

  22. @marko — I'm just guessing here, but I think you might be able to do it. Robots only look for robots.txt files in the root of a domain, so if you disallow site 2 (/subfolder/) in the robots.txt file for site 1, it shouldn't be indexed as part of site 1. If you then allow site 2 to be indexed using its own robots.txt file (in /subfolder/) it should be indexed as a separate domain. Well, that's the theory at least.. you can run a quick check in Google Webmaster Tools to see if it works..
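
    Just to spell the idea out with made-up addresses: the robots.txt served at the root of site 1 (say http://site-one.example/robots.txt) would contain

    User-agent: *
    Disallow: /subfolder/

    while site 2, reached at its own address (say http://site-two.example/), would serve its own robots.txt from its own root with

    User-agent: *
    Disallow: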

  23. @Saul — thank you for such a great comment – very much appreciated!

    Regards,
    Peter.

  24. Thanks very much for your opinion, Peter – it does work!

  25. I noticed you automatically included this in your plugin’s virtual robots.txt:
    User-agent: Googlebot
    Disallow:

    Wouldn’t that prevent Google from indexing one’s site… and hence result in a decrease in search engine traffic?

  26. @Fruity – Whatever comes after the Disallow statement is disallowed.. so if there’s nothing after it, then nothing is disallowed, or in other words everything is allowed. I know it sounds backwards but it is in fact the proper way of allowing full access…
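
    Side by side it looks like this (the # lines are just comments, which robots.txt allows):

    # nothing listed after Disallow, so nothing is disallowed – full access
    User-agent: Googlebot
    Disallow:

    # a bare slash disallows the whole site for this bot
    User-agent: Telesoft
    Disallow: /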

  27. I just installed this plugin and it works great. I use the “Google XML Sitemaps” plugin by Arne Brachhold, and this creates a sitemap.xml and a sitemap.xml.gz automatically.
    So is it possible to extend your check to both files and add the zipped version to the virtual robots.txt, if such a file exists?

  28. @Henning — Glad the plugin is working great. Yes, good idea about the two types of sitemap. Look for an update shortly…

  29. Wow … fast reaction! Thanks for the update.

  30. Peter, when I installed this I looked at the default values and saw that the Google bots were under Disallow. You might want to check the default values in the version over at WordPress.org. I corrected the values to Allow.

    Is Allow necessary if the default is to allow all agents, with Disallow used to block the bots you don’t want?

  31. @Doug – take a look at http://www.robotstxt.org/orig.html#format for the official format. You can use an Allow statement, but it isn’t universally supported.. and you’re right that it wouldn’t be necessary if all pages are allowed by default.
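
    In other words, a crawler that understands Allow would treat

    User-agent: Googlebot
    Allow: /

    the same as the empty Disallow the plugin already uses, but older parsers may simply ignore the Allow line, so the empty Disallow form is the safer choice.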

  32. Great plugin – superbly done. Thank you.

    Not included in your list – I just noted this one mentioned on an info site… not sure whether I ought to add it to the list…

    User-agent: ia_archiver-web.archive.org
    Disallow: /

  33. Hello Peter. Thanks for this great plugin. What would novices like me do without gurus like yourself?!

    My blog is in a subdirectory like Mark’s above. Does it mean I need to make a subdomain like http://subdomain.domain.tld to allow bots to access my WP blog separately from the rest of the site?

    I don’t see how the method you recommended above can work otherwise.

  34. Is there a changelog? If not, there ought to be. Some people actually like to know what changed ;)

  35. Hi, once I install the Robots.txt plugin, do I need to make any changes to the settings? I’m afraid I don’t quite understand what to do once it’s installed!

  36. Wow! Thanks for the tip. I have been looking for an easy way to do robots.txt all day and this is perfect for someone like me who is not so tech savvy.

  37. The plugin is not updating my file. When I preview the robots.txt file it still says
    User-agent: *
    Disallow: /

    Do I need to reactivate the plugin or will it change after a period of time?

  38. @Ciuly – when I update the plugin using the WordPress SVN it asks for a note to go with the update, which I always provide. I confess I thought those notes appeared on WordPress.org.. ah well, I will put one on this page too.. thanks for letting me know..

  39. @Amy – if you don’t understand what the plugin does then you probably don’t need it.. wordpress will work perfectly well without it. If you still want to use it anyway, there’s no need to change any of the settings, just activate it and you’re good to go.

  40. @chris – it looks like you might have your blog privacy settings set to block the search engines. The plugin will only change your robots.txt if your blog is public..

  41. Peter – you were right. It was my privacy settings blocking the search engines. It’s all updated. Thank you.

  42. How do I change my virtual robots.txt on my WordPress blog to allow Googlebot to crawl my site? I realized under the privacy settings that my blog was not visible to search engines (why is that the default?), so I changed it to be visible, but now my virtual robots.txt reads as:
    User-agent: *
    Disallow:
    How do I make it allow? I know I’m not supposed to create a real robots.txt file?

  43. @Pet Society – Your robots.txt file is fine. Notice there is nothing after the Disallow: bit, that means nothing is disallowed, or in other words, everything is allowed..

  44. Oh OK, I was wondering if that made a difference, glad it does. Thank you for your help, great blog. I’ll be sure to keep coming back for my WordPress help and info.

  45. Peter, thanks for the easy-to-use and easy-to-follow plug-in! Quick question: since Bing, Yahoo, Ask and other “safe” robots aren’t listed, does that mean they automatically have access to crawl the site? I assume it does, just want to make sure. Thanks!

  46. @Greg – Yes, you’ve got it right.. and thank you for the kind words..

  47. I installed your plugin and I am seeing the following, among other rules, when I look at the robots.txt file:

    User-agent: Googlebot
    Disallow:

    Does this mean Google is disallowed? If so, what do I do to allow Google, MSN, Yahoo and small meta search engines? Thanks.

  48. @Peter – please read the previous comments on this page – that question has been answered a few times already :-)

  49. Hi

    I am running the latest WP with Google Sitemap. Like the post from last year, I get “webpage could not be found” when I click on preview my file.
    I am using the default permalinks – does that cause a problem? I deactivated the plugin, then reactivated it.

    Any tips would be most appreciated.

    Cheers

  50. Hi Peter, I have installed your plugin, but it doesn’t work – I only get 404 Not Found.
    Oh, I use WP 2.8… :(

  51. I get the 404 error that a couple of others have mentioned, but I can’t seem to resolve the issue. Regardless of whether I manually put a robots.txt in my root directory (pagsa.missouri.edu/robots.txt) or remove it, this plugin does not save changes to the file. I am using the “Google XML Sitemaps” plugin, and have tried using it with and without checking “Add sitemap URL to the virtual robots.txt file”.

  52. Installed easily, but when I click on –
    “You can preview your robots.txt file”
    (under “PC Robots.txt Settings”)
    I get –
    “Page Cannot be Found, Please try Search or Return to HomePage”.
    Any advice greatly appreciated. (Concerned an improper install might hurt with SEs…)
    Thanks for the plugin! – Ken


  53. Peter, Google’s info on robots.txt says:

    If you want search engines to index everything in your site, you don’t need a robots.txt file (not even an empty one).

    So, if you’re activating the plug-in without making any amendments or blocking any pages, will it harm your site in any way from Google’s point of view?

  54. You need custom permalinks for the plugin to work. When you have custom permalinks enabled, WordPress will process requests for non-existent files (including robots.txt), but if you don’t have custom permalinks enabled, WordPress will never even see the request.. you will just get a 404 Not Found error..
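
    On Apache that’s what the standard WordPress rewrite rules in .htaccess take care of – they hand any request for a file that doesn’t physically exist (robots.txt included) over to WordPress. They look something like this:

    RewriteEngine On
    RewriteBase /
    RewriteRule ^index\.php$ - [L]
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /index.php [L]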

  55. @Mark — if you don’t have a specific reason to use this plugin, i.e. to block spambots, then you don’t need to use it at all. But saying that, it won’t do any harm if you do use it..

  56. Hey, thanks Peter. I had tried manually uploading a robots.txt and put it in the wrong directory in cPanel. Deleted it and installed this plugin. Thanks :D

  57. I hoped that installing this plugin would correct all of the crawl errors. The issue has still not been resolved – I installed it two days ago, and as of yesterday I had yet another Google crawl error according to my Google Analytics account. Any suggestions? I didn’t change anything in the plugin, just activated it. I can’t get any response on the Google forum, and I am at a loss here. This may be a stupid question, but I will ask anyway – does this affect my AdSense account too? I mean, am I missing out on AdSense income because of crawl errors?

  58. @Mickey – I just checked your robots.txt file and it looks fine so I’m not sure what the problem might be.. do you get any specific errors?

  59. Hey Peter.

    Installed your plugin, all good. I just need to make sure, as this is the first time I am using your plugin…

    Prior to installing your plugin, I had gone into cPanel > File Manager and manually copied and pasted some *disallow* rules… in other words, I have a physical robots.txt file in place on my blog…

    Now that I have your plugin installed, am I to understand that I should delete that code and totally rely on the virtual robots.txt file?

    BTW, if it is *easily* explained, why should one *not* have a physical robots.txt file, out of curiosity?! I mean, if one did not use your plugin, could one not insert the code manually, as I have done, and just leave it? After all, it’s doing its job?

    cheers for the clarification,

    phil

  60. @phil — no reason why you can’t use a physical robots.txt file instead.. yes, you could just take the code from the plugin and use it manually. The plugin is just designed to make things as easy as possible..

  61. Thanks, I was looking for this one. I’m using the sitemap plugin, which makes use of the virtual robots.txt, but I would like to add my own custom stuff.

  62. Tnx for the plugin, it’s simply great. One tiny little question tho:
    Can I safely put Asimov’s rules at the top of the document, or does that throw an error which stops parsing?

    Disallow: /harming/humans
    Disallow: /ignoring/human/orders
    Disallow: /harm/to/self

  63. @Jenski – LOL.. I like it..

  64. Peter, I have your plugin installed, but when I go to Webmaster Tools crawler access it says the robots.txt file has Status 404 (not found). In my WordPress admin I do have the privacy setting set to “I would like my site to be visible to everyone, including search engines”.
    When I visit the PC Robots.txt plugin and click preview your robots.txt file, it says “Not Found – The requested URL /robots.txt was not found on this server.” I’m not sure what to do next. Please advise. Thanks.

  65. @Greg, make sure you’ve got your permalinks set to something other than the default.. otherwise wordpress won’t see the requests for the robots.txt..

  66. Awesome plug-in! That’s what we needed for our websites. Thanks a lot Peter!

Trackbacks

  1. [...] a cruise over to and check out his plugin, Robots.txt WordPress plugin. Now, this is a simple but slick plugin. And does it work [...]

  2. [...] Robots.txt Tell search engines where they can and cannot crawl. This is an important file that many people forget about, but this plugin will help you through that. [...]

  3. [...] PC robots.txt – WP Directory Download • Plugin Homepage Automatically creates a robots.txt file for your site to block spam bots and allow Google crawler [...]

  4. [...] more information, to offer feedback, to ask questions or to submit bugs please visit http://petercoughlin.com/robotstxt-wordpress-plugin/ [...]
