Robots.txt WordPress Plugin

This is another one of those handy plugins designed for people like myself, who just want to be able to set something up and then not worry about it again.

What the plugin does

You probably know that when a search engine spider visits your site, one of the first things it does is look for a file called robots.txt which tells it which files and folders it can go and look at. By default, WordPress lets every robot go everywhere. That might be ok for some people, but I prefer to exercise a bit more control over things.

For example, if the robot identifies itself as a bad bot – yes, some of them do – then I don’t want it to go anywhere. All it’s probably going to do is trawl for email addresses to add to a spam list somewhere. And I don’t really want any robots poking their noses into such places as the WordPress admin folder. Control freak? Me? Don’t know what you mean…

The solution is to add the names of bad bots to your robots.txt file and disallow them from going anywhere, and add the names of common search engine spiders and specify which locations or files they are allowed to visit.

This plugin will do that in a completely hands-free way by setting up a virtual robots.txt file for your blog as soon as it’s activated. Whenever a request for robots.txt comes in, WordPress serves the contents of your virtual robots.txt file. No physical file is created on your site, but search engine bots see one all the same.

By default, your virtual robots.txt file will have Google’s Mediabot allowed, a bunch of spam-bots disallowed, and a few of the standard WordPress folders and files disallowed. The default collection of bad bots is borrowed from http://www.clickability.co.uk/robotstxt.html.
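
To give a feel for what a file along those lines looks like, here is a small Python sketch that builds one and checks it with the standard library's urllib.robotparser. The entries are illustrative only: the two bad-bot names and the WordPress paths come from the comments below, `Mediapartners-Google` is assumed to be the user agent for Google’s Mediabot, and example.com is a placeholder. It is not the plugin's actual default list.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only, NOT the plugin's real defaults.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: Telesoft
Disallow: /

User-agent: The Intraformant
Disallow: /

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-login.php
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
SITE = "http://example.com"  # placeholder domain

# A listed bad bot is shut out of the whole site.
print(rules.can_fetch("Telesoft", SITE + "/"))                       # False
# Any other robot falls under the "*" block: admin folder is off limits...
print(rules.can_fetch("Googlebot", SITE + "/wp-admin/"))             # False
# ...but ordinary posts and pages are not.
print(rules.can_fetch("Googlebot", SITE + "/my-first-post/"))        # True
# The Mediabot entry has an empty Disallow, so it may go anywhere.
print(rules.can_fetch("Mediapartners-Google", SITE + "/wp-admin/"))  # True
```

Real crawlers may interpret edge cases differently from Python's parser, so treat this as a quick sanity check rather than a guarantee.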

Ok, even though it’s completely automated and hands free I admit there are times when I want to tweak what’s contained in the virtual robots.txt file. There’s now a handy options page which lets you edit the contents.

Oh yeah, and if you mess up your robots.txt file you can just deactivate and reactivate the plugin and it will revert to the default list of rules.

Also, if the plugin detects an existing sitemap.xml file (or if you are using my XML Sitemap plugin) it will add a reference to your sitemap.xml to the end of the robots.txt file. I’m told this helps with the discovery of your sitemap.xml and indexing of your pages. That’s got to be a good idea.
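
The sitemap step can be sketched roughly like this in Python. The function name `add_sitemap_line`, the folder layout and example.com are all made up for illustration, and this is not the plugin's own code; it just shows the idea of appending a `Sitemap:` line only when a sitemap.xml is actually there. The check at the end uses `site_maps()`, which needs Python 3.8 or later.

```python
import os
import tempfile
from urllib.robotparser import RobotFileParser


def add_sitemap_line(robots_txt, site_root, site_url):
    """Append a Sitemap line if sitemap.xml exists in site_root (hypothetical helper)."""
    if os.path.isfile(os.path.join(site_root, "sitemap.xml")):
        sitemap_url = site_url.rstrip("/") + "/sitemap.xml"
        robots_txt = robots_txt.rstrip("\n") + "\n\nSitemap: " + sitemap_url + "\n"
    return robots_txt


base = "User-agent: *\nDisallow: /wp-admin/\n"

with tempfile.TemporaryDirectory() as root:
    # No sitemap.xml yet: the rules come back unchanged.
    without_map = add_sitemap_line(base, root, "http://example.com")
    # Create an empty sitemap.xml and try again.
    open(os.path.join(root, "sitemap.xml"), "w").close()
    with_map = add_sitemap_line(base, root, "http://example.com")

parser = RobotFileParser()
parser.parse(with_map.splitlines())
print(parser.site_maps())  # ['http://example.com/sitemap.xml']
```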

How to use the plugin

With the plugin now being hosted on WordPress.org, the easiest way to install this baby is to visit your blog admin pages, click the Plugins menu, and then click the Add New menu. In the search box type something like “robots.txt” and with a bit of luck you should see PC Robots.txt in the list that appears. To the right of it you’ll see a link to Install the plugin. Click that.

If you happen to be using the version that was hosted on this site, please delete it and install a new version using the instructions above. That way you’ll always have the latest version and you’ll get notified of updates and such by WordPress.

The official download page is at http://wordpress.org/extend/plugins/pc-robotstxt/

And please do give me a shout to let me know if it works for you or not :-)

{ 68 comments }

Richard Brown September 5, 2008 at 12:15 pm

This is a truly elegant solution to a total PITA.

Top job!

Shawn September 5, 2008 at 12:18 pm

Does a robots.txt file help a blog get re-indexed in Google, if you’ve been kicked out?

My site was delisted from Google, I suspect for syndicating headlines of everyone else’s content with a link back to the full text article. Do you know how I can fix this?

Thanks kindly,
Shawn

Mark May 27, 2009 at 4:00 pm

Can this be used in conjunction with the google xml sitemaps plugin?

Peter June 1, 2009 at 5:01 am

Hi Mark. I can’t think of any reason why the robots.txt plugin shouldn’t be used with the google xml sitemaps plugin.

Frank June 12, 2009 at 10:51 pm

Hey Peter,

GREAT PLUGIN! It’s the only robots.txt that can properly be automated in wordpress. I’m impressed. Thanks very much. Works as advertised.

I’ll be reviewing your plugin on my blog, if you like, http://www.usingwp.com.

Pat June 21, 2009 at 12:21 pm

I’m getting a 404 File Not Found error when I preview my robots.txt file. It’s pointing to http://www.pat-phillips-homes.com/robots.txt. Is it because robots.txt is a virtual file and doesn’t actually exist in the root directory, and so I’ll get the error?

I also am using the XML Sitemap Generator for WordPress 3.1.3 plugin and have UNCHECKED the setting: “Add sitemap URL to the virtual robots.txt file.
The virtual robots.txt generated by WordPress is used. A real robots.txt file must NOT exist in the blog directory!” Should it be checked? I thought it might conflict with your plugin.

Peter June 21, 2009 at 1:07 pm

@Pat – It looks like there’s something odd going on somewhere – your 404 message has a strange path for the robots.txt file it couldn’t find. I’m not sure what would cause that, but maybe you could try deactivating your sitemap generator plugin and previewing the robots.txt file again?

Pat June 21, 2009 at 3:19 pm

Well, I tried deactivating Google XML Sitemap and changing my permalinks setting back to default instead of structured, and neither action made any difference. Still getting the 404 error: /static/pat-phillips-homes.com//robots.txt’ was not found on this server. Should I go ahead and create a robots.txt file, copy the code that your plugin generated, paste it into that file and upload it to my server? Then I could delete the two plugins, since I’d have a physical robots.txt file that I can point Google Webmaster Tools to.

Peter June 21, 2009 at 7:15 pm

@Pat – Yeah, I guess creating a physical robots.txt might be the easiest thing in your case.

amanda July 3, 2009 at 2:01 pm

Hey, what does “Disallow: /” on a bunch of things mean? Do I need to edit the file?

User-agent: Telesoft
Disallow: /

User-agent: The Intraformant
Disallow: /

..etc.

amanda July 3, 2009 at 2:22 pm

I tried a validator and it says I have a lot of errors and warnings. I noticed that some names, like webextractor, are in there twice. Any idea what else I need to change, or can I leave it?

Peter July 3, 2009 at 3:26 pm

Hi Amanda. Yes, there were a couple of duplicate entries in there – don’t know how I missed those.. I have removed them and updated the plugin files on wordpress.org so if you re-install the plugin you should be good to go. I also ran it through a validator and it comes up clean. Thanks for letting me know about the errors.

The “Disallow: /” line means disallow access to the complete domain for whichever “User-agent” is listed above it. Unless you want to make specific exceptions you shouldn’t need to edit anything.

MAC13 July 10, 2009 at 10:49 am

I have the same problem as Pat. When I check the page http://www.pawel-trzepiota.pl/robots.txt there is a 404 message. But if I check http://www.pawel-trzepiota.pl/index.php/robots.txt the plugin generates the file.

Now I wonder if I should add a rewrite clause for robots.txt… but the way you describe it it should work without it.

Peter July 10, 2009 at 11:08 am

@mac13 – that’s interesting, I notice all your links begin with /index.php/ – do you have that as part of your permalink structure?

MAC13 July 10, 2009 at 11:22 am

Self response – I’ve added ErrorDocument for index.php and it works….

MAC13 July 11, 2009 at 1:41 am

Well it was. Now I’ve changed it to do it without index.php and with the ErrorDoc 404 pointed to index.php it works.

Marko July 18, 2009 at 12:08 pm

Hello, I just want to be sure of something before I use this plugin. Since SEO is very important for my site and I know very little about WP folder/file structure can you just reassure me that your list in PC Robots.txt won’t block any pages/posts that I create?

Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /wp-login.php
Disallow: /wp-register.php

Peter July 18, 2009 at 12:16 pm

@Marko – good for you for asking! Everything in that list can safely be blocked without it affecting your posts and pages.

Marko July 18, 2009 at 6:37 pm

Thank you Peter :)

I suppose since my posts and pages are dynamic those can’t be blocked using robots.txt? The only way would be to use the robots meta element on them?

And thank you for the plugin very much!

Peter July 18, 2009 at 7:10 pm

@Marko – You’re welcome! You can block your posts and pages by adding the post or page URL to the robots.txt file with a Disallow instruction, but as you mention I prefer to add a meta tag..

Marko July 19, 2009 at 6:48 am

Hello Peter, me again :)

I just saw on WordPress Version 2.8 Release Notes http://codex.wordpress.org/Version_2.8 that Login and Registration pages are made noindex followed by default. Maybe now you would like to remove those locations from your PC Robots.txt. I know it can’t hurt to leave them there, but I just wanted to give you a heads up.

Cheers!

Peter July 19, 2009 at 6:52 am

@Marko – thanks for that, but I will leave those lines in there for the people not yet using 2.8..

Marko August 25, 2009 at 12:21 pm

One question, not particularly related to your plugin, but here it is. I have 2 sites, one in the server root and the other in a subfolder:
1. /
2. /subfolder/
Both have their own URLs which can be accessed through HTTP. I know that when a robot comes to index the site in the root, it would index the other site in the subfolder too, as a part of it. But I want to prevent indexing of the site in the subfolder as a part of the first site, because they are not related at all.

My idea:
I want to put robots.txt in root to forbid that subfolder and thus prevent indexing the second site with the first. But I also want to put robots.txt in subfolder and allow indexing. I think this is possible since the second site has its own HTTP URL.

Will this work? Thank you and sorry for boring you. :)

Saul August 27, 2009 at 2:24 pm

Thank you, thank you, thank you… This is a plugin that everyone who cares about SEO on their WP blogs should install as a default.

In fact, why doesn’t the privacy page of WP already offer this vitally-important functionality? Do WP admins/coders not care about SEO at all?

Thanks again,
Saul

Peter August 27, 2009 at 3:15 pm

@marko – I’m just guessing here, but I think you might be able to do it. Robots only look for robots.txt files in the root of a domain, so if you disallow site 2 (/subfolder/) in the robots.txt file for site 1, it shouldn’t be indexed as part of site 1. If you then allow site 2 to be indexed using its own robots.txt file (in /subfolder/) it should be indexed as a separate domain. Well that’s the theory at least.. you can run a quick check in Google webmaster tools to see if it works..
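
A rough way to picture the two-site setup is to treat each domain as having its own robots.txt and test them separately. In this Python sketch the domain names site-one.example and site-two.example are placeholders, and the second domain's root is assumed to map to /subfolder/ on disk, so its robots.txt is the one stored in that folder.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Served at http://site-one.example/robots.txt (file in the server root).
site_one = parse_rules("User-agent: *\nDisallow: /subfolder/\n")

# Served at http://site-two.example/robots.txt (file stored in /subfolder/).
site_two = parse_rules("User-agent: *\nDisallow:\n")

# Site 2's pages are blocked when reached through site 1's domain...
print(site_one.can_fetch("Googlebot", "http://site-one.example/subfolder/page/"))  # False
# ...site 1's own pages are unaffected...
print(site_one.can_fetch("Googlebot", "http://site-one.example/about/"))           # True
# ...and site 2 is fully open under its own domain.
print(site_two.can_fetch("Googlebot", "http://site-two.example/page/"))            # True
```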

Peter August 27, 2009 at 3:17 pm

@Saul — thank you for such a great comment – very much appreciated!

Regards,
Peter.

Marko August 27, 2009 at 5:49 pm

Thanks very much for your opinion, Peter. It does work!

fruityoaty September 11, 2009 at 10:24 am

I noticed you automatically included this in your plugin’s virtual robots.txt:
User-agent: Googlebot
Disallow:

Wouldn’t that prevent Google from indexing one’s site… and hence, result in decrease of search engine traffic?

Peter September 11, 2009 at 11:02 am

@Fruity – Whatever comes after the Disallow statement is disallowed.. so if there’s nothing after it, then nothing is disallowed, or in other words everything is allowed. I know it sounds backwards but it is in fact the proper way of allowing full access…
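
If you want to convince yourself, Python's built-in urllib.robotparser treats the two forms the same way. This is only a sketch with a placeholder URL, and other parsers could in principle differ, but it matches the original robots.txt format.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ALLOW_ALL = "User-agent: Googlebot\nDisallow:\n"    # nothing after Disallow
BLOCK_ALL = "User-agent: Googlebot\nDisallow: /\n"  # a single slash
URL = "http://example.com/any-post/"                # placeholder URL

print(allowed(ALLOW_ALL, "Googlebot", URL))  # True: nothing is disallowed
print(allowed(BLOCK_ALL, "Googlebot", URL))  # False: the whole site is disallowed
```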

Henning September 17, 2009 at 11:48 am

I just installed this plugin and it works great. I use the “Google XML Sitemaps” by Arne Brachhold and this creates a sitemap.xml and sitemap.xml.gz automatically.
So is it possible to enhance your check to both files and add the zipped version to the virtual robots.txt, if such a file exists?

Peter September 17, 2009 at 2:26 pm

@Henning — Glad the plugin is working great. Yes, good idea about the two types of sitemap. Look for an update shortly…

Henning September 17, 2009 at 7:11 pm

Wow … fast reaction! Thanks for the update.

Doug Smith September 21, 2009 at 8:06 pm

Peter when I installed this I looked at the default values and saw that Google bots were under Disallow. You might want to check the default values in the version over at WordPress.org. I corrected the values to Allow.

Is Allow necessary if the default is to allow all agents with disallow used to prevent bots you don’t want?

Peter September 22, 2009 at 3:55 am

@Doug – take a look at http://www.robotstxt.org/orig.html#format for the official format. You can use an Allow statement, but it isn’t universally supported.. and you’re right that it wouldn’t be necessary if all pages are allowed by default.

Katie October 4, 2009 at 5:28 pm

Great plugin – superbly done. Thank you.

Not included in your list – just noted this one mentioned on an info site … not sure as to whether I ought to add it to the list …

User-agent: ia_archiver-web.archive.org
Disallow: /

Bassey October 19, 2009 at 5:25 pm

Hello Peter. Thanks for this great plugin. What would novices like me do without gurus like yourself?!

My blog is in a subdirectory like Marko’s above. Does it mean I need to make a subdomain like http://subdomain.domain.tld to allow bots to access my WP blog separately from the rest of the site?

I don’t see how the method you recommended above can work otherwise.

Ciuly October 28, 2009 at 4:18 pm

is there a changelog? if not, there ought to be. some people actually like to know what changed ;)

Amy November 8, 2009 at 9:34 pm

Hi, once I install the Robots.txt plugin, do I need to make any changes to the settings? I’m afraid I don’t quite understand what to do once it’s installed!

Noelle November 8, 2009 at 9:43 pm

Wow! Thanks for the tip. I have been looking for any easy way to do robots.txt all day and this is perfect for someone like me who is not so tech savvy.

chris November 11, 2009 at 9:41 pm

The plugin is not updating my file. When I preview the robots.txt file it still says
User-agent: *
Disallow: /

Do I need to reactivate the plugin or will it change after a period of time?

Peter November 12, 2009 at 4:29 am

@Ciuly – when I update the plugin using the wordpress SVN it asks for a note to go with the update, which I always provide. I confess I thought those notes appeared on wordpress.. ah well, I will put one on this page too.. thanks for letting me know..

Peter November 12, 2009 at 4:33 am

@Amy – if you don’t understand what the plugin does then you probably don’t need it.. wordpress will work perfectly well without it. If you still want to use it anyway, there’s no need to change any of the settings, just activate it and you’re good to go.

Peter November 12, 2009 at 4:34 am

@chris – it looks like you might have your blog privacy settings set to block the search engines. The plugin will only change your robots.txt if your blog is public..
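
In other words, the privacy setting wins. As a sketch of that behaviour in Python (the function name `served_robots_txt` is invented here, and this is a model of what is described in this thread, not WordPress or plugin code):

```python
# What a blog set to block search engines serves, as quoted in the comment above.
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"


def served_robots_txt(blog_is_public, plugin_rules):
    """Return the robots.txt text a visitor would see (hypothetical model)."""
    if not blog_is_public:
        # Privacy setting blocks search engines: the plugin's rules are ignored.
        return BLOCK_EVERYTHING
    return plugin_rules


plugin_rules = "User-agent: *\nDisallow: /wp-admin/\n"

print(served_robots_txt(False, plugin_rules) == BLOCK_EVERYTHING)  # True
print(served_robots_txt(True, plugin_rules) == plugin_rules)       # True
```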

chris November 12, 2009 at 9:12 am

Peter – you were right. It was my privacy settings blocking the search engines. It’s all updated. Thank you.

Pet Society Help November 25, 2009 at 5:51 am

How do I change my virtual robots.txt on my wordpress blog to allow googlebot to crawl my site? I realized under privacy settings that my blog was not visible to search engines (why is that set to default?) so I checked it to be visible, but now my virtual robots.txt reads as:
User-agent: *
Disallow:
How do I make it allow? I know creating a new robots.txt is not allowed?

Peter November 25, 2009 at 6:04 am

@Pet Society – Your robots.txt file is fine. Notice there is nothing after the Disallow: bit, that means nothing is disallowed, or in other words, everything is allowed..

Pet Society Help November 25, 2009 at 10:14 pm

Oh ok, I was wondering if that made a difference, glad it does. Thank you for your help, great blog. I’ll be sure to keep coming back for my WordPress help and info.

Greg November 30, 2009 at 7:00 pm

Peter, thanks for the easy to use and follow plug-in! Quick question, since Bing, Yahoo, Ask and other “safe” robots aren’t listed does that mean they automatically have access to crawl the site? I assume it does, just want to make sure. Thanks!

Peter December 1, 2009 at 4:31 am

@Greg – Yes, you’ve got it right.. and thank you for the kind words..

Peter December 1, 2009 at 11:45 am

I installed your plugin and I am seeing the following among other code when I look at the robots file

User-agent: Googlebot
Disallow:

Does this mean to disallow google? If so, what do to allow google, msn, yahoo and small meta search engines. Thanks.

Peter December 1, 2009 at 12:23 pm

@Peter – please read the previous comments on this page – that question has been answered a few times already :-)

jack December 9, 2009 at 4:12 pm

Hi

I am running the latest WP with Google Sitemap and, as in the post last year, I get “webpage could not be found” when I click on preview my file.
I am using the default permalinks. Does that cause a problem? I deactivated the plugin and then reactivated it.

Any tips would be most appreciated.

Cheers

Peter January 11, 2010 at 1:58 pm

Hi Peter, I have installed your plugin, but it doesn’t work, I only get 404 Not Found.
Oh, I use WP 2.8…:(

MDMower January 13, 2010 at 2:52 pm

I get the 404 error that a couple of others have mentioned, but I can’t seem to resolve the issue. Regardless of whether I manually put a robots.txt or remove robots.txt in my root directory (pagsa.missouri.edu/robots.txt), this plugin does not save changes to the file. I am using the “Google XML Sitemaps” plugin, and have tried using it with and without checking “Add sitemap URL to the virtual robots.txt file”.

Ken January 24, 2010 at 10:16 am

Installed easily, but when I click on -
“You can preview your robots.txt file”
(under “PC Robots.txt Settings”)
I get -
“Page Cannot be Found, Please try Search or Return to HomePage”.
Any advice greatly appreciated. (Concerned that an improper install might hurt with SEs…)
Thanks for the plugin! – Ken

Mark February 5, 2010 at 10:37 am

Peter, Google’s info on robots.txt says:

If you want search engines to index everything in your site, you don’t need a robots.txt file (not even an empty one).

So, if you’re activating the plug-in without making any amendments or blocking any pages, will it harm your site in any way from Google’s point of view?

Peter February 8, 2010 at 10:00 am

You need custom permalinks for the plugin to work. When you have custom permalinks enabled wordpress will process requests for non-existent files (including robots.txt) but if you don’t have custom permalinks enabled wordpress will never even see the request.. you will just get a 404 not found error..
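
The routing can be pictured with a toy Python model. Everything here is a simplification: `handle_request` and its arguments are invented for illustration, and the "real file wins" step is an assumption based on how permalink rewrite rules are commonly set up, which would also explain why a physical robots.txt takes over from the virtual one.

```python
def handle_request(path, files_on_disk, custom_permalinks, virtual_files):
    """Return (status, body) for a request, mimicking the routing described above."""
    # 1. A real file wins (assumed): the web server serves it directly.
    if path in files_on_disk:
        return 200, files_on_disk[path]
    # 2. Without custom permalinks the request never reaches WordPress,
    #    so the web server answers with its own 404.
    if not custom_permalinks:
        return 404, "Not Found"
    # 3. With custom permalinks, requests for non-existent files are handed
    #    to WordPress, which can answer for virtual files such as robots.txt.
    if path in virtual_files:
        return 200, virtual_files[path]
    return 404, "Page not found"


virtual = {"/robots.txt": "User-agent: *\nDisallow: /wp-admin/\n"}
physical = {"/robots.txt": "User-agent: *\nDisallow:\n"}

print(handle_request("/robots.txt", {}, False, virtual)[0])  # 404: default permalinks
print(handle_request("/robots.txt", {}, True, virtual)[0])   # 200: virtual file served
# A physical robots.txt is served as-is, whatever the plugin would have produced.
print(handle_request("/robots.txt", physical, True, virtual)[1] == physical["/robots.txt"])  # True
```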

Peter February 8, 2010 at 10:03 am

@Mark — if you don’t have a specific reason to use this plugin, i.e. to block spambots, then you don’t need to use it at all. But saying that, it won’t do any harm if you do use it..

Kaushal February 11, 2010 at 1:40 pm

Hey, thanks Peter. I had tried manually uploading the robots.txt and uploaded it to the wrong directory in cPanel. Deleted it and installed this plugin. Thanks :D

Mickey September 2, 2010 at 3:00 pm

I hoped that installing this plugin would correct all of the crawl errors. It has still not resolved – I installed it two days ago and as of yesterday, I had yet another Google crawl error according to my Google analytics account. Any suggestions? I didn’t change anything in the plugin, just activated it. I can’t get any response on the Google Forum, and I am at a loss here. This may be a stupid question, but I will ask anyway – does this affect my adsense account too? I mean am I missing out on an adsense income because of crawl errors?

Peter September 2, 2010 at 3:28 pm

@Mickey – I just checked your robots.txt file and it looks fine so I’m not sure what the problem might be.. do you get any specific errors?

Phil October 23, 2010 at 7:50 am

hey peter.

Installed your plugin. all good. i just need to make sure, as this is the first time i am using your plugin…

prior to installing your plugin, i have gone into cpanel > file manager and manually copied and pasted some *disallow* rules…in other words i have a physical robots.txt file in place on my blog…

now that i have your plugin installed, am i to understand that i should now delete that code and totally rely on the virtual robots.txt file??

BTW, if it is *easily* explained, why should one *not* have a physical robots.txt file, out of curiosity??!!?? i mean, if one did not use your plugin, could one not insert the code manually, as i have done, and just leave it?? after all, it’s doing its job??

cheers for the clarification,

phil

Peter October 23, 2010 at 7:57 am

@phil — no reason why you can’t use a physical robots.txt file instead.. yes, you could just take the code from the plugin and use it manually. The plugin is just designed to make things as easy as possible..

Pascal November 12, 2010 at 11:38 am

Thx, I was looking for this one. I’m using the sitemap plugin which makes use of the virtual robots.txt but I would like to add my own custom stuff.

Jenski March 18, 2011 at 7:45 am

Tnx for the plug, it’s simply great, one tiny little question tho.
Can I safely put Asimov’s rules at the top of the document, or does that throw an error which stops parsing?

Disallow: /harming/humans
Disallow: /ignoring/human/orders
Disallow: /harm/to/self

Peter March 18, 2011 at 7:52 am

@Jenski – LOL.. I like it..

Greg August 18, 2011 at 9:50 am

Peter, I have your plugin installed but when I go to webmaster tools crawler access it says the robots.txt file is Status 404 (not found). In my wordpress admin I do have the privacy setting set to “I would like my site to be visible to everyone, including search engines”.
When I visit the PC Robots.txt plugin and click preview your robots.txt file it says “Not Found. The requested URL /robots.txt was not found on this server.” I’m not sure what to do next. Please advise. Thanks.

Peter August 18, 2011 at 5:27 pm

@Greg, make sure you’ve got your permalinks set to something other than the default.. otherwise wordpress won’t see the requests for the robots.txt..
