Removing duplicate search engine content using robots.txt

Here’s something that no webmaster wants to see:

Screenshot showing that Google cannot access the homepage due to a robots.txt restriction

It’s part of a screenshot from Google Webmaster Tools that says “[Google] can’t currently access your home page because of a robots.txt restriction”. Arghh!

This came about because, a couple of nights back, I made some changes to the website in order to remove the duplicate content in Google. Google (and other search engines) don’t like duplicate content, so by removing the archive pages, categories, feeds, etc. from their indexes, I ought to be able to reduce the overall number of pages from this site that are listed and at the same time increase the quality of the results (and hopefully my position in the index). Ideally, I can direct the major search engines to only index the home page and individual item pages.

I based my changes on some information on the web that caused me a few issues, so this is what I did; by following these notes, hopefully others won’t repeat my mistakes. There is a caveat, however: use this advice with care. I’m not responsible for other people’s sites dropping out of the Google index (or other such catastrophes).

Firstly, I made some changes to the <head> section in my WordPress template:

Because WordPress content is generated dynamically, this tells the search engines which pages should be in, and which should be out, based on the type of page. So, basically, if this is a post page, another single page, or the home page then go for it; otherwise follow the appropriate rule for Google, MSN or other spiders (Yahoo! and Ask will follow the standard robots directive), telling them not to index or archive the page but to follow any links and, additionally for Google, not to include any open directory information. This was based on advice from askapache.com but amended: the default indexing behaviour for spiders is index, follow (or all), so I didn’t need to specify rules for Google and MSN as in the original example. I did need something there, though, otherwise the logic reads “if condition is met donothing else dosomething” and the donothing could be problematic.
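The decision described above can be sketched roughly as follows. This is a Python illustration and not the WordPress template code itself; the function name robots_meta_tags, the page-type labels and the exact directive strings are assumptions chosen to mirror the description.

```python
from typing import List

# Hypothetical page-type labels; the real template uses WordPress conditionals.
INDEXABLE_PAGE_TYPES = {"home", "post", "page"}


def robots_meta_tags(page_type: str) -> List[str]:
    """Return the robots <meta> tags to emit for a given type of page."""
    if page_type in INDEXABLE_PAGE_TYPES:
        # Spiders index and follow by default, but an explicit tag is emitted
        # so that the "allowed" branch of the condition is never empty.
        return ['<meta name="robots" content="index,follow" />']
    # Archives, categories, feeds, etc.: keep out of the index and the cache,
    # but still follow links. "noodp" is the extra open directory hint for Google.
    return [
        '<meta name="googlebot" content="noindex,noarchive,follow,noodp" />',
        '<meta name="msnbot" content="noindex,noarchive,follow" />',
        '<meta name="robots" content="noindex,noarchive,follow" />',
    ]


for kind in ("home", "post", "category"):
    print(kind, robots_meta_tags(kind))
```

The important design point is that both branches emit something, which avoids the empty “donothing” case.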

Next, following fiLi’s advice for using robots.txt to avoid content duplication, I started to edit my robots.txt file. I won’t list the file contents here; suffice to say that the final result is visible on my web server. For those who think that publishing the location of robots.txt is a bad idea (because the contents are effectively a list of places that I don’t want people to go to), think of it this way: robots.txt is a standard file on many web servers, which by necessity needs to be readable and therefore should not be used for security purposes. That’s what file permissions are for (one useful analogy refers to robots.txt as a “no entry” sign, not a locked door)!

The main changes that I made were to block certain folders:

Disallow: /blog/page
Disallow: /blog/tags
Disallow: /blog/wp-admin
Disallow: /blog/wp-content
Disallow: /blog/wp-includes
Disallow: /*/feed
Disallow: /*/trackback

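(The trailing slash is significant: rules are matched as URL prefixes, so if the slash is missing then the directory URL itself is blocked, along with anything else whose path starts with the same characters, but if it is present then only URLs within the directory are affected, including subdirectories.)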
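The effect of the trailing slash can be checked with Python’s standard-library robots.txt parser, which applies plain prefix matching (it does not understand the wildcard extensions). This is only a sketch: the helper name allowed, the ExampleBot agent and the example.com host are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(rule: str, path: str) -> bool:
    """Check one path against a robots.txt containing a single generic rule."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", rule])
    # ExampleBot and www.example.com are placeholders, not real values.
    return parser.can_fetch("ExampleBot", "http://www.example.com" + path)


for rule in ("Disallow: /blog/page", "Disallow: /blog/page/"):
    for path in ("/blog/page", "/blog/page/2/", "/blog/pages-i-like"):
        print(f"{rule!r:26} {path:22} allowed={allowed(rule, path)}")
```

Without the slash all three paths are blocked, including the unrelated /blog/pages-i-like; with the slash only /blog/page/2/ is blocked.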

I also blocked certain file extensions:

Disallow: /*.css$
Disallow: /*.html$
Disallow: /*.js$
Disallow: /*.ico$
Disallow: /*.opml$
Disallow: /*.php$
Disallow: /*.shtml$
Disallow: /*.xml$

Then, I blocked URLs that include ? except those that end with ?:

Allow: /*?$
Disallow: /*?
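To see how this pair of rules interacts, here is a minimal Python sketch of wildcard matching. It assumes the precedence that Google documents (the longest matching rule wins and a tie goes to Allow); other spiders may resolve conflicts differently, and the names rule_to_regex and is_allowed are invented for this example.

```python
import re

RULES = [("Allow", "/*?$"), ("Disallow", "/*?")]


def rule_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern: * is a wildcard, a trailing $ anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Everything except * is literal, including the ? character.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path: str, rules=RULES) -> bool:
    """Assumed precedence: longest matching pattern wins; ties go to Allow."""
    best = None
    for directive, pattern in rules:
        if rule_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    # A path that matches no rule at all is allowed.
    return True if best is None else best[1]


for path in ("/blog/?p=123", "/blog/?", "/blog/2007/some-post"):
    print(path, is_allowed(path))
```

Under these assumptions /blog/?p=123 is blocked, while /blog/? and URLs without a query string remain accessible.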

The problem at the head of this post came about because I blocked all .php files using:

Disallow: /*.php$

As https://www.markwilson.co.uk/blog/ is equivalent to https://www.markwilson.co.uk/blog/index.php, I was effectively stopping spiders from accessing the home page. I’m not sure how to avoid the duplication, as both URLs serve the same content, but in a site of about 1500 URLs at the time of writing I’m not particularly worried about a single duplicate instance (although I would like to know how to work around the issue). I resolved the access problem by explicitly allowing access to index.php (and another important file, sitemap.xml) using:

Allow: /blog/index.php
Allow: /sitemap.xml

It’s also worth noting that neither wildcards (* and $) nor Allow are valid directives in the original robots.txt standard and so the file will fail validation. After a bit of research I found that the major search engines have each added support for their own enhancements to the robots.txt specification:

  • Google (Googlebot), Yahoo! (Slurp) and Ask (Teoma) support allow directives.
  • Googlebot, MSNbot and Slurp support wildcards.
  • Teoma, MSNbot and Slurp support crawl delays.

For that reason, I created multiple code blocks – one for each of the major search engines and a catch-all for other spiders, so the basic structure is:

# Google
User-agent: Googlebot
# Add directives below here

# MSN
User-agent: msnbot
# Add directives below here

# Yahoo!
User-agent: Slurp
# Add directives below here

# Ask
User-agent: Teoma
# Add directives below here

# Catch-all for other agents
User-agent: *
# Add directives below here

Just for good measure, I added a couple more directives for the Alexa archiver (do not archive the site) and Google AdSense (read everything to determine what my site is about and work out which ads to serve).

# Alexa archiver
User-agent: ia_archiver
Disallow: /

# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /*

Finally, I discovered that Google, Yahoo!, Ask and Microsoft now all support sitemap autodiscovery via robots.txt:

Sitemap: http://www.markwilson.co.uk/sitemap.xml

This can be placed anywhere in the file, although Microsoft don’t actually do anything with it yet!
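As a quick illustration that the position of the Sitemap line does not matter, the following sketch uses Python’s standard-library parser (site_maps() needs Python 3.8 or later). The helper name sitemaps_in and the single Disallow rule are just for the example.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_in(lines):
    """Return the sitemap URLs declared in a robots.txt, or None if there are none."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.site_maps()


sitemap = "Sitemap: http://www.markwilson.co.uk/sitemap.xml"
rules = ["User-agent: *", "Disallow: /blog/wp-admin"]

print(sitemaps_in([sitemap, ""] + rules))  # Sitemap line first
print(sitemaps_in(rules + ["", sitemap]))  # Sitemap line last
```

Both calls report the same sitemap URL, because the directive is independent of the User-agent sections.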

Having learned from my initial experiences of locking Googlebot out of the site, I checked the file using the Google robots.txt analysis tool and found that Googlebot was ignoring the directives under User-agent: * (no matter whether that section was first or last in the file). Thankfully, posts to the help groups for crawling, indexing and ranking and Google webmaster tools indicated that Googlebot will ignore generic settings if there is a specific section for User-agent: Googlebot. The workaround is to include all of the generic exclusions in each of the agent-specific sections – not exactly elegant but workable.
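The workaround can be sketched as a small generator that repeats the generic exclusions in every agent-specific section. Python’s standard-library parser happens to treat sections the same way (an agent with its own section never falls back to User-agent: *), which makes it handy for checking the result, but it is not Googlebot. The rule lists, the build_robots_txt and can_fetch helpers and the example.com host are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rule sets, not the full contents of the real file.
COMMON_RULES = ["Disallow: /blog/wp-admin", "Disallow: /blog/wp-includes"]
AGENT_RULES = {"Googlebot": ["Disallow: /blog/tags"]}


def build_robots_txt(common, per_agent):
    """Repeat the generic rules inside every agent-specific section."""
    lines = []
    for agent, extra in per_agent.items():
        lines += [f"User-agent: {agent}", *common, *extra, ""]
    lines += ["User-agent: *", *common]
    return lines


def can_fetch(lines, agent, path):
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch(agent, "http://www.example.com" + path)


# Naive layout: the generic rules appear only under "User-agent: *".
naive = ["User-agent: Googlebot", "Disallow: /blog/tags", "",
         "User-agent: *", *COMMON_RULES]
fixed = build_robots_txt(COMMON_RULES, AGENT_RULES)

print("naive:", can_fetch(naive, "Googlebot", "/blog/wp-admin/"))
print("fixed:", can_fetch(fixed, "Googlebot", "/blog/wp-admin/"))
print("\n".join(fixed))
```

With the naive layout the Googlebot section wins outright and /blog/wp-admin/ stays reachable; once the generic rules are duplicated it is blocked as intended.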

I have to wait now for Google to re-read my robots.txt file, after which it will be able to access the updated sitemap.xml file which reflects the exclusions. Shortly afterwards, I should start to see the relevance of the site:www.markwilson.co.uk results improve and hopefully soon after that my PageRank will reach the elusive 6.

Links

Google webmaster help center.
Yahoo! search resources for webmasters (Yahoo! Slurp).
About Ask.com: Webmasters.
Windows Live Search site owner help: guidelines for successful indexing and controlling which pages are indexed.
The web robots pages.

Where are the WVP2 codecs for QuickTime on a Mac?

It’s generally accepted that Macs are great computers for graphic design and audio-visual work – so why is it so hard to play Windows Media content on a Mac? I know that QuickTime is the centre of Apple’s audio-visual experience – so why should Apple support competing formats – but perhaps I should really ask why the various software companies have seen fit to introduce such a myriad of audio and video codecs? I’m a techie and I can only just keep up – think about the poor consumer who just wants to share some family videos with the grandparents!

The trouble is that Microsoft, as the developer of the most widely installed operating system on the planet (with a correspondingly huge number of multimedia file formats as described in Microsoft knowledge base article 316922), has seen fit to dump development of Windows Media products for other platforms. Quoting part of the Wikipedia article on Windows Media Player:

Version 9 was the final version of Windows Media Player to be released for Mac OS X before development was cancelled by Microsoft. WMP for Mac OS X received widespread criticism from Mac users due to poor performance and features. Developed by the Windows Media team at Microsoft instead of the Macintosh Business Unit and released in 2003, on release the application lacked many basic features that were found in other media players such as Apple’s iTunes and QuickTime Player. It also lacked support for many media formats that version 9 of the Windows counterpart supported on release 10 months earlier.

The Mac version supported only Windows Media encoded media (up to version 9) enclosed in the ASF format, lacking support for all other formats such as MP4, MPEG, and Microsoft’s own AVI format. On the user interface front, it did not prevent screensavers from running during playback, it did not support file drag-and-drop, nor did it support playlists. While Windows Media Player 9 had added support for some files that use the WMV9 codec (also known as the WMV3 codec), in other aspects it was seen as having degraded in features from previous versions.

On January 12, 2006 Microsoft announced it had ceased development of Windows Media Player for Mac.[4] Microsoft now distributes a third-party plugin called WMV Player (produced and maintained by Flip4Mac) which allows some forms of Windows Media to be played within Apple’s QuickTime player and other QuickTime-aware applications.[5] Mac users can also use the free software media player VLC, which is also able to play WMV-3 / WMV-9 / VC-1 Windows Media files.

It seems that the Flip4Mac WMV Player, which should provide the missing Windows Media support for Mac users (as endorsed by Microsoft), does not support all Windows Media codecs: in particular, it refuses to play content encoded with the Windows Media Video 9 Image v2 (WVP2) codec.

I can understand Microsoft’s position – after all they want to preserve their market share – so why doesn’t Apple make it easier for switchers with legacy video content? As the iLife applications are such a selling point for Apple, why not make it easier to convert from the Windows equivalents?

My problem is that, for the last few years, I’ve been creating home video content using Windows Movie Maker and Photo Story. They may not be the best video applications in the world but they are fine for movies of holidays and the kids and are included with Windows XP (well, Movie Maker is; Photo Story is a free add-on). Nowadays, I have a Mac but I still want to play my old content. The resulting WMV content from Movie Maker hasn’t caused too many problems as it uses the Windows Media Audio 9.1 and Windows Media Video 9 (WMV3) codecs and simply needs appropriate QuickTime components to be installed. Unfortunately, the video track in the Photo Story output (WVP2) refuses to play in either QuickTime (WMV Player) or Windows Media Player for Mac OS X, and as far as I can tell there are no suitable codecs available.

In desperation, I went back to Photo Story and tried to export in another format but there is no such option (it supports various screen sizes and frame rates but they all seem to use the same codec).

One macKB thread suggests using Dr DivX to convert the file but the latest version of Dr DivX failed (on both Windows and Mac); similarly, the DivX Converter didn’t work for me.

Eventually, I found a utility that could convert the file for me (Advanced X Video Converter). It’s done a good job, although whilst the quality is acceptable for my home movies there are some visible compression artifacts (I used the H.264 video and 24-bit audio codecs to convert to a .MOV file). In fairness, the compression artifacts may also be visible in the original WMV file, and anyway they are hardly surprising: the video was created from compressed JPEG and MP3 files, which were then compressed to WMV and once more to MOV, so the quality is certain to have suffered along the way. What’s possibly of greater concern is the resulting increase in file size: up from 19.5MB to 431.4MB, roughly 22 times larger.

I’m glad I got there in the end – for a while it seemed that I would have to keep a Windows virtual machine just to play old home movies – and there I was, naively believing that converting to digital capture and storage would save me from issues with legacy formats.