{"id":772,"date":"2007-04-17T22:42:45","date_gmt":"2007-04-17T21:42:45","guid":{"rendered":"http:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm"},"modified":"2012-04-02T23:48:24","modified_gmt":"2012-04-02T22:48:24","slug":"removing-duplicate-search-engine-content-using-robotstxt","status":"publish","type":"post","link":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm","title":{"rendered":"Removing duplicate search engine content using robots.txt"},"content":{"rendered":"<p>Here&#8217;s something that no webmaster wants to see:<\/p>\n<p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/www.markwilson.co.uk\/blog\/images\/google-robots-restrictions.png?w=700&#038;ssl=1\" alt=\"Screenshot showing that Google cannot access the homepage due to a robots.txt restriction\" \/><\/p>\n<p>It&#8217;s part of a screenshot from the <a href=\"https:\/\/www.google.com\/webmasters\/tools\/\">Google Webmaster Tools<\/a> that says &#8220;[Google] can&#8217;t currently access your home page because of a robots.txt restriction&#8221;. Arghh!<\/p>\n<p>This came about because, a couple of nights back, I made some changes to the website in order to remove duplicate content from the Google index. Google (and other search engines) don&#8217;t like duplicate content, so by removing the archive pages, categories, feeds, etc. from their indexes, I ought to be able to reduce the overall number of pages from this site that are listed and at the same time increase the quality of the results (and hopefully my position in the index). 
Ideally, I can direct the major search engines to only index the home page and individual item pages.<\/p>\n<p>I based my changes on some information from the web that caused me a few issues &#8211; so this is what I did; hopefully, by following these notes, others won&#8217;t repeat my mistakes. There is a caveat, though &#8211; use this advice with care &#8211; I&#8217;m not responsible for other people&#8217;s sites dropping out of the Google index (or other such catastrophes).<\/p>\n<p>Firstly, I made some changes to the <code><head><\/code> section in my WordPress template:<\/p>\n<p><dirtycode:noclick><?php if(is_single() || is_page() || is_home()) { ?><br \/>\n<meta name=\"robots\" content=\"all\" \/><br \/>\n<?php } else { ?><br \/>\n<meta name=\"googlebot\" content=\"noindex,noarchive,follow,noodp\" \/><br \/>\n<meta name=\"robots\" content=\"noindex,noarchive,follow\" \/><br \/>\n<meta name=\"msnbot\" content=\"noindex,noarchive,follow\" \/><br \/>\n<?php }?><\/dirtycode><\/p>\n<p>Because WordPress content is generated dynamically, this tells the search engines which pages should be in, and which should be out, based on the type of page. So, basically, if this is a post page, another single page, or the home page then go for it; otherwise follow the appropriate rule for Google, MSN or other spiders (Yahoo! and Ask will follow the standard robots directive) telling them not to index or archive the page but to follow any links and additionally, <a href=\"http:\/\/www.mattcutts.com\/blog\/google-supports-meta-noodp-tag\/\">for Google not to include any open directory information<\/a>. 
This was based on <a href=\"http:\/\/www.askapache.com\/seo\/wordpress-robotstxt-optimized-for-seo.html\">advice from askapache.com<\/a> but amended because <a href=\"http:\/\/www.seoconsultants.com\/meta-tags\/robots\/\">the default indexing behaviour for spiders is to <code>index<\/code>, <code>follow<\/code> or <code>all<\/code><\/a> so I didn&#8217;t need separate rules for Google and MSN as in the original example (but did need something there, otherwise the logic reads &#8220;if condition is met <em>donothing<\/em> else <em>dosomething<\/em>&#8221; and the <em>donothing<\/em> could be problematic).<\/p>\n<p>Next, following <a href=\"http:\/\/www.filination.com\/tech\/2007\/03\/10\/wordpress-seo-using-robotstxt-to-avoid-content-duplication\/\">fiLi&#8217;s advice for using robots.txt to avoid content duplication<\/a>, I started to edit <a href=\"http:\/\/www.markwilson.co.uk\/robots.txt\" rel=\"nofollow\">my robots.txt file<\/a>. I won&#8217;t list the file contents here &#8211; suffice to say that the final result is visible on my web server. For those who think that publishing the location of robots.txt is a bad idea (because the contents are effectively a list of places that I don&#8217;t want people to go to), think of it this way: robots.txt is a standard file on many web servers, which by necessity needs to be readable and therefore should not be used for security purposes &#8211; that&#8217;s what file permissions are for (<a href=\"http:\/\/www.robotstxt.org\/wc\/faq.html#nosecurity\">one useful analogy refers to robots.txt as a &#8220;no entry&#8221; sign &#8211; not a locked door<\/a>)!<\/p>\n<p>The main changes that I made were to block certain folders:<\/p>\n<p><dirtycode:noclick>Disallow: \/blog\/page<br \/>\nDisallow: \/blog\/tags<br \/>\nDisallow: \/blog\/wp-admin<br \/>\nDisallow: \/blog\/wp-content<br \/>\nDisallow: \/blog\/wp-includes<br \/>\nDisallow: \/*\/feed<br \/>\nDisallow: \/*\/trackback<\/dirtycode><\/p>\n<p>(the 
trailing slash is significant &#8211; robots.txt rules match on URL prefixes, so if it is missing then any URL beginning with that path is blocked, including the directory itself; if it is present then only URLs within the directory are blocked, including subdirectories, but not the directory URL itself).<\/p>\n<p>I also blocked certain file extensions:<\/p>\n<p><dirtycode:noclick>Disallow: \/*.css$<br \/>\nDisallow: \/*.html$<br \/>\nDisallow: \/*.js$<br \/>\nDisallow: \/*.ico$<br \/>\nDisallow: \/*.opml$<br \/>\nDisallow: \/*.php$<br \/>\nDisallow: \/*.shtml$<br \/>\nDisallow: \/*.xml$<\/dirtycode><\/p>\n<p>Then, I blocked <a href=\"http:\/\/www.google.com\/support\/webmasters\/bin\/answer.py?answer=40367&amp;topic=8846\">URLs that include ? except those that end with ?<\/a>:<\/p>\n<p><dirtycode:noclick>Allow: \/*?$<br \/>\nDisallow: \/*?<\/dirtycode><\/p>\n<p>The problem at the head of this post came about because I blocked all .php files using<\/p>\n<p><dirtycode:noclick>Disallow: \/*.php$<\/dirtycode><\/p>\n<p>Because https:\/\/www.markwilson.co.uk\/blog\/ is equivalent to https:\/\/www.markwilson.co.uk\/blog\/index.php, I was effectively stopping spiders from accessing the home page. I&#8217;m not sure how to get around that as both URLs serve the same content, but in a site of about 1500 URLs at the time of writing, I&#8217;m not particularly worried about a single duplicate instance (although I would like to know how to work around the issue). I resolved this by explicitly allowing access to index.php (and another important file &#8211; sitemap.xml) using:<\/p>\n<p><dirtycode:noclick>Allow: \/blog\/index.php<br \/>\nAllow: \/sitemap.xml<\/dirtycode><\/p>\n<p>It&#8217;s also worth noting that neither wildcards (<code>*<\/code>, <code>?<\/code>) nor <code>allow<\/code> are valid directives in the original robots.txt specification and so the file will fail <a href=\"http:\/\/tool.motoricerca.info\/robots-checker.phtml\">validation<\/a>. 
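<\/p>\n<p>To illustrate the trailing slash behaviour, here&#8217;s a sketch of how prefix matching works (the example URLs are hypothetical, not taken from my live file):<\/p>\n<p><dirtycode:noclick># Matches \/blog\/page, \/blog\/page\/2 and even \/blog\/pagerank.htm<br \/>\nDisallow: \/blog\/page<br \/>\n# Matches \/blog\/page\/2 and anything deeper, but not \/blog\/page itself<br \/>\nDisallow: \/blog\/page\/<\/dirtycode><\/p>\n<p>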
After a bit of research I found that the major search engines have each added support for their own enhancements to the robots.txt specification:<\/p>\n<ul>\n<li>Google (Googlebot), Yahoo! (Slurp) and Ask (Teoma) support <code>allow<\/code> directives.<\/li>\n<li>Googlebot, MSNbot and Slurp support wildcards.<\/li>\n<li>Teoma, MSNbot and Slurp support crawl delays.<\/li>\n<\/ul>\n<p>For that reason, I created multiple code blocks &#8211; one for each of the major search engines and a catch-all for other spiders, so the basic structure is:<\/p>\n<p><dirtycode:noclick># Google<br \/>\nUser-agent: Googlebot<br \/>\n# Add directives below here<\/p>\n<p># MSN<br \/>\nUser-agent: msnbot<br \/>\n# Add directives below here<\/p>\n<p># Yahoo!<br \/>\nUser-agent: Slurp<br \/>\n# Add directives below here<\/p>\n<p># Ask<br \/>\nUser-agent: Teoma<br \/>\n# Add directives below here<\/p>\n<p># Catch-all for other agents<br \/>\nUser-agent: *<br \/>\n# Add directives below here<\/dirtycode><\/p>\n<p>Just for good measure, I added a couple more directives for the Alexa archiver (do not archive the site) and Google AdSense (read everything to determine what my site is about and work out which ads to serve).<\/p>\n<p><dirtycode:noclick># Alexa archiver<br \/>\nUser-agent: ia_archiver<br \/>\nDisallow: \/<\/p>\n<p># Google AdSense<br \/>\nUser-agent: Mediapartners-Google*<br \/>\nDisallow:<br \/>\nAllow: \/*<\/dirtycode><\/p>\n<p>Finally, I discovered that <a href=\"http:\/\/blog.pushon.co.uk\/2007\/04\/13\/sitemap-auto-discovery\/\">Google, Yahoo!, Ask and Microsoft now all support sitemap autodiscovery via robots.txt<\/a>:<\/p>\n<p><dirtycode:noclick>Sitemap: http:\/\/www.markwilson.co.uk\/sitemap.xml<\/dirtycode><\/p>\n<p><a href=\"http:\/\/www.ysearchblog.com\/archives\/000437.html\">This can be placed anywhere in the file<\/a>, although <a href=\"http:\/\/blogs.msdn.com\/livesearch\/archive\/2007\/04\/11\/discovering-sitemaps.aspx\">Microsoft don&#8217;t actually do anything 
with it yet!<\/a><\/p>\n<p>Having learned from my initial experiences of locking Googlebot out of the site, I checked the file using the Google robots.txt analysis tool and found that Googlebot was ignoring the directives under <code>User-agent: *<\/code> (no matter whether that section was first or last in the file). Thankfully, posts to the help groups for <a href=\"http:\/\/groups.google.co.uk\/group\/Google_Webmaster_Help-Indexing\/browse_frm\/thread\/b4ec928eaf1dba1a\">crawling, indexing and ranking<\/a> and <a href=\"http:\/\/groups.google.co.uk\/group\/Google_Webmaster_Help-Tools\/browse_frm\/thread\/5b273dc7f596580a\">Google webmaster tools<\/a> indicated that Googlebot will ignore generic settings if there is a specific section for <code>User-agent: Googlebot<\/code>. The workaround is to include all of the generic exclusions in each of the agent-specific sections &#8211; not exactly elegant but workable.<\/p>\n<p>I have to wait now for Google to re-read my robots.txt file, after which it will be able to access the updated sitemap.xml file which reflects the exclusions. Shortly afterwards, I should start to see the relevance of the <a href=\"http:\/\/www.google.co.uk\/search?q=site:www.markwilson.co.uk&amp;filter=0\"><code>site:www.markwilson.co.uk<\/code><\/a> results improve and hopefully soon after that my PageRank will reach the elusive 6.<\/p>\n<h3>Links<\/h3>\n<p><a href=\"http:\/\/www.google.com\/support\/webmasters\/\">Google webmaster help center<\/a>.<br \/>\n<a href=\"http:\/\/help.yahoo.com\/help\/us\/ysearch\/webmaster\/webmaster-01.html\">Yahoo! search resources for webmasters<\/a> (<a href=\"http:\/\/help.yahoo.com\/help\/us\/ysearch\/slurp\/\">Yahoo! 
Slurp<\/a>).<br \/>\n<a href=\"http:\/\/about.ask.com\/en\/docs\/about\/webmasters.shtml\">About Ask.com: Webmasters<\/a>.<br \/>\nWindows Live Search site owner help: <a href=\"http:\/\/search.msn.com.sg\/docs\/siteowner.aspx?t=SEARCH_WEBMASTER_REF_GuidelinesforOptimizingSite.htm\">guidelines for successful indexing<\/a> and <a href=\"http:\/\/search.msn.com.sg\/docs\/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm\">controlling which pages are indexed<\/a>.<br \/>\n<a href=\"http:\/\/www.robotstxt.org\/wc\/robots.html\">The web robots pages<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here&#8217;s something that no webmaster wants to see: It&#8217;s part of a screenshot from the Google Webmaster Tools that says &#8220;[Google] can&#8217;t currently access your home page because of a robots.txt restriction&#8221;. Arghh! This came about because, a couple of nights back, I made some changes to the website in order to remove the duplicate &hellip; <a href=\"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Removing duplicate search engine content using robots.txt<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[218],"tags":[25,29],"class_list":["post-772","post","type-post","status-publish","format-standard","hentry","category-technology","tag-search","tag-website-development"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Removing duplicate search engine content using 
robots.txt - markwilson.it<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Removing duplicate search engine content using robots.txt - markwilson.it\" \/>\n<meta property=\"og:description\" content=\"Here&#8217;s something that no webmaster wants to see: It&#8217;s part of a screenshot from the Google Webmaster Tools that says &#8220;[Google] can&#8217;t current access your home page because of a robots.txt restriction&#8221;. Arghh! This came about because, a couple of nights back, I made some changes to the website in order to remove the duplicate &hellip; Continue reading Removing duplicate search engine content using robots.txt\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm\" \/>\n<meta property=\"og:site_name\" content=\"markwilson.it\" \/>\n<meta property=\"article:published_time\" content=\"2007-04-17T21:42:45+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2012-04-02T22:48:24+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.markwilson.co.uk\/blog\/images\/google-robots-restrictions.png\" \/>\n<meta name=\"author\" content=\"Mark Wilson\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@markwilsonit\" \/>\n<meta name=\"twitter:site\" content=\"@markwilsonit\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Mark Wilson\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" 
\/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm\"},\"author\":{\"name\":\"Mark Wilson\",\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/#\\\/schema\\\/person\\\/98f61365e7c39d6be942174b8c4de468\"},\"headline\":\"Removing duplicate search engine content using robots.txt\",\"datePublished\":\"2007-04-17T21:42:45+00:00\",\"dateModified\":\"2012-04-02T22:48:24+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm\"},\"wordCount\":1142,\"commentCount\":28,\"publisher\":{\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/#\\\/schema\\\/person\\\/98f61365e7c39d6be942174b8c4de468\"},\"image\":{\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/images\\\/google-robots-restrictions.png\",\"keywords\":[\"Search\",\"Website Development\"],\"articleSection\":[\"Technology\"],\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm\",\"url\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm\",\"name\":\"Removing 
duplicate search engine content using robots.txt - markwilson.it\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/images\\\/google-robots-restrictions.png\",\"datePublished\":\"2007-04-17T21:42:45+00:00\",\"dateModified\":\"2012-04-02T22:48:24+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm#breadcrumb\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm#primaryimage\",\"url\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/images\\\/google-robots-restrictions.png\",\"contentUrl\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/images\\\/google-robots-restrictions.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/2007\\\/04\\\/removing-duplicate-search-engine-content-using-robotstxt.htm#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Removing duplicate search engine content using 
robots.txt\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/\",\"name\":\"markwilson.it\",\"description\":\"get-info -class technology | write-output &gt; \\\/dev\\\/web\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/#\\\/schema\\\/person\\\/98f61365e7c39d6be942174b8c4de468\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-GB\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/#\\\/schema\\\/person\\\/98f61365e7c39d6be942174b8c4de468\",\"name\":\"Mark Wilson\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\\\/\\\/i0.wp.com\\\/www.markwilson.co.uk\\\/blog\\\/uploads\\\/image-4.png?fit=800%2C800&ssl=1\",\"url\":\"https:\\\/\\\/i0.wp.com\\\/www.markwilson.co.uk\\\/blog\\\/uploads\\\/image-4.png?fit=800%2C800&ssl=1\",\"contentUrl\":\"https:\\\/\\\/i0.wp.com\\\/www.markwilson.co.uk\\\/blog\\\/uploads\\\/image-4.png?fit=800%2C800&ssl=1\",\"width\":800,\"height\":800,\"caption\":\"Mark Wilson\"},\"logo\":{\"@id\":\"https:\\\/\\\/i0.wp.com\\\/www.markwilson.co.uk\\\/blog\\\/uploads\\\/image-4.png?fit=800%2C800&ssl=1\"},\"description\":\"A Chartered IT Professional, with recent experience in technology leadership, IT strategy and practice management roles, Mark Wilson is an Enterprise Architect in the Advisory and Management Group at risual. During a career spanning more than two decades, Mark has gained widespread recognition as an expert in his field including both industry and national press exposure. 
In addition to certifications from Microsoft, VMware, Red Hat, The Open Group and Axelos, Mark held a Microsoft Most Valuable Professional (MVP) award for three years and is now part of the MVP Reconnect programme. Mark is also well-known on social media and maintains an award-winning blog.\",\"sameAs\":[\"http:\\\/\\\/www.markwilson.co.uk\\\/\",\"https:\\\/\\\/www.instagram.com\\\/markwilsonuk\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/in\\\/markawilson\\\/\",\"https:\\\/\\\/x.com\\\/markwilsonit\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UCWHlZCoHRTocdvtrOJ2IL4A\"],\"url\":\"https:\\\/\\\/www.markwilson.co.uk\\\/blog\\\/author\\\/mark-wilson\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Removing duplicate search engine content using robots.txt - markwilson.it","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm","og_locale":"en_GB","og_type":"article","og_title":"Removing duplicate search engine content using robots.txt - markwilson.it","og_description":"Here&#8217;s something that no webmaster wants to see: It&#8217;s part of a screenshot from the Google Webmaster Tools that says &#8220;[Google] can&#8217;t current access your home page because of a robots.txt restriction&#8221;. Arghh! 
This came about because, a couple of nights back, I made some changes to the website in order to remove the duplicate &hellip; Continue reading Removing duplicate search engine content using robots.txt","og_url":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm","og_site_name":"markwilson.it","article_published_time":"2007-04-17T21:42:45+00:00","article_modified_time":"2012-04-02T22:48:24+00:00","og_image":[{"url":"https:\/\/www.markwilson.co.uk\/blog\/images\/google-robots-restrictions.png","type":"","width":"","height":""}],"author":"Mark Wilson","twitter_card":"summary_large_image","twitter_creator":"@markwilsonit","twitter_site":"@markwilsonit","twitter_misc":{"Written by":"Mark Wilson","Estimated reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm#article","isPartOf":{"@id":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm"},"author":{"name":"Mark Wilson","@id":"https:\/\/www.markwilson.co.uk\/blog\/#\/schema\/person\/98f61365e7c39d6be942174b8c4de468"},"headline":"Removing duplicate search engine content using robots.txt","datePublished":"2007-04-17T21:42:45+00:00","dateModified":"2012-04-02T22:48:24+00:00","mainEntityOfPage":{"@id":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm"},"wordCount":1142,"commentCount":28,"publisher":{"@id":"https:\/\/www.markwilson.co.uk\/blog\/#\/schema\/person\/98f61365e7c39d6be942174b8c4de468"},"image":{"@id":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm#primaryimage"},"thumbnailUrl":"https:\/\/www.markwilson.co.uk\/blog\/images\/google-robots-restrictions.png","keywords":["Search","Website 
Development"],"articleSection":["Technology"],"inLanguage":"en-GB","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm","url":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm","name":"Removing duplicate search engine content using robots.txt - markwilson.it","isPartOf":{"@id":"https:\/\/www.markwilson.co.uk\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm#primaryimage"},"image":{"@id":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm#primaryimage"},"thumbnailUrl":"https:\/\/www.markwilson.co.uk\/blog\/images\/google-robots-restrictions.png","datePublished":"2007-04-17T21:42:45+00:00","dateModified":"2012-04-02T22:48:24+00:00","breadcrumb":{"@id":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm"]}]},{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm#primaryimage","url":"https:\/\/www.markwilson.co.uk\/blog\/images\/google-robots-restrictions.png","contentUrl":"https:\/\/www.markwilson.co.uk\/blog\/images\/google-robots-restrictions.png"},{"@type":"BreadcrumbList","@id":"https:\/\/www.markwilson.co.uk\/blog\/2007\/04\/removing-duplicate-search-engine-content-using-robotstxt.htm#breadcrumb","itemListElement":[{"@typ
e":"ListItem","position":1,"name":"Home","item":"https:\/\/www.markwilson.co.uk\/blog"},{"@type":"ListItem","position":2,"name":"Removing duplicate search engine content using robots.txt"}]},{"@type":"WebSite","@id":"https:\/\/www.markwilson.co.uk\/blog\/#website","url":"https:\/\/www.markwilson.co.uk\/blog\/","name":"markwilson.it","description":"get-info -class technology | write-output &gt; \/dev\/web","publisher":{"@id":"https:\/\/www.markwilson.co.uk\/blog\/#\/schema\/person\/98f61365e7c39d6be942174b8c4de468"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.markwilson.co.uk\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-GB"},{"@type":["Person","Organization"],"@id":"https:\/\/www.markwilson.co.uk\/blog\/#\/schema\/person\/98f61365e7c39d6be942174b8c4de468","name":"Mark Wilson","image":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/i0.wp.com\/www.markwilson.co.uk\/blog\/uploads\/image-4.png?fit=800%2C800&ssl=1","url":"https:\/\/i0.wp.com\/www.markwilson.co.uk\/blog\/uploads\/image-4.png?fit=800%2C800&ssl=1","contentUrl":"https:\/\/i0.wp.com\/www.markwilson.co.uk\/blog\/uploads\/image-4.png?fit=800%2C800&ssl=1","width":800,"height":800,"caption":"Mark Wilson"},"logo":{"@id":"https:\/\/i0.wp.com\/www.markwilson.co.uk\/blog\/uploads\/image-4.png?fit=800%2C800&ssl=1"},"description":"A Chartered IT Professional, with recent experience in technology leadership, IT strategy and practice management roles, Mark Wilson is an Enterprise Architect in the Advisory and Management Group at risual. During a career spanning more than two decades, Mark has gained widespread recognition as an expert in his field including both industry and national press exposure. 
In addition to certifications from Microsoft, VMware, Red Hat, The Open Group and Axelos, Mark held a Microsoft Most Valuable Professional (MVP) award for three years and is now part of the MVP Reconnect programme. Mark is also well-known on social media and maintains an award-winning blog.","sameAs":["http:\/\/www.markwilson.co.uk\/","https:\/\/www.instagram.com\/markwilsonuk\/","https:\/\/www.linkedin.com\/in\/markawilson\/","https:\/\/x.com\/markwilsonit","https:\/\/www.youtube.com\/channel\/UCWHlZCoHRTocdvtrOJ2IL4A"],"url":"https:\/\/www.markwilson.co.uk\/blog\/author\/mark-wilson"}]}},"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":773,"url":"https:\/\/www.markwilson.co.uk\/blog\/2007\/05\/adding-a-meaningful-description-to-web-pages.htm","url_meta":{"origin":772,"position":0},"title":"Adding a meaningful description to web pages","author":"Mark Wilson","date":"Tuesday 8 May 2007","format":false,"excerpt":"One of the things that I noticed whilst reviewing the Google results for this site, was how the description for every page was shown using the first text available on the page - mostly the alternative text for the masthead photo (\"Winter market scene from the small town of Porjus\u2026","rel":"","context":"In \"Search\"","block_context":{"text":"Search","link":"https:\/\/www.markwilson.co.uk\/blog\/tag\/search"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":687,"url":"https:\/\/www.markwilson.co.uk\/blog\/2007\/02\/improving-search-engine-placement.htm","url_meta":{"origin":772,"position":1},"title":"Improving search engine placement (without breaking the rules)","author":"Mark Wilson","date":"Tuesday 6 February 2007","format":false,"excerpt":"Search engine optimisation (SEO) has a bad reputation. That's tough for SEOs but unfortunately it's a side-effect of black hat SEO techniques. I haven't knowingly used any SEO techniques as this blog is really just a hobby of mine. 
I enjoy writing for it, find it a good place to\u2026","rel":"","context":"In \"Search\"","block_context":{"text":"Search","link":"https:\/\/www.markwilson.co.uk\/blog\/tag\/search"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":328,"url":"https:\/\/www.markwilson.co.uk\/blog\/2006\/07\/helping-spiders-to-crawl-around-my-bit.htm","url_meta":{"origin":772,"position":2},"title":"Helping spiders to crawl around my bit of the web","author":"Mark Wilson","date":"Tuesday 25 July 2006","format":false,"excerpt":"A few months back, I blogged that my Google PageRank had fallen through the floor on certain pages. I was also concerned that the Google index only contained about half the content on my website. I don't engage in search engine optimisation, but I have found out a few things\u2026","rel":"","context":"In \"Search\"","block_context":{"text":"Search","link":"https:\/\/www.markwilson.co.uk\/blog\/tag\/search"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":264,"url":"https:\/\/www.markwilson.co.uk\/blog\/2006\/03\/why-have-some-of-my-pageranks-dropped.htm","url_meta":{"origin":772,"position":3},"title":"Why have some of my PageRanks dropped?","author":"Mark Wilson","date":"Wednesday 22 March 2006","format":false,"excerpt":"It's well known that the Google index is based on the PageRank system, which can be viewed using the Google Toolbar. 
But something strange has happened on this blog - the main blog entry page has a PageRank of 5, the parent website has a PageRank of 4, but the\u2026","rel":"","context":"In \"Search\"","block_context":{"text":"Search","link":"https:\/\/www.markwilson.co.uk\/blog\/tag\/search"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":491,"url":"https:\/\/www.markwilson.co.uk\/blog\/2005\/05\/how-to-take-part-in-some-time-travel.htm","url_meta":{"origin":772,"position":4},"title":"How to take part in some time travel","author":"Mark Wilson","date":"Thursday 26 May 2005","format":false,"excerpt":"So you thought that old version of your website was gone forever? It may have been a little naive of me, but I figured that once I put up a new version of my website, then that was it, the old one was overwritten. Not so, it seems - today\u2026","rel":"","context":"In \"Useful Websites\"","block_context":{"text":"Useful Websites","link":"https:\/\/www.markwilson.co.uk\/blog\/tag\/useful-websites"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":716,"url":"https:\/\/www.markwilson.co.uk\/blog\/2007\/02\/the-search-engine-friendly-way-to-merge-domains.htm","url_meta":{"origin":772,"position":5},"title":"The search engine friendly way to merge domains","author":"Mark Wilson","date":"Tuesday 27 February 2007","format":false,"excerpt":"In common with many website owners, I have multiple domain names pointing at a single website (markwilson.co.uk, markwilson.me.uk and markwilson.it). 
There's nothing wrong with that (it's often used to present localised content or to protect a trademark) but certain search engines will penalise sites where it appears that multiple URLs\u2026","rel":"","context":"In \"Domain Names\"","block_context":{"text":"Domain Names","link":"https:\/\/www.markwilson.co.uk\/blog\/tag\/domain-names"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.markwilson.co.uk\/blog\/wp-json\/wp\/v2\/posts\/772","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.markwilson.co.uk\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.markwilson.co.uk\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.markwilson.co.uk\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.markwilson.co.uk\/blog\/wp-json\/wp\/v2\/comments?post=772"}],"version-history":[{"count":9,"href":"https:\/\/www.markwilson.co.uk\/blog\/wp-json\/wp\/v2\/posts\/772\/revisions"}],"predecessor-version":[{"id":3821,"href":"https:\/\/www.markwilson.co.uk\/blog\/wp-json\/wp\/v2\/posts\/772\/revisions\/3821"}],"wp:attachment":[{"href":"https:\/\/www.markwilson.co.uk\/blog\/wp-json\/wp\/v2\/media?parent=772"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.markwilson.co.uk\/blog\/wp-json\/wp\/v2\/categories?post=772"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.markwilson.co.uk\/blog\/wp-json\/wp\/v2\/tags?post=772"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}