I’m still trying to figure out what is going on here and why the Guardian is publishing all this duplicate content. Among major UK publishers, it appears to be on its own: nobody else seems to be doing this.
OK, here are some facts about it:
The peak for all this appears to be early in the morning and mainly involves content that has appeared in the newspaper overnight. Most of this morning’s stories on the guardian.co.uk portal seem to have been published an average of seven times. From mid-morning onwards, the average falls back to two or three.
The most heavily duplicated stories tend to be listed as main articles on Google News in the morning.
Duplicated articles carry the same Hitbox code - Hitbox is the statistics provider which tracks the usage of guardian.co.uk’s content and is used to determine the number of uniques, page views etc. Here’s the code on seven versions of the same story about Israel planning to free Palestinian prisoners:
What is going on here? Is this just a story taken from an RSS feed and duplicated repeatedly because of some quirk in the Guardian’s CMS? Or is it something more calculated, a tactic that has something to do with how a story is displayed in Google News? I wonder about the latter because Google News is not nearly so fussy about duplicate content as the main Google search, and the higher frequency of guardian.co.uk’s republishing early in the morning seems to affect Google News rankings.
Perhaps somebody from guardian.co.uk would care to clear all this up?

It looks like the RSS feeds are being crawled, and each feed link carries a URL parameter (feed=society, for example) to identify which feed sent the traffic to the article. I’d imagine this is purely about internal stats for feed traffic and nothing more. It’s not uncommon for pages to get indexed multiple times with slightly different URL parameters. Usually Google is smart enough to tell what’s going on, though, and spits out the following message:
In order to show you the most relevant results, we have omitted some entries very similar to the 1 already displayed.
If you like, you can repeat the search with the omitted results included.
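
To illustrate the point about URL parameters, here is a rough Python sketch of how a crawler might collapse such duplicates: it strips the feed parameter and groups URLs that then become identical. This is only an illustration, not how Google or the Guardian actually handle it. The example.com URLs, the function names and the assumption that feed is the only tracking parameter are all made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "feed" is the only tracking parameter. The name is taken from
# the feed=society example; any other query parameters are left untouched.
TRACKING_PARAMS = {"feed"}


def canonical_url(url):
    """Return the URL with tracking parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


def group_duplicates(urls):
    """Map each canonical URL to the list of original URLs that share it."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_url(url), []).append(url)
    return groups


if __name__ == "__main__":
    # Hypothetical URLs, not real guardian.co.uk addresses.
    crawled = [
        "http://www.example.com/story.html?feed=society",
        "http://www.example.com/story.html?feed=world",
        "http://www.example.com/story.html",
        "http://www.example.com/other.html?page=2&feed=uk",
    ]
    for canonical, originals in group_duplicates(crawled).items():
        print(canonical, len(originals))
```

Any group with more than one entry is a single article reachable under several addresses, which is the situation described above.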