Monday, January 31, 2011

Duplicate Content Myths & Facts


I see a lot of people misunderstanding the whole duplicate content thing. Some say there is no such thing and that if you start an autoblog you will get rich; others swear that Google loves unique content and you have to write everything by hand. As usual, the truth is somewhere in between.
What exactly is duplicate content?
This is where most people get it wrong. Search engines define duplicate content as two or more pages on the same site that have identical or very similar content. Some people, though, lump identical or very similar content that exists on different sites into the same category. For clarity, from here on I will call duplicate content on pages of the same site “duplicate” and duplicate content on different sites “non-original”. By “non-unique” I will mean either type, regardless of whether it is on the same site or on different sites, and by “unique” I will mean content that is neither.
Quality is the reason
There is no penalty in either case. Penalty means punishment. Punishment would mean search engines see you have non-unique content, decide you’ve been a bad boy and spank you. That is not the case. The real reason why search engines have an issue with non-unique content is that they want to offer quality in their SERPs. The purpose of a search engine (SE) is to give you answers to your search.
Let’s take an example situation where you search for “digital camera”. What are you really searching for? If you refined the search, you could make it any of the following:
  • buy digital camera
  • digital camera shopping
  • digital camera store
  • digital camera reviews
  • digital camera specs
  • best digital camera
  • compact digital camera
  • dslr digital camera
Now, the key is to determine user intent. You may be interested in information (specs, reviews, which is the best model), or perhaps you want to buy one (buy, shopping), or maybe you have a specific type of camera in mind (compact, dslr). If you search for one of the variations in that list, the SE knows better what you’re looking for. If you search for “digital camera” however, the SE doesn’t know what kind of information you want. It doesn’t know much about your intent, besides the fact that it is related to digital cameras.
In order to make sure you get the result you were looking for, a SE will try to include all types of results in its SERPs. You will get a few online retailers like Amazon.com, a few review sites like reviews.cNet.com, generic sites like Wikipedia, some YouTube videos in case you prefer a video over reading, etc. Basically, the SE gives the user a comprehensive result.
Now, imagine you just searched for “digital camera” and the SE returned only stores. If you want to buy, that’s perfect, but if you want to read reviews you think “WTF, I don’t want this stuff, I am not ready to buy”. That’s the mildest form of getting results that are not useful. In a way it’s like being served salad, then salad, then salad again instead of a steak with salad and then dessert. Now take this a step further: instead of getting 10 shopping results with different information, prices, etc., you get 10 sites that are entirely identical to each other. What’s the point of getting 10 different results if all the sites have the same price and give you the same information?!
This is exactly why a SE will do its best to give you results that are different from each other. Even if you search for a very specific long tail keyword like “Canon T2i Digital Camera Review”, you don’t want 10 results that point to different sites having the same information. You want to read more than one opinion about that particular digital camera.
Duplicate content (on the same site)
Since a SE will only include one result from each site in its SERPs, if you have two or more pages with the same content on your site, it has to pick one and discard the others. Of course you can get a double listing, where the SE will show a second, indented URL from your site. However, the important thing to note in this case is that the SE will show a second page only if it thinks it is a good complementary result to the first one. It will never show the URL of a duplicate page as the secondary indented result because that makes no sense.
What you have to do in this case is make sure that the SE will show the URL you want. You may get duplicate pages on your site because of the tag listings in the case of blogs, or because of dynamic GET variables (e.g. product_listing.php?min_price=100&max_price=200) which actually return the same items as a different set of variables (e.g. product_listing.php?min_price=0&max_price=300). Those are not the only ways to end up with duplicate content on your site. There are countless ways. I won’t go into details about what you should do to avoid having duplicate content on your site.
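To make the GET-variable case concrete, here is a minimal sketch of how you could spot listing URLs that return exactly the same items. It assumes you can get the list of item IDs each URL displays; the function names and the item IDs are made up for illustration and are not taken from any particular shop software.

```python
import hashlib
from collections import defaultdict


def listing_fingerprint(item_ids):
    """Hash the set of item IDs shown on a listing page (order does not matter)."""
    joined = ",".join(str(i) for i in sorted(set(item_ids)))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def group_duplicate_listings(pages):
    """Return groups of URLs whose listings contain exactly the same items."""
    groups = defaultdict(list)
    for url, item_ids in pages.items():
        groups[listing_fingerprint(item_ids)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical data: URL -> item IDs displayed by that URL.
pages = {
    "product_listing.php?min_price=100&max_price=200": [17, 23, 42],
    "product_listing.php?min_price=0&max_price=300": [42, 17, 23],
    "product_listing.php?min_price=300&max_price=900": [5, 8],
}
duplicates = group_duplicate_listings(pages)
```

Each group in `duplicates` is a set of URLs that are duplicates of each other as far as the displayed items go, so you can decide which single URL of the group you want the SE to show.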
There are many good guides about this topic, information architecture and usability. One thing I will mention, because many people disregard it, is that you are very likely to be better off with the tag listings marked “dofollow, noindex” (in robots meta terms, “noindex, follow”). This applies mostly to blogs. It also helps to add a description of about 100-150 words on your category pages or any listing pages, so even if most of the posts in the listing end up being the same, that description will make the page a bit different. Most importantly, have different page titles (for pagination you can add a “page 2” after the title). As I said, I won’t go more into it here. Information architecture is however a very important aspect when it comes to both SEO and usability.
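As a rough illustration of the tag-listing and page-title advice, here is a small sketch. It assumes the advice maps to the standard robots meta value `noindex, follow`; the function names and the `page_type` values are hypothetical and not part of any blog platform.

```python
from html import escape


def robots_meta(page_type):
    """Keep tag listings out of the index while still letting links be followed."""
    if page_type == "tag":
        return '<meta name="robots" content="noindex, follow">'
    return '<meta name="robots" content="index, follow">'


def page_title(base_title, page_number=1):
    """Give every paginated listing page its own title."""
    if page_number > 1:
        title = f"{base_title} - page {page_number}"
    else:
        title = base_title
    return f"<title>{escape(title)}</title>"
```

For example, `page_title("Digital Cameras", 2)` produces a title ending in “page 2”, so the second page of a listing no longer shares its title with the first.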
Non-original content (on different sites)
If you use non-original content on your site (e.g. you “stole” it from some other site) and you don’t rank, it means you’re penalized, right? NO! If you’ve been following along, you should already know why you don’t rank. If the source site (the one you stole the content from) already ranks with that content, there is no way that you will rank for the same keywords. The keywords in the text are one major ranking factor and backlinks are the other, so with identical text, if you don’t have more and/or better targeted backlinks than the source, you won’t rank for any keyword. Bottom line is that if the source ranks for a keyword, you won’t rank for it, or you will rank way below the source site (the source ranks at #5, you rank at #250).
That means, in order to rank with non-original content, you have to beat all the other sites (that use the same content) with backlinks. This is a really bad idea, because unless you post this non-original content on a domain with high trust rank and authority and at the same time build strong and/or many relevant backlinks, you will never outrank the source site, or whatever site has the same content and does a better job at link building and trust rank growth.
If you have a 1-month-old site with a relatively small number of backlinks and about 100 pages, and Wikipedia steals all 100 pages and publishes them on wikipedia.org, you can bet your right arm they will rank #1 for any keyword related to those pages and you won’t rank anywhere. They don’t even need external backlinks towards those specific pages; their trust rank and domain authority are enough.
Some content is inherently non-original
The SE engineers naturally realized that some content will get copied a lot on many sites. It is the natural life of such content. Think about product specifications, news, press releases, quotes, etc. A digital camera has certain specifications released by the manufacturer. While you can change a word here and there, most of it stays the same. News gets published on a bunch of sites, especially short news. If Barack Obama said something, you have to quote him as he said it. You can’t change his words.
Mashups
An interesting breed of sites with non-original content are the mashup sites. They basically construct a page with content on a given topic by combining small excerpts from multiple sites that cover that topic. While this results in non-original content, it is actually very useful for visitors because it gives them a collection of summaries about the topic of interest.
Imagine you want to buy a digital camera. You could manually search for reviews on a SE, then search for specs, then search for stores and try to find the best price. It’s time consuming. Instead you could go to AlaTest.com or TestFreaks.com, which are mashups of reviews, and get all the info you need in one place. It’s comprehensive, you can go to the source site to read more in depth if you wish, and you save time.
The content of mashups is not a perfect copy of some specific page of another site, but a collection of fragments from multiple sites. There is a big difference between this and how most people implement autoblogs, which is by copying the excerpt of one specific post only from its RSS feed.
What you should remember from this is that if you want to build autoblogs, you should think mashups instead. It is significantly more difficult to implement a mashup system but if you do it right, it is worth every minute.
Spun content
Spun content, if done well, can look original enough to make it in the SERPs. Of course, the “if done well” is the key. Most people don’t invest the proper time to manually write complex, multi-level seeds. They just replace words with their synonyms, which means the number of good (original enough) spins is very low. Since they are eager followers of the “get rich quick” mentality, they will generate way more spins than is optimal, reducing the uniqueness of ALL spins (results). If you write a complex seed article and spin it within the optimal limits however, this technique can give you a lot of content for a fraction of what you’d pay somebody to write it.
A few things to keep in mind regarding spun content:
  • It is extremely difficult to develop an algorithm and software to accurately detect non-original content across the entire Web. Done naively, it would require processing power that arguably not even Google has, because you would literally have to compare every web page to every other page on the web. However, there are shortcuts which, though not great, can be implemented using much lower processing power. They generally rely on some sort of statistical natural language processing (generally using N-gram patterns). Using such an approach, it is actually extremely easy and not resource intensive to detect content resulting from too many spins.
  • It is relatively easy to detect incorrect grammar. Most content is not perfectly grammatical, since not even native English writers are 100% grammatically correct on the web. However, when a very large percentage of all the text on a site contains grammar mistakes, there’s a high statistical chance that something is fishy with it.
What do you want from your content?
When trying to decide whether to use non-original content or write original content, you should think about what you require from that content. If you need content for a high quality site with a lot of traffic (a.k.a. potential customers), you obviously want content that converts. If you want content for your link network/wheel/pyramid/spherical-cube-with-5-edges and you don’t need those sites to convert but just sit at the bottom of the “food chain”, feel free to use non-original content. Just remember that mashups, mixed, spun or any type of “randomization” (for lack of a better term) is much better than just scraping the excerpt from an RSS feed or stealing the entire article as-is from an article directory.
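To give an idea of what an N-gram shortcut can look like, here is a minimal sketch that compares two texts by their overlapping word trigrams (Jaccard similarity of “shingles”). It is only an illustration of the general idea, not how any search engine actually does it; the function names and sample sentences are invented for the example.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) found in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(text_a, text_b, n=3):
    """Share of n-grams the two texts have in common (0.0 to 1.0)."""
    a, b = shingles(text_a, n), shingles(text_b, n)
    union = a | b
    if not union:
        return 0.0  # both texts are shorter than n words
    return len(a & b) / len(union)


original = "this compact digital camera takes sharp photos in low light"
spun = "this compact digital camera takes crisp photos in low light"
unrelated = "the stock market closed higher on friday after a volatile week"

spun_score = jaccard_similarity(original, spun)
unrelated_score = jaccard_similarity(original, unrelated)
```

A lazy synonym swap leaves most trigrams untouched, so `spun_score` stays well above `unrelated_score`. That is why spinning by replacing a word here and there is easy to catch, while a complex multi-level seed changes far more of the N-grams.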
By the way, “spherical-cube-with-5-edges” is a new, extremely powerful linking scheme that I developed. It relies on principles of quantum mechanics and the theory of relativity. Yeah, I’m just messing with you, genius… I smell some melted brain there.
Going black hat
WARNING: This is ILLEGAL. It is copyright infringement. Anyway, I bet some of you already do it, so I’ll tell you how to do it properly.
Have you ever stolen articles from article directories or even from the source sites? Many people do it. Funny thing is that besides being illegal, because you’re stealing somebody’s work, it doesn’t help them much either. That’s because their timing is wrong.
They go and search for “my super duper keyword” on EzineArticles or whatever, pick 20-30 articles and dump them onto their new blog. They remove any link from the article (resource box) and place their own links in the body of the article. Then they repeat the process. Maybe, if they are more skilled when it comes to coding, they build an automated system to do it while they sleep. Regardless of how they do it, it is not too efficient. If you’ve been paying attention you should know why. They basically copy the full article as-is and, as I explained way above, end up competing against the source site (e.g. EzineArticles) solely in the backlinks arena. Good luck with that, you will need it. If that article is old, a SE is pretty sure you’re not the source. If it got picked up by other sites and published on them, you don’t compete just with the source site, but with the others too. You’re screwed! The link you get from that page won’t be entirely useless, but not too helpful either.
There is one thing you can do however: monitor new sites that pop up in the SERPs for your keywords (note I said new sites, not new pages of old sites) and also new articles on article directories (make a script to monitor them or their RSS feeds). Steal the content as usual, then throw some links to that page quickly. Don’t build crappy links though. No nofollow crap or profiles that take long to get indexed by the SE.
The goal is to get that new page indexed fast and make the SE also find the links towards it pretty fast. How fast? If you can get indexed in 15 minutes, you’re in the game. It works up to a few days too, just not as well. If you get indexed before the source site and the SE picks up the backlinks too, you have a chance of actually ranking higher. Otherwise, you won’t rank higher than the source but you will outrank the other sites that steal the article later and are too lazy to build links to it.
Ideally, you don’t want to rank higher than the source because people might complain. Heck, it doesn’t even matter if you rank in the first 200 positions. But if you do it like this, the links from that stolen article on your site are more valuable.
There are some things you have to keep in mind:
  • If your entire site has content built like this, it won’t do too great. Not even when it comes to the backlink value. Google will simply see you’re “fishy” and not love you much. However, if you mix this technique with some properly spun content, some mashup-style pages, etc., Google won’t have an easy time figuring out if it’s a spam blog or a normal site.
  • It’s not really worth doing it if you do it by hand and/or with your own domains. Unless you automate the whole thing and use junk sites (e.g. WPMU hosts, free blog platforms), it is too much effort for too little gain.
That’s all folks! I hope it makes things clearer for you and helps you build better strategies. When you make your 1st million dollars don’t forget to send me a bottle of Jack Daniels.
