How does Google identify duped content?

Google Algorithm N-Grams

Everyone is always telling you that you’ve got to write unique content to make the most of backlinks and search rankings, but it takes a lot of time to write 500+ word articles, blog posts, and intelligent blog comments. We are all lazy, so you look for another route to take.

Then you read online about auto blogs, scrape programs, article rewriters, and all these really cool SEO tools that dominate the world. Quickly you start spending a few hundred dollars on these programs and blasting backlinks and flipped content everywhere on the net. Soon you have some success: traffic starts to grow, rankings improve, and you are moving up. But then, after using these tools, all of a sudden you see your beloved 1,000,000-page scraped site de-indexed.


How?  You were spinning your content!

Through N-grams, Google was able to match a very large percentage of your content to copyrighted content already published on the internet. No one is really sure how much of your content would need to be identified as matching, but because N-gram matching is inexact, it’s suggested that the figure would be well over 80%.

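What is an N-Gram?

Strictly speaking, an N-gram is any run of N consecutive words in a text. For spotting spun content, the N-grams that matter are:

any 3+ word phrases that hold the same meaning

different words but matching concept/meaning

matching word count in each phrase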

Examples of N-Gram phrases:

 “seo software downloads” =  “seo program downloads” = “seo tool downloads”

“backlink building strategies” = “backlink building techniques” = “backlink building tips”
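The equivalences above can be sketched in a few lines of Python. This is only a rough illustration of the idea, not how Google actually does it: the SYNONYMS table and the function names normalize and word_ngrams are made up for this example, and a real system would need a far bigger notion of "same meaning" than a small lookup table.

```python
import re

# Hypothetical synonym table (an assumption for this sketch):
# maps interchangeable words to one canonical word.
SYNONYMS = {
    "program": "software",
    "tool": "software",
    "techniques": "strategies",
    "tips": "strategies",
}

def normalize(text):
    """Lowercase the text, split it into words, and swap in canonical synonyms."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [SYNONYMS.get(w, w) for w in words]

def word_ngrams(text, n=3):
    """Return the set of normalized n-word phrases (as tuples) found in text."""
    words = normalize(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# All three phrases collapse to the same 3-word N-gram.
same = (
    word_ngrams("seo software downloads")
    == word_ngrams("seo program downloads")
    == word_ngrams("seo tool downloads")
)
print(same)  # True
```

Because the synonyms are replaced before the phrases are cut, swapping "program" for "tool" changes nothing in the result, which is exactly why single-word spinning would not hide a match under this approach.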

Using this technique, we can see how Google could evaluate a spun page of content and recognize it as a duplicate. If you only spin adjectives and single words in your article or blog post, then when the content is looked at through the N-gram lens it will still have matching conceptual phrases and meanings throughout its paragraphs. Different words, yes, but matching meanings in such a high percentage that it cannot be ignored.
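To make the percentage idea concrete, here is a minimal sketch of scoring a spun page against an original. Everything in it is an assumption for illustration: the SYNONYMS table, the function names phrases and match_percent, and the sample sentences are invented, and nothing here is a known Google formula or threshold.

```python
import re

# Hypothetical synonym table (an assumption for this sketch).
SYNONYMS = {
    "quick": "fast",
    "rapid": "fast",
    "methods": "strategies",
    "techniques": "strategies",
}

def phrases(text, n=3):
    """Set of n-word phrases after lowercasing and synonym replacement."""
    words = [SYNONYMS.get(w, w) for w in re.findall(r"[a-z0-9']+", text.lower())]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def match_percent(candidate, original, n=3):
    """Percent of the candidate's phrases that also appear in the original."""
    cand = phrases(candidate, n)
    if not cand:
        return 0.0  # too short to form even one phrase
    shared = cand & phrases(original, n)
    return 100.0 * len(shared) / len(cand)

original = "Use fast backlink building strategies to grow your traffic"
spun = "Use quick backlink building techniques to grow your traffic"
unrelated = "Our bakery sells fresh bread and pastries every single morning"

print(match_percent(spun, original))       # 100.0
print(match_percent(unrelated, original))  # 0.0
```

The spun sentence changes two words yet scores a full match, while the unrelated sentence scores nothing. A page that lands above whatever cutoff is chosen would then be a candidate for a closer look.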

These are not EXACT matching word phrases, BUT the meanings can be identified as extremely similar or exact matches. Google could use this formula for finding duped content by locating high percentages of matching phrase meanings in content from the same niche/market/industry. It could then manually review certain sites that may be generating high AdSense performance or putting thousands upon thousands of pages on the search engines quickly. We all know how easy it is to spot spun or duped content when you read it: you can identify it immediately.
