Indexing timeline (old, probably outdated)

Warning: This page is pretty old and was imported from another website. Take any content here with a grain of salt.

… or examining a Google automated spam penalty

Matt Cutts, a Google engineer, explained on his personal blog how off-topic and affiliate links can change the Google crawler's “priority” for a site, or even get it deindexed. The examples shown were quite extreme, but what surprised me was this part:

The person said that every page has original content, but every link that I clicked was an affiliate link that went to the site that actually sold the T-shirts.

(from http://www.mattcutts.com/blog/indexing-timeline/ )

Is Google really rating content based on the links on a page? Can a site get a penalty because all of its out-links are affiliate links? What does the penalty look like, and how can you tell that the site is penalised?

Personally, I feel there are two problems with this system:

  1. if a webmaster chooses to monetise his site with affiliate links, he should be allowed to do so.
  2. this is giving Adsense an unfair advantage - since it does not seem to be targeted.

Yes, there are sites that abuse the affiliate idea, but then again, there are sites made for Adsense as well. Either both should be allowed or both should be penalised. Make a choice.

But let’s get back to the “(non-Adsense) spam penalty”. The site I mentioned earlier (http://a380-watch.org/ (archive.org)) was receiving about 1000 uniques/day from Google. As an experiment, I changed the front page to use an Amazon “Self optimising links” block (contextual matches from Amazon's product database) and removed the Adsense blocks (Adsense is not allowed together with other forms of contextual advertising). Additionally, I hand-picked one book from their database which is available on amazon.de and amazon.co.uk (the two main groups visiting the site) and linked to it three times per Amazon site on a single page (2 separate links, each shown 3 times). There is only one other out-link on the whole site, and it is “protected” with a rel=nofollow tag. All 6 out of 6 out-links were now Amazon affiliate links, and in addition an Amazon script block was on the front page. No other changes were made. All pages (except the front) have 2-3 Adsense blocks on them (plus a Google “related links” box) and have had these blocks since the start of the site.

In less than 2 days the site disappeared from Google's search results. The site seemed to be penalised for advertising.

Exactly 30 days later, the site came back, with just the front page indexed. A perfect 30-day “spam penalty”. The cache of the page is dated a day before the penalty took effect.

About a day after I noticed that the site had been removed from Google, I removed the Amazon script block (and restored Adsense) on the front page and reduced the Amazon affiliate links to two.

During the time of the penalty, the site was still being crawled by the Googlebot, albeit mostly the front page and not the other pages. The sitemap file (I used a simple text listing of the URLs) and the robots.txt were also crawled daily.
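Crawl activity like this can be checked by tallying Googlebot requests per URL and per day from the server log. The following is only a minimal sketch: it assumes space-separated log lines in the format used in the timeline further down (date, time, IP, path, user agent), and the function names are made up for illustration. The sample lines are the five crawls listed in that timeline.

```python
from collections import Counter
from datetime import datetime

# Sample lines copied from the timeline on this page.
SAMPLE_LOG = [
    "18.05.2006 14:09:25 66.249.66.168 /index.htm Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)",
    "18.05.2006 15:26:21 66.249.66.168 /sitemap.txt Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)",
    "18.05.2006 22:07:21 66.249.66.168 /spot.htm Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)",
    "19.05.2006 02:15:02 66.249.66.168 /index.htm Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)",
    "19.05.2006 02:23:13 66.249.66.168 /index.htm Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)",
]


def parse_line(line):
    """Split one log line into (timestamp, ip, path, user_agent)."""
    # Assumed field order: date, time, IP, path, then the rest is the user agent.
    date_part, time_part, ip, path, user_agent = line.split(" ", 4)
    timestamp = datetime.strptime(f"{date_part} {time_part}", "%d.%m.%Y %H:%M:%S")
    return timestamp, ip, path, user_agent


def googlebot_hits(lines):
    """Count Googlebot requests per path and per calendar day."""
    per_path = Counter()
    per_date = Counter()
    for line in lines:
        timestamp, _ip, path, user_agent = parse_line(line)
        # Simple substring match on the user agent; it does not verify
        # that the request really came from Google.
        if "Googlebot" not in user_agent:
            continue
        per_path[path] += 1
        per_date[timestamp.date()] += 1
    return per_path, per_date


per_path, per_date = googlebot_hits(SAMPLE_LOG)
for path, count in per_path.most_common():
    print(f"{path:15} {count}x")
for day, count in sorted(per_date.items()):
    print(f"{day:%d.%m.%Y}      {count}x")
```

Run against a full log, the same tally would give the per-URL counts needed to compare the crawl frequency before and during a penalty.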

It is not possible to deduce a general behaviour from this one site. It would need to be tested to be certain. But it seems that:

  1. Google is really rating (and penalising) sites based on their outbound links.
  2. Affiliate links and/or affiliate script blocks can count towards a “spam penalty”.
  3. Adsense (mis-)usage is irrelevant.
  4. A small number of pages can trigger a penalty for the whole site (in this case 2 out of over 20 pages).
  5. When Google penalises a site with a “spam penalty” it will still crawl the site, though perhaps less frequently.

From this test it was not possible to tell if Google only applies a timed penalty and checks the site when it is indexed again or if Google checks the site beforehand.

The indexing timeline at Google

For those interested in the numbers and times, here is a concentrated view of what happened. All times are in UTC.

18.05.2006 13:00:00 (approx.) Added Amazon links, Amazon script on front page and removed Adsense on the front page.

Crawls by Google after the change of content. Only the Amazon URLs (/index.htm with a script, /spot.htm with several links), robots.txt and sitemap.txt are included here; the other URLs are not shown. Though the Mediabot now uses the same shared cache, the Googlebot covered enough to merit concentrating our efforts on it.

18.05.2006 14:09:25 66.249.66.168 /index.htm Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)

18.05.2006 15:26:21 66.249.66.168 /sitemap.txt Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)

18.05.2006 22:07:21 66.249.66.168 /spot.htm Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)

19.05.2006 02:15:02 66.249.66.168 /index.htm Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)

19.05.2006 02:23:13 66.249.66.168 /index.htm Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)

19.05.2006 14:00:00 (approx.) The site is not performing as it should: it is not found across most datacenters with a site:-query and not listed in the SERPs

19.05.2006 21:50:00 Last valid search user via Google - site completely deindexed on Google

19.06.2006 02:32:00 (approx.) Site back in Google.com (only main URL though) - penalty duration: 30 days from find

The page is cached with a date of “18 May 2006 07:25:49 GMT”. (Psssst Google: “GMT” was replaced by “UTC” in 1972 (archive.org) )
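The durations mentioned above can be checked directly from the timeline. This is a small sketch using the three timestamps listed (content change, last search visitor, site back in the index); the variable names are only illustrative, and two of the times are marked as approximate in the timeline.

```python
from datetime import datetime

FMT = "%d.%m.%Y %H:%M:%S"

# Timestamps taken from the timeline above (all UTC).
content_changed = datetime.strptime("18.05.2006 13:00:00", FMT)      # approx.
last_search_visitor = datetime.strptime("19.05.2006 21:50:00", FMT)
back_in_index = datetime.strptime("19.06.2006 02:32:00", FMT)        # approx.

# Time from the content change until the site was gone from the results.
reaction = last_search_visitor - content_changed
# Time from the last search visitor until the site reappeared.
penalty = back_in_index - last_search_visitor


def describe(delta):
    """Format a timedelta as days, hours and minutes."""
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f"{delta.days} days, {hours} h {minutes} min"


print("Change to deindexing:", describe(reaction))
print("Deindexed for:       ", describe(penalty))
```

With these inputs the site was gone about 1 day and 9 hours after the change, and stayed out of the index for 30 days and just under 5 hours.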

Crawler activity during the 30-day penalty

Googlebot + Mediabot, all pages (incl. robots.txt, sitemap): 267 accesses (=8.9/day)

Googlebot (alone), all pages (incl. robots.txt, sitemap): 68 accesses (=2.2/day)

(all pages once, except /index.htm: 17x, /robots.txt: 12x, /sitemap.txt: 24x)

To compare, crawler activity in the 30 days before the penalty:

Googlebot + Mediabot, all pages (incl. robots.txt, sitemap): 272 accesses (=9.0/day)

Googlebot (alone), all pages (incl. robots.txt, sitemap): 136 accesses (=4.5/day)

(average 5x/page, /index.htm: 19x, /robots.txt: 12x, /sitemap.txt: 18x)

The average crawler activity on the sitemap.txt, robots.txt and front page remained almost stable; the activity on the other pages was reduced.
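To make the comparison easier to follow, here is the arithmetic as a short sketch. It uses only the Googlebot counts listed above and a 30-day period for both windows; the dictionary layout and names are just for illustration.

```python
DAYS = 30

# Googlebot-only access counts from the two 30-day windows above.
googlebot = {
    "before penalty": {"total": 136, "/index.htm": 19, "/robots.txt": 12, "/sitemap.txt": 18},
    "during penalty": {"total": 68, "/index.htm": 17, "/robots.txt": 12, "/sitemap.txt": 24},
}


def split_accesses(counts):
    """Return (core, other): front page + robots.txt + sitemap vs. all remaining pages."""
    core = counts["/index.htm"] + counts["/robots.txt"] + counts["/sitemap.txt"]
    return core, counts["total"] - core


summary = {}
for period, counts in googlebot.items():
    core, other = split_accesses(counts)
    summary[period] = (core, other)
    print(f"{period}: {counts['total'] / DAYS:.2f}/day total, "
          f"{core} core accesses, {other} on other pages")

ratio = googlebot["during penalty"]["total"] / googlebot["before penalty"]["total"]
print(f"Googlebot activity during the penalty: {ratio:.0%} of the level before")
```

The split shows where the drop came from: accesses to the front page, robots.txt and sitemap.txt went from 49 to 53, while accesses to all other pages fell from 87 to 15.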
