Results from our Sitemaps study: Site B (old, probably outdated)

Warning: This page is pretty old and was imported from another website. Take any content here with a grain of salt.

Site B is a mixture of Site A (Google Sitemaps, no links) and Site C (Adsense, GoogleBar, no Sitemaps, no links). Site B uses Google Sitemaps along with Adsense blocks, and is visited regularly by a virtual visitor using Microsoft Internet Explorer with the GoogleBar plug-in.

Seeing that neither Site A nor Site C was indexed properly by Google, we can only assume that Site B will not be indexed either. Unless … Adsense and the GoogleBar help Google Sitemaps overcome the activation threshold?

Nope. Google Sitemaps, even combined with Adsense and GoogleBar visitors, is not enough to get a site indexed (other than the main URL “/”). The main URL did get indexed pretty fast though, so it seems safe to assume that Google did not put the site into the famed “sandbox” first. Are we doing something right (by accident), or does it check for special keywords (which we missed, by accident)?

The Google Sitemap file was read in more or less the same way as for Site A; the patterns we saw there were repeated, which makes it look like a general update in the system (at the time the changes in how the sitemap file is crawled took place) rather than a simple fluke on one site.
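
For anyone who hasn't seen one: the sitemap file itself is just an XML list of URLs. Here is a minimal sketch of how such a file could be generated; the URLs and the output filename are made up for illustration (not the actual entries of our test site), and the schema shown is the Google Sitemaps 0.84 one.

    # Minimal sketch: write a Google Sitemaps XML file for a list of URLs.
    # The URLs and the output filename are made up for illustration.
    from xml.sax.saxutils import escape

    urls = [
        "http://www.example.com/",
        "http://www.example.com/level1/page-a.html",
        "http://www.example.com/level1/level2/page-b.html",
    ]

    entries = "\n".join(
        "  <url>\n    <loc>%s</loc>\n  </url>" % escape(u) for u in urls
    )

    with open("sitemap.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">\n')
        f.write(entries + "\n")
        f.write('</urlset>\n')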

What happened? Here’s a short timeline (Google only):

  • T+0 days: Site went live
  • T+2.8d: Main URL “/” is listed in Google, without description, not crawled.
  • T+7.3d: GoogleBot crawls “/”
  • T+11.2d: Main URL is fully listed in Google, with description, ranking nr. 11 for our keyword
  • T+22.1d: Main URL drops out of index completely
  • T+25.1d: Main URL is back in the index, without description
  • T+29.4d: GoogleBot crawls “/”
  • T+35.8d: Main URL drops out of index, never to be seen again
  • T+43.3d: GoogleBot does something mystical: it crawls a 3rd level URL, one it could only have seen from the sitemap file or from the Adsense-Bot (which spotted the 2nd level URL above it at T+18.9d) - more on that below
  • T+46.0d: GoogleBot crawls “/”
  • T+46.0d: GoogleBot does Mystery-Crawl nr. 2
  • T+48.4d: GoogleBot does Mystery-Crawl nr. 3 + every couple of days more
  • T+54.1d: GoogleBot crawls “/”
  • T+55.3d: GoogleBot crawls “/”
  • T+68.1d: GoogleBot crawls “/”
  • T+78.6d: GoogleBot crawls “/”

So what about our mystical crawling experience with the GoogleBot? It crawled a total of 50 URLs (the first time) from all levels and areas - they must have come from the Google Sitemap file we submitted. It seems only very little is missing for Google to go from “no interest” (as with Site A) to “a little interest - I’ll crawl a bit”. I wonder what the minimum requirement would be to go from crawling to listing in the index?

At any rate, “Mystery-Crawl nr. 2” covered 88 URLs and nr. 3 covered 82; after that GoogleBot came back every couple of days and picked up 602 more. Altogether it fetched 820 unique URLs from our site (out of a possible 911). It crawled in a more or less random order, i.e. neither in sitemap-file order nor in linked-page order. I wonder if all of that is sitting at Google awaiting “activation” or if it has already been dumped? The user agent was the normal “Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)” and the IP matched the other GoogleBot queries I have seen.
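
By the way, the user agent string alone is trivial to fake, so matching the IP is the part that matters. One common way to confirm that an address really belongs to GoogleBot is a reverse DNS lookup followed by a forward confirmation - a rough sketch (the IP below is just a placeholder, not an address from our logs, and this isn't necessarily how the match above was done):

    # Rough sketch: check whether an IP really belongs to GoogleBot by doing a
    # reverse DNS lookup and confirming the name resolves back to the same IP.
    # The IP below is a placeholder, not an address from our logs.
    import socket

    def is_googlebot(ip):
        try:
            host = socket.gethostbyaddr(ip)[0]              # reverse lookup
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            forward_ips = socket.gethostbyname_ex(host)[2]  # forward confirmation
        except socket.gaierror:
            return False
        return ip in forward_ips

    print(is_googlebot("66.249.66.1"))                      # placeholder IP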

What about the Yahoo! search engine? We’ve noticed that it uses the domain registration data to find new domains and likes to crawl them a bit. This site was no exception:

  • T+14.9d: Main URL is crawled
  • T+15.4d: 2 first level URLs are crawled, then 1 second level and 1 third level URL (perhaps some kind of website status check?)
  • T+15.6d: starts crawling from the level-2/3 URLs it found above
  • T+18.3d: Main URL is indexed, ranking nr. 4 in the results
  • T+19.9d: starts the normal crawl on level 1 … crawls slowly until the end of our tests
  • T+20.1d: 2 URLs are indexed
  • T+22.1d: 3 URLs are indexed
  • T+23.1d: 6 URLs are indexed
  • T+24.9d: 7 URLs are indexed
  • T+25.8d: 8 URLs are indexed
  • T+26.0d: 9 URLs are indexed
  • T+32.3d: 24 URLs are indexed, then 34, then 21, then back to 9 URLs (huh? similar to Site A?)
  • T+35.4d: 16 URLs are indexed
  • T+39.1d: 34 URLs are indexed
  • T+61.3d: 41 URLs are indexed
  • T+73.3d: alternating 41 URLs and 1 URL indexed (every couple of hours) (huh?)
  • T+78.5d: back to 41 URLs, more or less stable

Altogether, Yahoo crawled 65 unique URLs, some of them up to 5 times during our test period. It checked “/” 328 times and the (non-existent) robots.txt 416 times.

MSN? Did anyone tell MSN they were missing the party? :-) Oh well, maybe next year…

Our automatic visitor came by 3069 times (in our test period of slightly over 80 days) and visited 781 unique URLs. All accesses were OK from the server side and matched the URLs that should have been accessed, i.e. the server was available the whole time.
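
All of the counts on this page (hits on “/”, unique URLs per bot, total visits) come straight out of the server's access log. A quick sketch of how such numbers can be pulled from an Apache-style "combined" log - the filename and the report format are assumptions, not our actual scripts:

    # Quick sketch: per-user-agent hit counts and unique URLs from an
    # Apache "combined" access log. Filename and output format are assumptions.
    import re
    from collections import Counter

    LINE = re.compile(r'"(?:GET|HEAD|POST) (\S+)[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

    hits = Counter()
    urls_seen = {}

    with open("access.log") as log:
        for line in log:
            m = LINE.search(line)
            if not m:
                continue
            url, _status, agent = m.groups()
            hits[agent] += 1
            urls_seen.setdefault(agent, set()).add(url)

    for agent, count in hits.most_common(10):
        print("%6d hits, %4d unique URLs  %s" % (count, len(urls_seen[agent]), agent))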

Other interesting visitors:

  • T+10.9d: Visit by “SurveyBot/2.3+(Whois+Source)” checking robots.txt and “/” - from Compass Communications Inc. / whois.sc
  • T+17.6d: another visit by SurveyBot
  • T+24.7d: another visit by SurveyBot
  • T+29.3d: How could we live without Cyveillance? :-) (51 URLs)
  • T+31.8d: another visit by SurveyBot
  • T+35.2d: No site could be complete without totaldomaindata.com visiting with “Mozilla/5.0+(X11;+U;+Linux+i686;+en-US;+rv:1.7.5)+Gecko/20041107+Firefox/1.0” (checked “/”)
  • T+38.2d: another visit by SurveyBot
  • T+38.8d: 2 hits by Compass Communications Inc. (IP 64.246.165.236 / whois.sc) using “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.2;+SV1;+.NET+CLR+1.1.4322)” to access the “/” and the stylesheet. A real visitor or another marketing bot (http://forum.statcounter.com/vb/showthread.php?t=18495 (archive.org))?
  • T+45.0d: another visit by SurveyBot
  • T+52.1d: another visit by SurveyBot
  • T+59.1d: another visit by SurveyBot
  • T+61.2d: Another Cyveillance visit (51 URLs) - one day I’ll link a few MP3s just to see what happens…. this time it checked the stylesheet file as well (?)
  • T+67.0d: another visit by SurveyBot
  • T+73.3d: another visit by SurveyBot
  • T+77.3d: “Mozilla/4.0+(compatible;+Netcraft+Web+Server+Survey)” visits “/”

Special result codes:

  • The SurveyBot visits all returned 206 (partial content), meaning the bot requested only a part of each page via a ranged GET. Why? Why not use a HEAD request? Hmm… (see the request sketch after this list)
  • Most of the Yahoo/Slurp visits for “/” returned 304 (not modified), meaning the bot did a conditional GET - it only wanted to check whether we had something new. Hmm - changes on the page = interesting for Yahoo :-)
  • The only 404s we had to send (other than for the robots.txt) were for 2 Google “file-not-found” tests (at T+18.6d and T+49.4d).
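
To make those codes a bit more concrete, here is a minimal sketch of the three request styles involved: a plain HEAD request (headers only), a ranged GET (the kind that gets answered with 206), and a conditional GET with If-Modified-Since (answered with 304 if nothing changed). The URL is a placeholder, and the actual responses obviously depend on what the server supports:

    # Minimal sketch of the request styles behind the result codes above.
    # The URL is a placeholder; real responses depend on the server.
    import urllib.request
    from urllib.error import HTTPError

    URL = "http://www.example.com/"

    def status(method=None, headers=None):
        req = urllib.request.Request(URL, headers=headers or {}, method=method)
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.status
        except HTTPError as err:        # non-2xx codes such as 304 end up here
            return err.code

    print("HEAD            ->", status(method="HEAD"))                     # 200, body omitted
    print("Range GET       ->", status(headers={"Range": "bytes=0-499"}))  # 206 if Range is honored
    print("Conditional GET ->", status(
        headers={"If-Modified-Since": "Sat, 01 Jan 2005 00:00:00 GMT"}))   # 304 if unchanged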

Stylesheet file accesses:

  • T+29.3d: by Cyveillance
  • T+38.8d: by Compass Communications Inc / whois.sc
  • T+61.2d: by Cyveillance

So neither Google nor Yahoo (the only ones that indexed anything) accessed the external stylesheet file. Does that mean all those rumours about Google checking the stylesheet file for cloaking / hidden text are not true? We’ll have to check with a site that gets indexed better!

