Results from our Sitemaps study: Site D (old, probably outdated)

Warning: This page is pretty old and was imported from another website. Take any content here with a grain of salt.

Site D is a normal website with a little startup funding in the form of deep links from several external sites. It does not use Google Sitemaps or anything else special.

There were 4 inbound links, one to each of the 4 levels, pointing to different parts of the site. The site structure is strictly top-down: each parent links to about 10 children and to the main URL. There are no cross-links and no links from the children to the parent (just to the main URL).

Traditional theory has it that a site like this will get indexed properly pretty quickly, depending on the value of the inbound links. But how quickly?

Note - there are some really interesting details hidden in the analysis below, can you spot them?

I have to admit, we overdid it with the links. The main URL ended up with a Google PageRank of 6 (PR 6) at the next PageRank update. I guess you could say that something like that would get it indexed pretty quickly.

So how did it go?

Here’s a short timeline for crawling and indexing on Google:

  • T+0 days: Site went live
  • T+0.1d (3 hours later): GoogleBot crawls linked URLs
  • T+0.8d: the 4 linked URLs are fully indexed and listed on Google, ranking #1 for the keyword
  • T+2.1d: GoogleBot crawls linked URLs again
  • T+2.4d: GoogleBot crawls level 1
  • T+3.0d: GoogleBot starts crawling level 2
  • T+4.0d: GoogleBot crawls more of level 3
  • T+4.2d: The 4 URLs drop in rank to #2 for the keyword
  • T+6.1d: 13 URLs indexed (mostly from level 3)
  • T+7.1d: 15 URLs indexed (still mostly from level 3)
  • T+7.2d: GoogleBot crawls level 1 completely
  • T+8.5d: 21 URLs indexed (now more from level 1)
  • T+8.7d: GoogleBot crawls level 1 again
  • T+9.5d: 23 URLs indexed
  • T+11.2d: back to 13 URLs indexed (same as from T+6.1d – perhaps a backup of a datacenter?)
  • T+14.1d: 16 URLs indexed
  • T+14.2d: more of level 2 and 3 is crawled (serious crawling frenzy :-))
  • T+15.0d: Total 784 distinct URLs crawled up to now (17x main URL, 16x robots.txt)
  • T+16.0d: 784 URLs indexed, sporadic crawling
  • T+26.0d: serious crawling starts again
  • T+30.0d: Total 909 unique URLs crawled (37x main URL, 32x robots.txt)
  • T+33.5d: 903 URLs indexed
  • T+37.1d: Serious crawling starts, lasts about 9 hours each time
  • T+48.0d: Serious crawling starts
  • T+57.1d: Serious crawling starts
  • T+70.3d: Serious crawling starts
  • T+80.0d: Total 911 of 911 unique URLs have been crawled so far (102x main URL, 106x robots.txt, 7296 crawler accesses), still 903 URLs indexed

So – indexed in less than a day, almost fully indexed in a bit more than two weeks, not bad!
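
Totals like the ones in the timeline above (unique URLs, main URL hits, robots.txt hits, crawler accesses) can be tallied from a server access log. Here is a minimal sketch, assuming the log has already been reduced to (user agent, path) pairs. The sample records and the `tally` function are made up for illustration; this is not the script used for the study, and whether robots.txt counts as a "unique URL" is an assumption.

```python
from collections import defaultdict

# Hypothetical, simplified log records: (user_agent, requested_path).
# A real access log would need parsing first; these values are made up.
SAMPLE_RECORDS = [
    ("Googlebot/2.1", "/"),
    ("Googlebot/2.1", "/robots.txt"),
    ("Googlebot/2.1", "/level1/page1.htm"),
    ("Googlebot/2.1", "/"),
    ("msnbot/1.0", "/robots.txt"),
    ("msnbot/1.0", "/"),
]


def tally(records):
    """Return per-crawler totals: accesses, unique URLs, main URL and robots.txt hits."""
    stats = defaultdict(lambda: {"accesses": 0, "urls": set(), "main": 0, "robots": 0})
    for agent, path in records:
        entry = stats[agent]
        entry["accesses"] += 1
        entry["urls"].add(path)  # assumption: robots.txt counts as a unique URL here
        if path == "/":
            entry["main"] += 1
        elif path == "/robots.txt":
            entry["robots"] += 1
    # Add a plain count next to the set of URLs for easy reporting.
    return {agent: {**e, "unique_urls": len(e["urls"])} for agent, e in stats.items()}


if __name__ == "__main__":
    for agent, e in tally(SAMPLE_RECORDS).items():
        print(f"{agent}: {e['unique_urls']} unique URLs, {e['accesses']} accesses "
              f"({e['main']}x main URL, {e['robots']}x robots.txt)")
```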

We noticed that Yahoo! gets data as a domain registrar, but will it also count the links we laid down to the test site?

Timeline for crawling and indexing on Yahoo!

  • T+0 days: Site went live
  • T+1.5d: main URL + robots.txt crawled
  • T+1.7d: Linked URLs crawled (repeated 2-3x daily)
  • T+14.1d: 3 URLs are indexed (3 of the links), ranking #2 for the keyword
  • T+15.3d: 4 URLs indexed (links)
  • T+16.6d: Level 1 crawled (still crawling linked URLs 2-3x daily)
  • T+26.2d: sporadically crawling some other URLs, seemingly by accident, perhaps 1 per day - continued until the end
  • T+35.4d: 6 URLs indexed
  • T+38.5d: 17 URLs indexed
  • T+39.1d: 21 URLs indexed (goes up to 24, sometimes down to 18)
  • T+55.6d: 30 URLs indexed
  • T+73.4d: 41 URLs indexed, sometimes 4, mostly 41, up to 47
  • T+80.0d: Total 90 unique URLs crawled, (379x main URL, 1363x robots.txt, 3046 crawler accesses), 58 URLs indexed

Not very impressive. Perhaps Yahoo! didn’t value our inbound links as much as Google did?

How about MSN?

Timeline for crawling and indexing on MSN

  • T+0 days: Site went live
  • T+0.1d (3 hours): crawled main URL + robots.txt
  • T+1.7d: 2 URLs indexed (main URL + one of the links (?))
  • T+2.7d: full level-1 crawl
  • T+3.2d: another full level-1 crawl
  • T+3.5d: 5 URLs indexed
  • T+4.3d: crawls the linked URLs
  • T+4.4d: starts level-2 crawl
  • T+5.5d: 12 URLs indexed
  • T+5.5d: starts level-3 crawl (combined with level 2)
  • T+15.0d: Total 60 unique URLs crawled (6x main URL, 31x robots.txt)
  • T+27.6d: 14 URLs indexed, up to 15
  • T+30.0d: Total 95 unique URLs crawled (16x main URL, 56x robots.txt)
  • T+43.8d: 12 URLs indexed
  • T+44.5d: 16 URLs indexed
  • T+46.3d: 7 URLs indexed
  • T+48.6d: 4 URLs indexed (those linked)
  • T+50.0d: 3 URLs indexed
  • T+52.9d: 5 URLs indexed
  • T+54.4d: back to 3 URLs indexed
  • T+56.1d: 8 URLs indexed
  • T+72.2d: 9 URLs indexed
  • T+80.0d: Total 96 unique URLs crawled (40x main URL, 141x robots.txt, 949 crawler accesses)

Not much more impressive than Yahoo!…

Comparing crawling and indexing on Google, Yahoo! and MSN

Comparing those numbers for the 80 day test period:

Google:

  • Total URLs crawled: 911 (100%)
  • Crawler accesses: 7296 (91.2/day, 8.0/URL crawled)
  • Indexed: 903 (99.1%)

Yahoo!:

  • Total URLs crawled: 90 (9.9%)
  • Crawler accesses: 3046 (38.1/day, 33.8/URL crawled)
  • Indexed: 58 (6.4%)

MSN:

  • Total URLs crawled: 96 (10.5%)
  • Crawler accesses: 949 (11.9/day, 9.9/URL crawled)
  • Indexed: 9 (1.0%)

It’s hard to draw any conclusions about the Yahoo! and MSN crawling / indexing results. It could be anything from them not valuing the links as much as Google, to them waiting for further dynamic linking, or even them recognising the test site as such and not wanting to index it further (which I doubt, though). At any rate, Google is pretty efficient, averaging 8 crawler accesses per crawled URL, compared to Yahoo!’s 33.8 accesses per crawled URL…
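
For anyone who wants to check the arithmetic, here is a small sketch that recomputes the percentages and ratios in the comparison from the raw T+80.0d totals. The 80 days, the 911 URLs and the per-engine counts come from the figures above; the names `RESULTS` and `summarize` are just illustrative.

```python
TEST_DAYS = 80
TOTAL_URLS = 911  # all unique URLs on the test site (the number Google crawled)

# (unique URLs crawled, crawler accesses, URLs indexed), taken from the comparison above
RESULTS = {
    "Google": (911, 7296, 903),
    "Yahoo!": (90, 3046, 58),
    "MSN": (96, 949, 9),
}


def summarize(crawled, accesses, indexed):
    """Derive the comparison figures for one search engine, rounded to one decimal."""
    return {
        "crawled_pct": round(100 * crawled / TOTAL_URLS, 1),
        "per_day": round(accesses / TEST_DAYS, 1),
        "per_url": round(accesses / crawled, 1),  # accesses per crawled URL
        "indexed_pct": round(100 * indexed / TOTAL_URLS, 1),
    }


if __name__ == "__main__":
    for engine, totals in RESULTS.items():
        s = summarize(*totals)
        print(f"{engine}: crawled {s['crawled_pct']}%, {s['per_day']}/day, "
              f"{s['per_url']}/URL crawled, indexed {s['indexed_pct']}%")
```

Note that the percentages are relative to all 911 URLs on the site, while the per-URL ratio is relative to the URLs each engine actually crawled.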

Perhaps the reason the robots.txt was crawled so often is that the file did not exist on the server. Since the result code 404 means “not found”, it could be interpreted as “currently not found, try again later”.
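
If the missing file really was the cause (that is only a guess), one way to test it would be to serve a minimal "allow everything" robots.txt and watch whether the repeat requests drop. The sketch below only checks, with Python's standard `urllib.robotparser`, that such a file blocks nothing. The `allows` helper and the example.com URL are placeholders, not part of the test setup.

```python
from urllib.robotparser import RobotFileParser

# A minimal "allow everything" robots.txt: an empty Disallow line blocks nothing.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def allows(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder, not the test site's real address.
    print(allows(ALLOW_ALL, "Googlebot", "http://example.com/level1/page.htm"))
```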

Other interesting visitors

There were many other interesting visitors; since the site was linked, many crawlers found it. Since it was indexed on the major search engines, there might have been real “human” visitors as well.

  • T+0 - 10:
  • T+4.7d: Visit from “psbot/0.1+(+http://www.picsearch.com/bot.html)” from picsearch.com, main URL + robots.txt, no referrer
  • T+8.8d: Visit by “geniebot+(wgao@genieknows.com)” only to robots.txt
  • T+8.8d: Visit by “geniebot+wgao@genieknows.com” to one linked URL
  • T+10.1 - 20:
  • T+12.7d: Visit by “libwww-perl/5.76” from 212.227.22.224 to the main URL, kontakt.htm, impressum.htm, impress.htm, and the linked URLs
  • T+14.5d: Wherever would we be without Cyveillance watching over us? User agent “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+XP)”, IP 38.118.42.35, 2 URLs no stylesheet
  • T+15.7d: Visit by someone without a user-agent, from IP 208.219.207.2 (UUNet), no referrer, just the main URL, no stylesheet
  • T+16.4d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)” from an IP belonging to Google (64.233.173.86), 2 URLs, no stylesheet, referrer the site with the hidden links.
  • T+16.4d (2 seconds later): Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)” from IP 200.114.218.85 belonging to Cablevision S.A., Argentina - only to the stylesheet file. Hmm..
  • T+18.5d: Visit by “ConveraCrawler/0.9d+(+http://www.authoritativeweb.com/crawl)” to 139 unique URLs
  • T+18.5d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)” from an IP 80.58.10.172 known as “(numbers).proxycache.rima-tde.net”, accessing only the stylesheet, referred from the Google Cache, hmm..
  • T+20.1 - 30:
  • T+20.7d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)” from IP 85.206.79.170, 3 URLs + stylesheet, referred from a strange Google query
  • T+27.4d: Visit by “ia_archiver-web.archive.org” to robots.txt
  • T+27.4d: Visit from “Snappy/1.1+(+http://www.urltrends.com/+)” from IP 205.138.199.126 (urltrends.com), main URL, referrer “http://www.urltrends.com/”
  • T+26.9d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+{3DC6C7E9-A9D7-9DAE-6164-A9342F7432D8};+.NET+CLR+1.1.4322)” from “Charter Communication”, 1 URL + stylesheet, referred from a strange Google query
  • T+29.7d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.0;+Q312461)” from IP 12.175.0.44 (AT+T / Nameprotect.com), 1 URL, no stylesheet (perhaps a bot?)
  • T+30.1 - 40:
  • T+30.6d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)” from a cable modem user, 1 URL + stylesheet, no referrer
  • T+31.4d: Visit from “Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-GB;+rv:1.7.10)+Gecko/20050717+Firefox/1.0.6” from IP 72.14.192.32 from Google, 2 URLs no stylesheet, referred from a linked page
  • T+33.4d: Visitor without user-agent, IP 66.109.49.55 (Albany College of Pharmacy), main URL + 6 level-1 URLs, no stylesheet, no referrer
  • T+35.0d: Visit from TotalDomainData with “Mozilla/5.0+(X11;+U;+Linux+i686;+en-US;+rv:1.7.5)+Gecko/20041107+Firefox/1.0”, just main URL no stylesheet, no referrer
  • T+35.2d: Another Cyveillance crawl, “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+XP)”, IP 63.148.99.234, 50 URLs with stylesheet
  • T+35.7d: Visit by “Combine/2.0” to one of the linked URLs
  • T+39.5d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.1.4322)” from IP 62.190.227.81 (pipex.com), just the linked URLs, no stylesheet, no referrer
  • T+40.1 - 50:
  • T+40.2d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)” from an IP from Google (72.14.192.32), 2 linked URLs, no stylesheet, referrer from a link
  • T+41.4d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)” from a dialup user, 1 URL + stylesheet, weird query on bluewin.ch search
  • T+42.1d: Visit by “Mozilla/5.0+(Windows;+U;+Windows+NT+5.0;+en-US;+rv:1.7.12)+Gecko/20050915+Firefox/1.0.7” from an IP from Google (72.14.192.32), 2 URLs no stylesheet, referred by the links
  • T+42.1d (3 seconds later): Visit by “Mozilla/5.0+(Windows;+U;+Windows+NT+5.0;+en-US;+rv:1.7.12)+Gecko/20050915+Firefox/1.0.7” from 83.236.20.187 (a .de IP), probably a proxy, just the stylesheet
  • T+44.2d: Visit from “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.0;+.NET+CLR+1.1.4322)” from one of our “hidden” links, from an IP belonging to Google (72.14.192.32), 6 URLs, 2 URLs at each time, separated by about 25 minutes, no stylesheet access.
  • T+44.4d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322;+Alexa+Toolbar)” from an IP 72.14.194.28 from Google (with the Alexa Toolbar, strange :-)), 2 URLs, repeated 4 times within 30 minutes, referred from a link
  • T+46.2d: Visitor without user-agent, IP 213.239.197.107 (your-server.de), 1 URL, no stylesheet, no referrer
  • T+46.2d (45 seconds later): Visitor without user-agent, IP 194.152.185.6 (whoiswho.li) with a POST-request, 1 URL, no stylesheet, no referrer
  • T+46.2d (2 seconds later): Visitor without user-agent, IP 213.239.197.107 (your-server.de), 1 URL, no stylesheet, no referrer
  • T+46.6d: Visitor without user-agent, IP 194.152.185.6 (whoiswho.li) with a POST-request, 1 URL, no stylesheet, no referrer
  • T+48.2d: Visit from “Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+de-DE;+rv:1.7.6)+Gecko/20050321+Firefox/1.0.2” from a cable modem user, with a strange Google query, 1 URL + stylesheet + favicon.ico
  • T+48.2d: Visitor without user-agent, IP 61.221.199.204 (hn.hinet.net, Taiwan), 1 URL, no stylesheet, no referrer
  • T+48.2d (49 seconds later): Visitor without user-agent, IP 80.58.20.173 (proxycache.rima-tde.net) with a POST-request, 1 URL, no stylesheet, no referrer
  • T+48.4d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.0;+.NET+CLR+1.1.4322)” from a dialup-IP, 1 URL + stylesheet, referred from a strange query on Google.dk
  • T+50.1 - 60:
  • T+50.0d: Visitor without user-agent, IP 61.221.199.204 (hn.hinet.net, Taiwan), 1 URL, no stylesheet, no referrer
  • T+50.0d (1 minute, 13 seconds later): Visitor without user-agent, IP 213.239.197.107 (your-server.de) with a POST-request, 1 URL, no stylesheet, no referrer
  • T+50.4d: Visitor without user-agent, IP 168.143.113.52 (anonymizer.com), 1 URL, no stylesheet, no referrer
  • T+52.4d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.0;+Aulas+de+Informática+UAH)” from IP 193.146.11.197, 1 URL + stylesheet
  • T+52.5d: Visit from “psbot/0.1+(+http://www.picsearch.com/bot.html)” from picsearch.com, main URL + robots.txt, no referrer
  • T+54.5d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.0;+gnim;+.NET+CLR+1.1.4322;+.NET+CLR+2.0.50215)” from a phone provider (Cavalier Telephone), 1 URL + stylesheet
  • T+56.8d: Visit by ConveraCrawler again (195 unique URLs)
  • T+57.3d: Visit from “Snappy/1.1+(+http://www.urltrends.com/+)” from IP 205.138.199.126 (urltrends.com), main URL, referrer “http://www.urltrends.com/”
  • T+60.1 - 70:
  • T+62.3d: Visit from “Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.7.10)+Gecko/20050716+Firefox/1.0.6” from IP 64.233.173.86 from Google, 1 URL no stylesheet, referred from a linked page
  • T+62.8d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)”, cable modem user, 1 URL + stylesheet, referred from a strange Google query
  • T+64.6d: Another Cyveillance crawl, “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+XP)”, IP 63.148.99.234, 50 URLs with stylesheet
  • T+65.5d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)” from a .pt-IP, 1 URL + stylesheet, referred from “mysearch.myway.com/jsp/GGweb.jsp(…)”, perhaps a Google scraper?
  • T+65.9d: Visit from “SurveyBot/2.3+(Whois+Source)” from IP 64.246.165.150, main URL + robots.txt, referrer “http://www.whois.sc”
  • T+68.6d: Visitor without user-agent, IP 221.126.80.167 (hgc.com.hk, Hong Kong), the 4 linked URLs, no stylesheet, no referrer
  • T+70.1 - 80:
  • T+70.9d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.0;+.NET+CLR+1.1.4322)” from a roadrunner user, 1 URL + stylesheet
  • T+71.9d: Visit from “Twiceler+www.cuill.com/robots.html” from IP 64.62.136.197, only robots.txt, no referrer
  • T+72.1d: Visit by “DataSpearSpiderBot/0.2+(DataSpear+Spider+Bot;+http://dssb.dataspear.com/bot.html;+dssb@dataspear.com)” to the linked URLs
  • T+72.4d: Visit by “Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+98;+DT)” to a single URL + stylesheet, no referrer (IP is from a dial-up)
  • T+72.4d: Visit from “Shim+Crawler” from IP 157.82.246.104 (University of Tokyo), main URL + 2 linked URLs + robots.txt, no referrer
  • T+72.5d: Visit from “Mozilla/5.0+(X11;+U;+Linux+i686;+en-US;+rv:1.7.8)+Gecko/20050513+Fedora/1.0.4-1.3.1+Firefox/1.0.4” from IP 64.235.246.5 (“Oversee.net”?), 1 URL no stylesheet, no referrer
  • T+74.3d: Visit from “SurveyBot/2.3+(Whois+Source)” from IP 216.145.11.94, main URL + robots.txt, referrer “http://www.whois.sc”
  • T+74.4d: Visit from “Mozilla/5.0+(X11;+U;+Linux+i686;+en-US;+rv:1.6)+Gecko/20040113”, from bulldogdsl.com, 1 URL no stylesheet, no referrer
  • T+74.5d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)” from a DSL-user, 1 URL + stylesheet, referred from a strange Google query
  • T+74.9d: Another visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.0;+Q312461)” from IP 12.175.0.44 (AT+T / Nameprotect.com), 1 URL, no stylesheet (perhaps a bot?)
  • T+75.0d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.0)” to a single URL (without stylesheet access though), no referrer, IP from bulldogdsl.com
  • T+75.0d: Visit from “Opera/7.23+(Windows+98;+U)+[en]” on IP 84.9.48.121 (bulldogdsl.com), 1 URL no stylesheet, no referrer (is that really a dsl user? Hmm)
  • T+76.9d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+Q312461;+telus.net_v5.1.1)”, cable-modem user, 1 URL + stylesheet, referred from a strange Google query
  • T+77.0d: Visit by DataSpearSpiderBot again (same linked URLs)
  • T+77.0d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)” from a telecom in San Salvador (.sv), 1 URL + stylesheet (referred from a strange Google query)
  • T+77.7d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)”, cable modem user, 1 URL no stylesheet, no referrer…
  • T+78.2d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)”, dialup user, 1 URL no stylesheet, referred by a strange Google query
  • T+79.2d: Visit by “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322;+Avalon+6.0.4030;+%WAP+version%;+.NET+CLR+1.0.3705;+.NET+CLR+2.0.50215)” from a cable modem user, 1 URL + stylesheet, searching Google for links to our linking pages
  • T+79.6d: Visit from “Opera/7.23+(Windows+98;+U)+[en]” on IP 87.74.37.159 (bulldogdsl.com), 1 URL no stylesheet, no referrer
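
The notes in the list above keep coming back to the same rough signals: whether a visitor sent a user-agent, and whether it fetched the stylesheet along with the page. Here is a minimal sketch of that heuristic, assuming the log has been reduced to (IP, user agent, path) tuples. The sample data, the labels and the `classify` function are invented for illustration, and the heuristic is far from reliable (a browser with a cached stylesheet would look like a bot).

```python
from collections import defaultdict

# Hypothetical, simplified requests: (ip, user_agent, path). All values are made up
# (the IPs come from the documentation range 192.0.2.0/24).
SAMPLE_REQUESTS = [
    ("192.0.2.10", "Mozilla/4.0 (compatible; MSIE 6.0)", "/page.htm"),
    ("192.0.2.10", "Mozilla/4.0 (compatible; MSIE 6.0)", "/style.css"),
    ("192.0.2.20", "Mozilla/4.0 (compatible; MSIE 6.0)", "/page.htm"),
    ("192.0.2.30", "", "/"),
    ("192.0.2.40", "Mozilla/4.0 (compatible; MSIE 6.0)", "/style.css"),
]


def classify(requests):
    """Label each (ip, user_agent) visitor using a rough stylesheet heuristic."""
    paths = defaultdict(set)
    for ip, agent, path in requests:
        paths[(ip, agent)].add(path)

    labels = {}
    for visitor, seen in paths.items():
        fetched_css = any(p.endswith(".css") for p in seen)
        fetched_page = any(not p.endswith(".css") for p in seen)
        if not visitor[1]:
            labels[visitor] = "no user-agent (likely a script)"
        elif fetched_css and fetched_page:
            labels[visitor] = "page + stylesheet (likely a browser)"
        elif fetched_css:
            labels[visitor] = "stylesheet only (odd)"
        else:
            labels[visitor] = "no stylesheet (possibly a bot)"
    return labels


if __name__ == "__main__":
    for (ip, agent), label in classify(SAMPLE_REQUESTS).items():
        print(f"{ip} [{agent or 'no user-agent'}]: {label}")
```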

So Google appears to have checked our site (and/or the sites linking to it) several times using MSIE and Firefox user-agents, and to have found and followed both our hidden (not hidden in Google’s eyes in that case :-)) and visible links. I wonder if that had any impact on the rest of the crawling? Since the site did not use AdSense, I suppose it left Google scratching its head :-).

The “strange Google queries” are different queries for words that are “naturally” found on our test site, but which seem to be returned high enough in the search results for users to click on them. (There were no visitors from queries on MSN or Yahoo!.) That is how spammy sites work, I suppose.

I guess you could say getting links is worth it, at least for Google. We realise we overdid it with the links, but it did give us some interesting traffic… Hmm, who are all those unknown people / bots?
