People who are new to the web and want to start a website usually just put it online and hope that visitors come. Google Sitemaps gives the webmaster a way to tell Google about the site and to help Google find all of its pages.
I’ll just go through the other sites in the order we had them: site A now, next site B, then site D (we already covered site C) and finally site E. The further we get, the clearer the factors that make Google start indexing will become. Perhaps you can even guess correctly after this post?
Our site “A” is somewhat similar to a new webmaster’s site; it is just a normal website which uses Google Sitemaps. Since a new webmaster usually has no idea that he/she can get links, we also left this site alone with no links at all.
So, will Google get interested and index our site?
Sure! After just a few days Google had indexed the main URL, and 8 days after that it had positioned that URL in the search results for a query on our keyword (11 days in total). However, it stayed at that: no other URLs were indexed and the sitemap file was not used. Google only had eyes for the main URL. In fact, that URL also dropped out of Google a few times, only to pop back in after a few hours.
Until Oct. 1, 2005 Google checked our sitemap file (and the robots.txt) 4 times a day: twice in the morning and twice in the evening, each pair of requests back to back. From Oct. 1 to Oct. 6 it didn’t check at all. From Oct. 6 to Oct. 20 it checked at irregular intervals, on some days up to 10 times, with minutes or hours in between. From Oct. 20 on it checked only twice a day, once in the morning and once in the evening (the exact times varied), with no more double-checking. Google used the normal Googlebot user agent to access the files: “Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)”.
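If you want to make the same kind of count from your own server logs, here is a minimal Python sketch. It is not the tool used for this test. It assumes a simplified log layout of `date time uri user-agent` with `+` in place of spaces in the user agent (as in the strings quoted here), and it assumes the sitemap lives at `/sitemap.xml`. The sample lines are made up purely for illustration.

```python
from collections import Counter

# Hypothetical sample lines in an assumed layout: date time uri user-agent
SAMPLE_LOG = """\
2005-10-20 06:12:03 /sitemap.xml Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)
2005-10-20 18:40:11 /sitemap.xml Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)
2005-10-20 18:41:00 / Mozilla/5.0+(compatible;+Yahoo!+Slurp;+http://help.yahoo.com/help/us/ysearch/slurp)
2005-10-21 07:02:45 /sitemap.xml Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)
"""


def sitemap_fetches_per_day(log_text, path="/sitemap.xml", agent_marker="Googlebot"):
    """Count requests per day for `path` whose user agent contains `agent_marker`."""
    counts = Counter()
    for line in log_text.splitlines():
        # Skip blank lines and "#" header lines
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 4:
            continue  # not in the assumed four-field layout
        date, uri, agent = parts[0], parts[2], parts[3]
        if uri == path and agent_marker in agent:
            counts[date] += 1
    return dict(counts)


print(sitemap_fetches_per_day(SAMPLE_LOG))
```

Real log files usually carry more fields (IP, status code, and so on), so the field positions would need adjusting to match your server's configuration.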
A short timeline of Google’s crawler and index status:
- T+0 days: site went live
- T+2.8d: Google lists “/” without description; indexed but not crawled
- T+7.3d: GoogleBot crawls “/” with “Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)”: first and only time.
- T+11.2d: Google lists the crawled “/” with description in the search results
Now, let’s take a look at Yahoo …
- T+11.5d: Yahoo-Bot crawls “/” with “Mozilla/5.0+(compatible;+Yahoo!+Slurp;+http://help.yahoo.com/help/us/ysearch/slurp)”
- T+11.5d (a few minutes later): Yahoo-Bot crawls “/6-…”, then “/66-…” and “/664-…” (3 levels, just one URI per level)
- T+16.8d: Yahoo lists URL “/” with description
- T+18.9d: Yahoo lists 2 URLs
- T+19.3d: Yahoo-Bot crawls first level completely
- T+20.xd: ..later… Yahoo-Bot starts crawling second level, not very much though
- T+20.xd: ..also… Yahoo-Bot crawls “/” and “/robots.txt” daily
- T+21.4d: Yahoo lists 3 URLs
- T+22.9d: Yahoo lists 4 URLs
- T+24.xd: ..step by step.. Yahoo-Bot starts crawling third level (second level not completely crawled)
- T+26.0d: Yahoo lists 5 URLs
- T+32.3d: Yahoo lists 27 URLs, then 33, then 20, then 14 (huh?)
- T+32.6d: Yahoo goes back to 5 URLs
- T+35.4d: Yahoo lists 14 URLs, then 15
Yahoo crawls a total of 76 URLs (out of 911); better than the GoogleBot, but not amazing.
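To put these counts in proportion, a quick calculation helps. The sketch below only uses numbers from this post (911 URLs on the site, 1 URL indexed by Google, 76 crawled and at last count 15 listed by Yahoo). The function name `coverage` is just an illustrative helper.

```python
TOTAL_URLS = 911  # URLs on site A, as stated in the post


def coverage(count, total=TOTAL_URLS):
    """Return `count` as a percentage of `total`, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


for label, count in [("Google indexed", 1), ("Yahoo crawled", 76), ("Yahoo listed", 15)]:
    print(f"{label}: {count}/{TOTAL_URLS} = {coverage(count)}%")
```

So even the better of the two crawlers looked at less than a tenth of the site.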
MSN? Never came to take a look. Probably didn’t know they were missing anything, or even worse: knew they were missing nothing.
Other interesting activity:
- T+18.6d: Google 404-file probe (checked for /GOOGLE404probe7467d2b431e5224e.html using user-agent “Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)” from IP 22.214.171.124)
- T+35.2d: Visited by totaldomaindata.com with user-agent “Mozilla/5.0+(X11;+U;+Linux+i686;+en-US;+rv:1.7.5)+Gecko/20041107+Firefox/1.0” from IP 126.96.36.199
- T+36.3d: Visited by Cyveillance, 51 URLs crawled (using user-agent “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+XP)” from IP 188.8.131.52)
- T+49.4d: Google 404-file probe (checked for /GOOGLE404probef3779d1423d3d117.html using user-agent “Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)” from IP 184.108.40.206)
- T+65.4d: Visited by Cyveillance again, 50 URLs crawled (same IP / user-agent)
- T+79.8d: Visited by a human using Opera :-)
Other than Google, no search engine checked invalid URLs to see how we handle “file-not-found” errors.
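You can run a similar check against your own site: request a URL that cannot exist and see which status code comes back. The following is a rough Python sketch of that idea, not a description of what Google actually does internally. The prefix `/my404probe` and the helper names are made up for this example; the 16 random hex characters simply mirror the shape of the probe URLs seen in the log.

```python
import secrets
import urllib.error
import urllib.request


def probe_path(prefix="/my404probe"):
    """Build a random path that should not exist on the site (hypothetical prefix)."""
    # token_hex(8) yields 16 hex characters, like the probes seen in the log
    return f"{prefix}{secrets.token_hex(8)}.html"


def status_for(url, timeout=10):
    """Return the final HTTP status code for `url` (redirects are followed)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is what we are after
        return err.code


def handles_missing_pages(status):
    """True if the server reports a missing page with 404 or 410."""
    return status in (404, 410)
```

Usage would look something like `handles_missing_pages(status_for("http://www.example.com" + probe_path()))`, with `www.example.com` replaced by your own host. If this returns `False` because the server answers 200 (for instance after redirecting to the home page), a crawler has no way to tell real pages from invalid ones.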
So what did we learn from Site A?
Google Sitemaps alone does not make Google crawl and index a site.
There is some limiting factor that keeps Google from indexing sites, even if they use Google Sitemaps. Your guess is as good as mine (well, almost :-)). Perhaps it IS that legendary PayPal account I’ve heard about on the web? (probably not :-) – stay tuned)
A site that only has Google Sitemaps will get indexed better by Yahoo than by Google. Absurd, no? But I wouldn’t call 15 listed URLs (out of 911) a good result. We’ll need to find something better!