Google and their Sitemap (old, probably outdated)

Warning: This page is pretty old and was imported from another website. Take any content here with a grain of salt.

The Google Sitemap system just turned one year old! The “Big Daddy” infrastructure and the “Crawl Caching Proxy” look like they were made to be a perfect match for Google Sitemaps (but it is more likely the other way around). In theory, Google Sitemaps can tell the Google crawlers more about a website, even without having to crawl it. The attributes can be used to help the proxy determine when actual accesses are necessary, keeping bandwidth use on all sides at a minimum.

It is hard to take a test site and determine the influence of Google Sitemaps. It is even harder to take a live site and measure change, especially with all the changes that have taken place in Google’s algorithm lately. A lot of the influence also seems to depend on the “value” of a website: on better websites (as seen by Google), the URLs listed in the sitemap file could be indexed even without a direct link elsewhere on the site. That would mean indexing the “hidden web”, the unlinked part of a website. What should we do for a test?

Well… What would Google do?

Let’s take their sitemap file and see how well it works for them. Their sitemap file, after all, is shown as an example to every webmaster who tries to submit a new sitemap file, so it’s probably perfect.

Well… not quite.

For this post I used yesterday’s sitemap file from http://www.google.com/sitemap.xml; I cached a copy of the file I used on my server.

To be fair, Vanessa Fox (Sitemaps product manager) said that the Google sitemap file is more of an example than a real, used sitemap file; but I’m sure I’m not the only one with it submitted as a sitemap file in my Sitemaps account to track changes. :-)

So let’s take a look:

Last modification date: 12. September 2005 01:42:31 UTC

Oops, that’s pretty old. Is Google not interested in updating the sitemap file? Or do they trust their crawlers to work properly? Or do they just have their site crawled differently?

The sitemap file contains 1668 URLs, specifying just the URL and the last modification date of the URL.

The newest URL (http://www.google.com/holidaylogos.html) is dated 2005-05-09T17:55:22+00:00 (5. September 2005, 17:55:22 UTC). So the last modification date for the sitemap file itself could be correct.
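The count and the newest date above can be pulled out of any sitemap file with a few lines of Python. This is only a minimal sketch: the sample XML, the example.com URLs, and the helper names `local_name` and `read_sitemap` are made up for illustration, and it ignores the XML namespace so that it works regardless of which sitemap schema version a file declares.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical sample data; the real file listed 1668 URLs.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-08-30T10:00:00+00:00</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2005-09-01T12:30:00+00:00</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/no-date.html</loc>
  </url>
</urlset>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_sitemap(xml_text):
    """Return a list of (loc, lastmod) tuples; lastmod is None if absent."""
    entries = []
    for url in ET.fromstring(xml_text):
        if local_name(url.tag) != "url":
            continue
        loc, lastmod = None, None
        for child in url:
            text = (child.text or "").strip()
            if local_name(child.tag) == "loc":
                loc = text
            elif local_name(child.tag) == "lastmod" and text:
                lastmod = datetime.fromisoformat(text)
                if lastmod.tzinfo is None:
                    # Assumption: treat dates without an offset as UTC.
                    lastmod = lastmod.replace(tzinfo=timezone.utc)
        entries.append((loc, lastmod))
    return entries


entries = read_sitemap(SAMPLE)
dated = [entry for entry in entries if entry[1] is not None]
newest = max(dated, key=lambda entry: entry[1])
print(len(entries), "URLs, newest:", newest[0], newest[1].isoformat())
```

Comparing the newest `lastmod` value with the modification date of the sitemap file itself is a quick way to see whether the file is still being maintained.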

In my opinion it is bad practice to provide a sitemap file with last modification dates if they are not updated. That would surely confuse the proxy and possibly keep it from accessing the newer versions of the URLs on the server. (Note from me: don’t do this with your sitemap file! If you do not want to update the sitemap file regularly, leave the last-modification date out of the sitemap file and let Google’s proxy access your server for that information.)
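If you generate your sitemap file with a script, leaving out the date is easy to do per URL. The following is a rough sketch, not a complete generator: `build_sitemap` and the example.com pages are hypothetical, and the namespace is the one from the sitemaps.org 0.9 protocol, which may differ from the schema your existing file declares.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemaps.org 0.9 namespace; adjust it to match your file.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, lastmod) pairs.

    lastmod is a W3C datetime string, or None when the real modification
    date is unknown. Unknown dates are omitted instead of being guessed.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    ("http://www.example.com/", "2006-06-01T08:00:00+00:00"),
    ("http://www.example.com/b.html", None),  # date unknown: no <lastmod>
])
print(sitemap_xml)
```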

Imagine Google re-activating the URLs from September from its cache because you told it that this was the last change you made. Many webmasters complained that Google seemingly reset the index to a version from last fall. Could this have something to do with that?

Of the 1668 URLs listed, 197 of them don’t exist on Google’s server and 6 of them are even banned by their robots.txt. I did not check the robots meta-tags for the existing ones. So roughly 12% of the sitemap file is invalid. It’s just an example of a sitemap file.
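A check like this can be scripted for your own sitemap file. The sketch below is not the tool I used; it is a simplified illustration that assumes the HTTP status codes were already collected (the `STATUS` dictionary, the robots.txt rules, and the example.com URLs are all made up) and uses Python's standard `urllib.robotparser` to test each URL against robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the example site.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

# Hypothetical status codes, e.g. gathered earlier with HEAD requests.
STATUS = {
    "http://www.example.com/": 200,
    "http://www.example.com/gone.html": 404,
    "http://www.example.com/private/report.html": 200,
}


def classify(urls, status_by_url, robots_txt, agent="*"):
    """Sort sitemap URLs into 'ok', 'missing' and 'blocked' buckets."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    result = {"ok": [], "missing": [], "blocked": []}
    for url in urls:
        if status_by_url.get(url) != 200:
            result["missing"].append(url)
        elif not parser.can_fetch(agent, url):
            result["blocked"].append(url)
        else:
            result["ok"].append(url)
    return result


result = classify(list(STATUS), STATUS, ROBOTS_TXT)
invalid = len(result["missing"]) + len(result["blocked"])
print(f"{invalid} of {len(STATUS)} sample URLs should not be in the sitemap")
```

Anything that ends up in the `missing` or `blocked` bucket is a candidate for removal from the sitemap file.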

It’s an example of what not to put in your sitemap file!

Let’s look at the existing URLs. Are they at least in the index? 655 out of the 1465 existing URLs are not in the index. That’s almost 45%. You would think a site with a PR10 homepage would at least have the URLs from their sitemap file in their own index?

Google’s site contains (as of yesterday) 43'019 indexable URLs or 16'884 crawlable URLs (with an adjusted robots.txt - see below [*] ). Their sitemap file contains 1465 existing URLs, so they included roughly 9% of their crawlable URLs (html pages) or 3% of their indexable URLs (html files and other documents like graphics, PDF files, etc.).

They listed under 9% of their crawlable URLs and had 55% of those indexed.
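For anyone who wants to check the arithmetic, here are the percentages recomputed from the counts quoted in this post. The variable names are mine; the numbers are the ones given above. Note that the crawlable share comes out at about 8.7% when not rounded down.

```python
# Counts as quoted in this post.
listed = 1668        # URLs in the sitemap file
missing = 197        # listed URLs that do not exist on the server
blocked = 6          # listed URLs banned by robots.txt
not_indexed = 655    # existing URLs that are not in the index
crawlable = 16_884   # crawlable URLs (html pages) found on the site
indexable = 43_019   # indexable URLs (html plus other documents)


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


existing = listed - missing - blocked
invalid_pct = pct(missing + blocked, listed)
not_indexed_pct = pct(not_indexed, existing)
indexed_pct = 100.0 - not_indexed_pct
crawl_coverage = pct(existing, crawlable)
index_coverage = pct(existing, indexable)

print(f"existing URLs in sitemap: {existing}")
print(f"invalid entries:          {invalid_pct:.1f}%")
print(f"not indexed:              {not_indexed_pct:.1f}%")
print(f"indexed:                  {indexed_pct:.1f}%")
print(f"share of crawlable URLs:  {crawl_coverage:.1f}%")
print(f"share of indexable URLs:  {index_coverage:.1f}%")
```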

Google must crawl differently: a site:-query (archive.org) for their domain returns “about 21'300'000” URLs, but only 952 of those are listed. For the rest they say: “In order to show you the most relevant results, we have omitted some entries very similar to the 952 already displayed.”

Something doesn’t seem to add up over at Google - “Bistromathics (archive.org)” at its best!

[*] Google’s current robots.txt allows indexing of URLs that do not make sense to index, e.g. duplicate content or user-created content. I added the following disallow-blocks when crawling their website (updated robots.txt): /accounts/, /alpha/, /coop/, /ig/, /movies/, /reviews? and /Top/.
