Google and the other search engines are constantly changing their software and infrastructure. Google appears to have switched to a new infrastructure at the beginning of 2006 and is currently working on fine-tuning the “settings”.
How does all of this show up in the test sites? How does it show up in a normal site? Does the number of indexed pages for a “spammy” site go down? Does the activity of the crawlers change? What are the other engines doing?
For tracking we have two types of trackers:
- We track the first indexed URLs in the major indexes (Google, MSN, Yahoo!) to get a feeling for which URLs were indexed within which timeframe. For our test sites we unfortunately limited this to 200 URLs (even though the sites have over 900 pages); for some other sites we limited it to 500 URLs. Since we are querying automatically, we do not want to run too many queries (yes, I know, we shouldn’t be querying automatically, but it is almost the only way to run a test series like this over this timeframe). We did not use the result count shown at the top of the results page because that number can be wrong at times.
- We track the individual accesses through server logs and through a simple tracking script that also records sessions and a bit more.
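The second kind of tracking can be sketched roughly like this — a minimal sketch, assuming Apache “combined” format access logs; the file name and the bot-identifying tokens are illustrative, not the exact script in use here:

```python
import re
from collections import defaultdict

# Substrings that identify the major crawlers in the User-Agent field
# (illustrative tokens; the real script tracks a bit more than this).
BOTS = {
    "Googlebot": "Googlebot",
    "AdSense": "Mediapartners-Google",
    "MSN": "msnbot",
    "Yahoo!": "Slurp",
}

# Matches the date (e.g. 15/Mar/2006) and the final quoted User-Agent string
# of an Apache "combined" format log line.
LINE_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):[^\]]*\].*"([^"]*)"$')

def count_bot_hits(path):
    """Return {(date, bot): hits} aggregated from the access log."""
    hits = defaultdict(int)
    with open(path) as log:
        for line in log:
            m = LINE_RE.search(line)
            if not m:
                continue
            date, user_agent = m.groups()
            for name, token in BOTS.items():
                if token in user_agent:
                    hits[(date, name)] += 1
    return hits
```

The per-day, per-bot totals this produces are exactly what the activity charts below are built from.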
For our test sites we have tracked all activity since the beginning (August 2005). The number of indexed pages has been more or less stable since the first month, despite our removing most of the links pointing to the test sites, partially in January and the rest in March. Because of the 200-URL limit, the charts for the number of indexed URLs quickly flatline at 200 and are therefore not of much use.
However, using the activity data we can chart the number of accesses per search engine bot per day. This gives us the following chart (for all five test sites combined):
The activity of the Googlebot varies from day to day, usually rising above 800 pages/day once a week. The drop in activity when the links were removed is very visible. The AdSense bot’s activity is also fairly stable, at around 50 pages/day, probably depending on the number of pages visited by normal users. MSN and Yahoo! activity is even lower than the AdSense bot’s, except for the peak around mid-March (slightly over 1,400 pages on that day).
Let’s now take a look at a very “spammy” type of site that was only indexed through well-placed and partially hidden links on other people’s sites (much like spammers would do), for example homepage links in guestbooks and forums (of course first checking that the pages the links would appear on are visible to Google and that the links are not tagged with “rel=nofollow”). The information content of the site is very low; no users are going to visit the site on purpose again, much less link to it themselves. The links were placed by hand over a timeframe of about two weeks and have not been updated since. The site has been static since the beginning, and shows AdSense on some of the pages (it pays for hosting but not for any of the work, ha ha).
The site has slightly over 10'000 pages (no duplicate content); with the number of links placed, however, not many pages are going to be indexed. The site currently shows 761 pages indexed in Google (at least according to the count at the top of the results page). Here is the chart of the number of URLs indexed in the major engines:
You can see that MSN picked up on the links first, but Google quickly overtook its numbers. Yahoo! was the slowest of the bunch. It is interesting to see that the numbers for Yahoo! and MSN also had a few sharp breaks in them, going up or down. It is also interesting that Google’s numbers sometimes varied from day to day, perhaps showing that the script was hitting different datacenters.
Let’s now take a look at the crawler activity over time:
The high peak for “Other” right at the beginning was when I crawled the full site with my GSiteCrawler to generate a sitemap file for it and to make sure that all pages were uploaded properly. It should be noted that the sitemap file had a tracker script attached which was broken, so the sitemap file was not actually in use until mid-March 2006.
Looking at the chart, the only activity that is really visible is the Googlebot’s, crawling between 500 and 3,000 pages per day (the black line is a 5th-order polynomial fitted to Google’s numbers).
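A trend line like that is just a least-squares polynomial fit over the daily counts; a minimal sketch with made-up numbers (the real chart of course uses the logged data):

```python
import numpy as np

# Made-up daily Googlebot crawl counts, roughly in the 500-3000 range;
# the real chart uses the per-day totals from the server logs.
days = np.arange(60)
rng = np.random.default_rng(42)
pages = 1500 + 800 * np.sin(days / 9.0) + rng.normal(0, 200, days.size)

# Fit a 5th-order polynomial and evaluate it to get the smooth trend line.
coeffs = np.polyfit(days, pages, deg=5)
trend = np.polyval(coeffs, days)
```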
There are two periods that have a “surprising” lack of activity:
- From end of February to mid March 2006.
- From the second week of April 2006 onward.
The peak in activity around mid-March might be due to the sitemap file, but of course it is not possible to know for sure. The drop around the second week of April is too short to analyze properly.
So what can we say from these charts? Looking at the spammy site, the Googlebot’s activity does not correlate with the number of URLs indexed - the average of roughly 1,000 URLs crawled per day is nowhere near the 500-700 URLs in the index.
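To put “does not correlate” in concrete terms: with the two daily series at hand you can compute a correlation coefficient directly. A sketch with made-up values in the ranges the charts show (not the real data):

```python
import numpy as np

# Made-up daily values in the range the charts show (not the real data):
# crawl activity swings widely while the indexed count barely moves.
crawled_per_day = np.array([900, 1100, 1400, 950, 1250, 1000, 800, 1150])
urls_indexed = np.array([610, 605, 615, 600, 620, 610, 605, 615])

# Pearson correlation between daily crawl volume and index size.
r = np.corrcoef(crawled_per_day, urls_indexed)[0, 1]

# A value near 0 means crawl volume tells you little about how much
# of the site actually ends up indexed.
```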
Sites that were made for Google (like these test sites) show very little change in the number of URLs indexed in Google over time, even though there were some very large changes in the software and many people have complained about problems. The number of URLs indexed in Yahoo! and MSN also varies over time, but since it is much lower than Google’s number, it is not really worth fighting for.
So where does this bring us? It seems that with proper linking you can achieve the most on Google, even with their periodic system changes.
Comments / questions
There's currently no commenting functionality here. If you'd like to comment, please use Twitter and @me there. Thanks!