I know - everyone just wants the results of our small study - but let me give you all a little insight into how we tested: how we set up our sites and what we logged.
Domain and server setup
Our domains were 6 characters, the same for all test sites, followed by an identifier for the site. That could be something like “kwekuqA.com”, “kwekuqB.com”, “kwekuqC.com”, “kwekuqD.com”, etc. We tested several variants of the starting characters to make sure that we used one which didn’t have anything associated with it in Google (or at least not much). In this example, “kwekuq” returns 4 results in Google, so that would be something to go with. (Believe it or not, it’s hard to find a group of random characters that doesn’t have anything associated with it in Google!) After finding the domain names, we made sure that there was no history associated with them (using archive.org and whois).
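Generating the candidate labels themselves is trivial - the manual part was checking each one against Google and archive.org. A minimal sketch of the generation step (the helper name is ours, not the script we actually used; checking Google result counts automatically is left out on purpose, since scraping result pages is against the engines' terms):

```python
import random
import string

def candidate_labels(n=5, length=6, seed=None):
    """Generate random lowercase labels to check by hand for existing
    Google results and domain history (hypothetical helper)."""
    rng = random.Random(seed)
    return ["".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            for _ in range(n)]

for label in candidate_labels(seed=42):
    print(label + ".com")
```

Each printed candidate then gets a manual site:/plain query in the engines and a lookup on archive.org and whois before you commit to it.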
After we chose the domain names, we waited until we had the rest of the server set up before registering them (it’s not like anyone would have taken them before us :-) ). We did that to make sure that we started off with really fresh domains.
We chose an old Windows 2000 server which one of our ISPs let us use; all sites share an IP address (which is probably not ideal for a production site, but for the tests it should be OK).
The sites were created using English texts with expired copyrights (from Project Gutenberg - www.gutenberg.org). The texts were split into paragraphs, sentences and headers and randomly mixed. The resulting sites look similar to http://gsitecrawler.com/tests/meta-tag-date/ . There is a main page (index.php), which links to the first level. The first-level pages (named [number]-text.php, e.g. “5-some-text.php”) link to the index page and to the second level, the second-level pages (e.g. “59-some-text.php”) link to the third level and to the index, and the third-level pages (e.g. “592-some-text.php”) link only to the index page. This layout lets you follow crawling through the pages. The text following the level number was used as the title of the page.
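The naming scheme encodes the crawl depth directly in the file name: the number of digits is the level, and a child appends one digit to its parent's number (5 → 59 → 592). A small sketch of that scheme (our reading of it; the helper names are hypothetical):

```python
def slugify(text):
    # crude slug: lowercase words joined with dashes (illustrative only)
    return "-".join(text.lower().split())

def page_name(number, title):
    """Build a file name like '592-some-text.php'; the digit count of
    `number` is the crawl depth (1 digit = level 1, etc.)."""
    return f"{number}-{slugify(title)}.php"

def child_numbers(number):
    """Children of a page get one digit appended, so '59-...' pages
    hang off the '5-...' page."""
    return [int(f"{number}{d}") for d in range(10)]

print(page_name(5, "Some Text"))   # 5-some-text.php
print(59 in child_numbers(5))      # True
```

With this you can tell from a single log line how deep a crawler got, without reconstructing the whole link graph.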
In addition to the English content, we added our keyword to 10% of the sentences and headers. The pages were optimized a little bit for SEO: the page title was set in an H1, the other sub-titles in H2, and the meta keywords and description contained the titles. The pages used an external CSS file (this also let us spot real browsers, in case any real visitors happened along, and let us check whether any search engines really read external CSS files). There was no robots.txt and no favicon.ico file. Some sites included a Google Sitemap file (not called sitemap.xml, to check if any crawler would try that name by itself).
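The keyword salting is just a coin flip per sentence. A sketch of that step, assuming you already have the Gutenberg text split into sentences (the function name and signature are ours):

```python
import random

def sprinkle_keyword(sentences, keyword, rate=0.10, seed=None):
    """Append the test keyword to roughly `rate` of the sentences,
    leaving the rest untouched (sketch of our 10% salting)."""
    rng = random.Random(seed)
    return [s + " " + keyword if rng.random() < rate else s
            for s in sentences]
```

Over a few hundred sentences per page the actual keyword density hovers close to the target rate, which is all the test needed.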
We logged file access in two ways: a) with a small logger script included in the PHP files and b) with the webserver itself. Looking back, it would have been enough to just log with the webserver, but sometimes you want to be sure. On the server we logged everything we could: IP, user agent, referrer, etc. It’s best to log too much; you can always throw away what you don’t need.
As mentioned in the opening post, we logged the URLs that were indexed (using the site:-query) and the placement in the SERPs for the keyword we chose. We did this hourly for Google, MSN and Yahoo. We assumed that by querying the search engines we would not influence the indexing or placement, which turned out to be true (as far as we could tell). By following the results and the log files, we hoped to be able to figure out from where a search engine had found us, which way it used to crawl our site and which criteria were used to determine which URLs would be added to the index.
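The hourly check boils down to building a site:-query URL per engine and recording what comes back. A sketch of just the URL construction (the base URL here is a stand-in; actually fetching and scraping result pages may be against the engines' terms of service, so that part is deliberately omitted):

```python
import urllib.parse

def site_query_url(engine_base, domain):
    """Build the site:-query URL we polled hourly for each engine.
    `engine_base` is whatever search endpoint you use (placeholder)."""
    return engine_base + "?q=" + urllib.parse.quote("site:" + domain)

print(site_query_url("https://www.google.com/search", "kwekuqA.com"))
# https://www.google.com/search?q=site%3AkwekuqA.com
```

Run on a one-hour cron, with the timestamp, engine, indexed-URL list and keyword rank written to the same kind of log as the page hits, and you can line the SERP changes up against the crawler visits.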
So now that you know how we did it, go out and run some test sites yourself! It’s not that hard and lots of fun!
(I’m not sure, did I say I was going to post some results as well?)