Measuring the long tail of search (old, probably outdated)

Warning: This page is pretty old and was imported from another website. Take any content here with a grain of salt.

While playing with the AOL search data I came to look at the specific queries that were used to access any particular URL. This information is similar to what you have when you look at your referrer statistics: which queries were used to gain access to your site?

Some background to the AOL search data (archive.org):

The AOL database simplifies the queries a bit: in general it contains only the words used in the query, without the formatting, the operators (e.g. “+”, “-”) or other user-specific data (such as geolocation and language based on the user’s IP address). It also reduces most URLs to the main domain name, removing the path and file name (except for ftp and https URLs). This makes for an interesting mixture. For example, “www.google.com” contains not only queries for the search engine but also for the Google directory at places like “www.google.com/Top/Games/Gambling/Lotteries/Regional/United_States/”. In addition to the queries and domains, the database lists the rank the clicked domain had in the search results (which sometimes makes it fairly easy to track down the real URL). One potential problem with the data, however, is that a user could have clicked the same domain name multiple times in the same search run. This could skew the data, but the influence seems to be minimal.

Let’s look at the frequency of the queries used to access a site. On my AOL database site (archive.org) I added an “info (archive.org)” function to list the queries, their counts (when used to click to the domain) as well as the words in general (it lists the first 2000 queries and words). For the queries, it also lists the average rank. It’s a bit slow in getting the data from the database the first time, so be patient!
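A listing like this can be sketched in a few lines of Python. This is only a rough illustration, not the code behind the site: the function name summarize_queries, the (query, rank, domain) record layout and the sample rows are all made up for the example.

```python
from collections import defaultdict

def summarize_queries(clicks, domain):
    """Return (query, click_count, average_rank) rows for one domain.

    `clicks` is an iterable of (query, rank, clicked_domain) tuples.
    This record layout is an assumption made for the example.
    """
    counts = defaultdict(int)
    rank_sums = defaultdict(int)
    for query, rank, clicked_domain in clicks:
        if clicked_domain == domain:
            counts[query] += 1
            rank_sums[query] += rank
    rows = [(q, n, rank_sums[q] / n) for q, n in counts.items()]
    # Most-clicked queries first; ties are broken alphabetically.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

# Invented sample rows, only to show the shape of the result.
sample = [
    ("google", 1, "www.google.com"),
    ("google", 1, "www.google.com"),
    ("google search", 2, "www.google.com"),
    ("maps", 3, "www.mapquest.com"),
]
print(summarize_queries(sample, "www.google.com"))
# → [('google', 2, 1.0), ('google search', 1, 2.0)]
```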

When charting the queries by their count, you will usually see a few queries with high counts followed by a long tail of queries with low counts.

Where does the long tail start? How can we recognise if a site is using the long tail to attract many unique and different visitors?

If you look at the simple graphic on top of the listing on a sample page, you’ll see the count of the first 20 unique queries as well as the total of the rest of the queries. Some sites are “top heavy (archive.org)”, some sites are “tail heavy (archive.org)” (with the bulk of the queries falling into the “rest” bin).
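The split into the first 20 queries and a “rest” bin is easy to reproduce. The sketch below assumes you already have a list of click counts, one per unique query. The names head_and_rest and rest_share are invented here, and the share of the “rest” bin is just one rough way to look at the split, not a measure used on the site.

```python
def head_and_rest(counts, head_size=20):
    """Split per-query counts into the top `head_size` values and the sum of the rest."""
    ordered = sorted(counts, reverse=True)
    head = ordered[:head_size]
    rest = sum(ordered[head_size:])
    return head, rest

def rest_share(counts, head_size=20):
    """Fraction of all clicks that fall into the "rest" bin (0.0 if there are no clicks)."""
    head, rest = head_and_rest(counts, head_size)
    total = sum(head) + rest
    return rest / total if total else 0.0
```

A share close to 1 would correspond to a “tail heavy” site, a share close to 0 to a “top heavy” one.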

To put a number on the tail-heaviness I tried several things that didn’t work out. (Perhaps I’ll add some of the approaches that did not satisfy me to the website later on.) The one thing that I did find interesting was approximating the distribution with a function of the form “y = A / x^B”, where x is the position of a query in the list (1 for the most frequent query) and y is its count. Functions like this (power laws) are used all over to describe natural distributions, e.g. Zipf’s Law (archive.org). Without going into details, we can usually ignore the number “A” and concentrate on “B”. “B” describes the curve of the function: is it a tight bend (high values for B) or flatter (small values for B)?

When looking at the query distribution for a site, large values for B show us that most of the visitors come from very few queries – small values for B show that the visitors come from a large variety of queries. There we have it: a simple number for the tail-heaviness of a site!
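One common way to estimate “B” automatically is a straight-line fit in log-log space: taking logarithms of y = A / x^B gives log y = log A - B * log x, so B is the negated slope. The sketch below is a swapped-in technique and may give slightly different values than the ones listed on this page, which were not necessarily computed this way. The function name fit_power_law is made up.

```python
import math

def fit_power_law(counts):
    """Estimate A and B in y = A / x**B by least squares on log-log values.

    `counts` are click counts per query; x is the rank (1 = most frequent).
    Note: this is a log-log regression, an assumption of this example,
    not necessarily how the values in the article were obtained.
    """
    ordered = sorted((c for c in counts if c > 0), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two non-zero counts")
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(c) for c in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log A - B * log x, so A = exp(intercept) and B = -slope.
    return math.exp(intercept), -slope
```

Because the fit works on logarithms, the many low-count queries in the tail carry as much weight as the few big ones at the top, which is worth keeping in mind when comparing sites.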

Which domain is tail-heavier: wikipedia or amazon? myspace or craigslist? hotmail or mail.yahoo.com?

When looking at the top 20 domains in the AOL data, I found the following values (from tail-heavy to top-heavy):
www.geocities.com (0.86)
www.imdb.com (0.86)
www.bizrate.com (0.98)
profile.myspace.com (0.99)
www.nextag.com (1.00)
www.tripadvisor.com (1.00)
cgi.ebay.com (1.05)
en.wikipedia.com (1.07)
www.amazon.com (1.36)
mail.yahoo.com (1.53)
www.ask.com (1.61)
www.bankofamerica.com (1.65)
www.craigslist.com (1.71)
www.myspace.com (1.73)
www.hotmail.com (1.83)
www.msn.com (1.87)
www.mapquest.com (1.91)
www.yahoo.com (2.17)
www.google.com (2.33)
www.ebay.com (3.07)

One thing to keep in mind, however, is that this value does not only describe the tail: it also describes the “brand-awareness” of the site. The better known a site is, the more likely it will be searched for by its brand name; the less known a site is, the more likely it will be found with a “long-tail” search. There are of course sites that have a high brand-awareness but also target the long tail. For example, www.google.com (archive.org) had hits for over 9300 queries but still had the majority of the users searching for … “Google” (over 250'000 times).

It’s possible to use this same method with your own log files to determine your own long-tail heaviness. With a bit of luck you can even do that in Excel (depending on your log files). I put together a simple Excel sheet (archive.org) to let you play with it - you’ll need to approximate the values for A and B yourself (“B” is “C” in the sheet, to confuse everyone). Just fill in the first 500 of your queries and their count, then play with the values until the “error” value is minimised.
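If you would rather not tune the values by hand, the same “minimise the error” idea can be sketched in Python. Everything here is an assumption made for the example: the error is taken to be the sum of squared differences between the real counts and A / x^B (the sheet may define it differently), B is tried in small steps, and for each B the best-fitting A is computed directly. The function names are made up.

```python
def squared_error(counts, a, b):
    """Sum of squared differences between the counts and a / rank**b."""
    return sum((c - a / rank ** b) ** 2
               for rank, c in enumerate(counts, start=1))

def best_a_for(counts, b):
    """Value of A that minimises the squared error for a fixed B."""
    num = sum(c / rank ** b for rank, c in enumerate(counts, start=1))
    den = sum(1 / rank ** (2 * b) for rank in range(1, len(counts) + 1))
    return num / den

def fit_by_search(counts, b_max=4.0, step=0.01):
    """Try B = 0, step, 2*step, ... up to b_max and keep the best fit.

    `counts` must already be sorted from the most to the least frequent
    query. Returns (A, B, error). The error definition is an assumption.
    """
    best = None
    for i in range(int(round(b_max / step)) + 1):
        b = i * step
        a = best_a_for(counts, b)
        err = squared_error(counts, a, b)
        if best is None or err < best[2]:
            best = (a, b, err)
    return best
```

With plain squared differences the few very frequent queries at the top dominate the error, so the result describes the head of the curve more than the tail.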

What other ways could you think of to put a number on the long tail? … and where does the long tail start?
