Posted on 21 May 2008 at 0:27 UTC, filed under Tricks, disclaimer

Here’s something from my mailbox – someone wanted to know how he could crawl his site and confirm that all of his pages really have the Google Analytics tracking code on them. WordPress users have it easy: there are plugins that handle it automatically. Sometimes it’s worth asking nicely :) – let me show you how I did it. As a bonus, I’ll also show how you can check the AdSense ID on your pages, if you’re worried that you copied and pasted it incorrectly.

This is pretty much cross-platform, but as a Windows user you’ll have to grab and install two files first:

  • wget – a tool to download copies of web pages
  • UnxUtils – a collection of popular Unix/Linux tools for the hacker in you

Extract the ZIP files, copy the contents somewhere you can find them, and make sure that the appropriate folders are in your “path” (the files you’ll need for UnxUtils are in “…\usr\local\wbin”). We’ll need to access these tools through the command line. I have a feeling I may need to elaborate on that for Windows users :) Let me know if that’s the case.

First, we’ll mirror our site on our local machine (this assumes that your site is crawlable; if it isn’t, then fix it first :D ):

  1. Open a command box or terminal window (on Windows, hit Start / Run … and enter “cmd”)
  2. Go to or create a temporary folder
  3. Run the following command to mirror your site:

    wget --mirror --accept=html,htm,php,asp,aspx http://www.example.com/

    This command mirrors pages with .html, .htm, .php, .asp and .aspx extensions on http://www.example.com/ (a placeholder; use your own site’s address). It’ll create a folder for the domain and put all the files in it. Dynamic URLs will get adjusted so that they can be used as file names.

  4. Wait … until it’s all downloaded … if it feels endless, you might have endless URLs, perhaps an infinite calendar script or something similar? It’s worth fixing!
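
If the download does seem endless, one rough way to find the culprit is to count how many files each folder of the mirror has collected so far. This is only a sketch: `files_per_dir` is a made-up helper name, and it assumes the standard `find`, `sed`, `sort`, `uniq` and `head` tools are on your path.

```shell
# Hypothetical helper: show the ten folders of a mirror that hold the most files.
files_per_dir() {
  # $1 = mirror folder (defaults to the current directory)
  find "${1:-.}" -type f \
    | sed 's|/[^/]*$||' \
    | sort | uniq -c | sort -rn | head -n 10
  # sed strips the file name, leaving the folder; uniq -c counts each folder
}
```

Run it as `files_per_dir .` from the temporary folder while wget is still working. A folder whose count keeps climbing (a calendar script, say) is a good candidate for a closer look.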

Alrighty, now that we have a copy of your site, let’s check things out.

Finding pages without Analytics

We can find pages without the Analytics tracking code by listing all pages which do not have certain content in them:

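grep -r -L "google-analytics.com" *.*

This command goes through all subfolders (the “-r” option) and lists the files that do not contain a match (“-L”) for “google-analytics.com”. That string is only a stand-in for something that appears in your tracking code; any text unique to your snippet, such as your account ID, works the same way. That could be extended to just about anything :).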
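
If you run this check often, it may be handy to wrap it in a small shell function. The sketch below is one way to do that, not the only one: `pages_missing` is a made-up name, and the search string you pass in (for instance “google-analytics.com”) is your own assumption about what your tracking snippet contains.

```shell
# Hypothetical helper: list the files under a folder that lack a given string.
pages_missing() {
  # $1 = literal text to look for, $2 = mirror folder (default: current dir)
  # -r searches subfolders, -L lists files WITHOUT a match,
  # -F treats the text literally so characters like "." are not wildcards
  grep -r -L -F -- "$1" "${2:-.}" | sort
}
```

For example, `pages_missing "google-analytics.com" .` run from the temporary folder lists every mirrored page that never mentions that host name.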

How about pages that don’t have a “description” meta tag?

grep -r -L "meta name=.description" *.*

The “.” (period) matches any character — in this case, it is used to match the ” (double-quote).
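
To see the period at work, here is a small self-contained check. The three sample tags are made up for illustration; the pattern is the one from the command above.

```shell
# Three made-up tags: double-quoted, single-quoted, and a non-description tag.
sample_tags() {
  printf '%s\n' '<meta name="description" content="Double quotes">'
  printf '%s\n' "<meta name='description' content='Single quotes'>"
  printf '%s\n' '<meta name="keywords" content="Not a description">'
}

# Count the matching lines; the "." stands in for either kind of quote.
count_description_tags() {
  sample_tags | grep -c "meta name=.description"
}
```

Calling `count_description_tags` counts the two description tags and skips the keywords tag. The pattern is still case-sensitive, so a page that writes `NAME=` or `Description` would need grep’s “-i” option.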

Finding pages with AdSense (and the ID used)

Finding pages that contain a certain text is even easier:

grep -r "google_ad_client" *.*

Note that all we did was drop the “-L” (and change the text, obviously). It will show the lines that match this pattern in all of your pages, which includes the AdSense ID.
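
If you only want to know which IDs are in use and how often, the matching lines can be boiled down to a short tally. This is a sketch under a few assumptions: `adsense_ids` is a made-up name, the IDs are assumed to look like “pub-” followed by digits (optionally with a “ca-” prefix), and your grep needs to support the “-o” and “-h” options, which older builds may lack.

```shell
# Hypothetical helper: tally the AdSense publisher IDs found in a mirror.
adsense_ids() {
  # $1 = mirror folder (default: current dir)
  # First grep: keep only lines that set google_ad_client (-h hides file names).
  # Second grep: print just the ID part (-o), assuming the pub-<digits> format.
  grep -r -h "google_ad_client" "${1:-.}" \
    | grep -o -E '(ca-)?pub-[0-9]+' \
    | sort | uniq -c | sort -rn
}
```

Each output line is a count followed by an ID. Seeing more than one ID, or an ID you don’t recognize, points to a copy/paste slip somewhere.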

Similar to the earlier check for missing “description” meta tags, assuming you have the contents of that tag all in one line, you can easily find all of these meta tags with:

grep -r "meta name=.description" *.*

What would you like to search for today?

There are 14 comments to this post.
  1. What’s a “command line”?

  2. Neat! Thanks for posting! I think I’ll just go the non-Windows route for this one. For some reason, I’ve got it in my head that tools like cygwin aren’t good for Windows. I base that comment on no factual information or experience using them, it’s just one of those mental hurdles I guess. Although, wget is like lightning compared to FTP clients (it seemed like that at least just now when I tried it) so maybe I’ll give it a try on Windows.

    P.s. Post more, John – please. :)

  3. Great Post, John. Thank you.

    I tried the download link you have for wget and it returns a blank page. Do you have an alternate link?

  4. narendra.s.v (5 July 2008 at 1:48 pm):

    Of course I do use Analytics, but I never thought about this ;)

  5. Hi John,

    Thanks for your advice on the webmaster forums a couple of days ago.

    Do you ever update this blog? :)


  6. Best John mu thanks man

  7. Very well explained. Even if you had never heard of command line previously, the above should make you feel like a pro when you’re done.

  8. Download all the website :O .
    Maybe it’s too much no ? :)

    If the server is slow, you may need to restrict the bandwidth and the timeout:
    --wait=4s or --random-wait


  9. Excellent step by step guide, these are great tips for anyone with large websites. Of course this isn’t a problem for anyone using Joomla or WordPress. Some of my websites are just HTML, and having recently had to go through and change some code on each page, I wish I had found your tips earlier! As an idea for another tip, how about a guide on how to include AdSense or Analytics code in an external file?

  10. Hi John, thanks for the Ubuntu / Linux SEO tips! Wish there were more bloggers / developers writing about SEO with Ubuntu. Best, Richard Baxter.

  11. Hi John, great tool.
    Is there something similar for the conversion tracking code?
    I have several AdWords accounts and find comparing the leads that come through to what is registered on AdWords never seems to add up.
    Any tips would be most appreciated.

  12. Ulana Illiano (21 June 2009 at 3:28 am):

    This is great information, John. Thanks!

  13. Thank you – this has been useful – I will install these two files and give this a go.

