Here’s something from my mailbox – someone wanted to know how he could crawl his site and confirm that all of his pages really have the Google Analytics tracking code on them. WordPress users have it easy: there are plugins that handle it automatically. For everyone else, sometimes it’s worth asking nicely – let me show you how I did it. As a bonus, I’ll also show how you can check the AdSense ID on your pages, if you’re worried that you copied and pasted it incorrectly.
This is pretty much cross-platform, but as a Windows user you’ll have to grab and install two packages first:
- wget – a tool to download copies of web pages
- UnxUtils – a collection of popular Unix/Linux tools for the hacker in you
Extract the ZIP files, copy the contents somewhere you can find them, and make sure that the appropriate folders are in your “path” (the files you’ll need for UnxUtils are in “…\usr\local\wbin”). We’ll need to access these tools through the command line. I have a feeling I may need to elaborate on that for Windows users; let me know if that’s the case.
First, we’ll mirror our site on our local machine (this assumes that your site is crawlable; if it isn’t, then fix it first):
- Open a command box or terminal window (on Windows, hit Start / Run … and enter “cmd”)
- Go to or create a temporary folder
- Run the following command to mirror your site:
wget --mirror --accept=html,htm,php,asp,aspx http://domain.com/
This command mirrors pages with .html, .htm, .php, .asp and .aspx extensions on http://domain.com/. It’ll create a folder for the domain and put all the files in it. Dynamic URLs will get adjusted so that they can be used as file names.
- Wait … until it’s all downloaded … if it feels endless, you might have endless URLs, perhaps an infinite calendar script or something similar? It’s worth fixing!
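If you want a quick sanity check on how much was downloaded, counting the files is one way to spot a runaway crawl. This is only a sketch: it builds a small throwaway folder as a stand-in for the one wget created, and the file names in it are made up for illustration.

```shell
# Stand-in for the folder wget created (hypothetical sample files).
site=$(mktemp -d)
mkdir -p "$site/blog"
: > "$site/index.html"
: > "$site/blog/post.html"
: > "$site/blog/archive.html"

# Count every regular file below the folder; tr strips the padding
# that some versions of wc add in front of the number.
count=$(find "$site" -type f | wc -l | tr -d ' ')
echo "pages downloaded: $count"
```

On a real mirror you would point find at the domain folder instead. A count far higher than the number of pages you expect is a hint that something is generating endless URLs.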
Alrighty, now that we have a copy of your site, let’s check things out.
Finding pages without Analytics
We can find pages without the Analytics tracking code by listing all pages which do not have certain content in them:
grep -r -L "google-analytics.com" *.*
This command goes through all subfolders (the “-r” option) and lists the files that do not contain a match (“-L”) for “google-analytics.com”. That could be extended to just about anything :).
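To see what the output looks like without touching your real mirror, here is a minimal sketch on a throwaway folder. The folder and file names are invented for the demo, and it passes “.” (the current folder) to grep as the starting point.

```shell
# Build a tiny sample "mirror" with two pages (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/example.com"
printf '<script src="http://www.google-analytics.com/ga.js"></script>\n' > "$demo/example.com/tagged.html"
printf '<p>No tracking code here</p>\n' > "$demo/example.com/untagged.html"

# -r walks through all subfolders, -L lists only files WITHOUT a match.
missing=$(cd "$demo" && grep -r -L "google-analytics.com" .)
echo "$missing"
```

Only the page without the tracking code should be listed; an empty result means every page contains the text.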
How about pages that don’t have a “description” meta tag?
grep -r -L "meta name=.description" *.*
The “.” (period) matches any single character; in this case, it is used to match the " (double quote).
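A side effect of using the period is that the same pattern also matches tags written with single quotes. The sketch below tries that on three made-up sample files in a throwaway folder; the file names and tag contents are assumptions for the demo.

```shell
# Three sample pages: double-quoted tag, single-quoted tag, no tag at all.
demo=$(mktemp -d)
printf '<meta name="description" content="A">\n' > "$demo/double.html"
printf "<meta name='description' content='B'>\n" > "$demo/single.html"
printf '<title>No description here</title>\n' > "$demo/none.html"

# The "." stands in for either kind of quote, so only none.html is listed.
without=$(cd "$demo" && grep -r -L "meta name=.description" .)
echo "$without"
```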
Finding pages with AdSense (and the ID used)
Finding pages that contain a certain text is even easier:
grep -r "google_ad_client" *.*
Note that all we did was drop the “-L” (and change the text, obviously). It will show the lines that match this pattern in all of your pages, which includes the AdSense ID.
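If you have many pages, reading every matching line gets tedious. One possible shortcut is to pull out just the ID and tally how often each one appears. This sketch assumes the ID looks like “pub-” followed by digits, and the IDs and file names below are made up. It uses sed to cut out the ID so that it doesn’t depend on newer grep options.

```shell
# Sample pages: two share one ID, the third has a different (mistyped?) one.
demo=$(mktemp -d)
printf 'google_ad_client = "pub-1111111111111111";\n' > "$demo/a.html"
printf 'google_ad_client = "pub-1111111111111111";\n' > "$demo/b.html"
printf 'google_ad_client = "pub-2222222222222222";\n' > "$demo/c.html"

# -h hides the file names; sed keeps only the "pub-<digits>" part of each line;
# sort + uniq -c then counts how many pages use each ID.
ids=$(cd "$demo" && grep -r -h "google_ad_client" . \
  | sed 's/.*\(pub-[0-9]*\).*/\1/' | sort | uniq -c)
echo "$ids"

# Number of different IDs found (tr strips padding some wc versions add).
distinct=$(printf '%s\n' "$ids" | wc -l | tr -d ' ')
echo "distinct IDs: $distinct"
```

If everything was pasted correctly you would expect a single ID in the tally; more than one line is worth a closer look.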
Similar to the earlier check for missing “description” meta tags, assuming each of those tags sits on a single line, you can easily list all of them with:
grep -r "meta name=.description" *.*
What would you like to search for today?