Commandline

Bots that impersonate Googlebot

Anyone can act like a bot just by using the Googlebot user agent in a request. Sometimes crawlers do that to see what other bots might see. Sometimes it’s to circumvent robots.txt directives that apply to them, but not to Googlebot. Sometimes people hope to catch a glimpse of cloaking. Whatever the reason, these kinds of requests can be annoying since they make log file analysis much harder. Motivation for this excursion:

Check if a static site is moved correctly

The lazy person’s guide to confirming that a move to a static site worked.
Overview:
Download all relevant URLs from Search Console
Convert download to a URL list
Check for http to https redirects
Check for valid final URLs
Download all relevant URLs
I’m picking one approximate source of truth - the URLs that received impressions in Google Search. This list doesn’t need to be comprehensive, just something more than I’d manually pick.

Mirroring a website for use on static hosting

If you’ve been following along, or probably not, since none of this is live yet, I’ve been moving some of my random old sites over to static hosting to simplify life. Static hosting doesn’t solve everything, and doesn’t protect your cheese, but it’s cheap & carefree (at least, until your hoster deprecates their static hosting). Finding new places to put static hosting is pretty straightforward too. I use Firebase static hosting for this site at the moment; that’s what this post covers.

Converting ancient content into markdown

So you want to get that ancient blog post or article that you put up and convert it into Markdown? This is what has worked for me. YMMV.
Find the content
Crawl and mirror the site to get everything
This is kinda easy, and makes it possible for you to get all of the content that’s linked locally. It doesn’t make viewing them in a browser easy though. This uses wget.

Crawl a website to get a list of all URLs

Sometimes you just need a list of URLs for tracking. There exist loud & fancy tools to crawl your site, but you can also just use wget.
    wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort >urls.txt
or, if you just want the URL paths (instead of http://example.com/path/filename.htm just /path/filename.htm):
    wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort | sed 's/^.
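To see what the grep and awk steps are doing without crawling anything, here is a minimal sketch that runs them over a made-up wget log excerpt. The helper names `urls_from_wget_log` and `strip_host` are hypothetical, and the sed expression in `strip_host` is my own assumption for dropping the scheme and host, not the expression from the post (which is cut off above).

```shell
#!/bin/sh
# urls_from_wget_log: hypothetical helper. wget logs each request as
# "--<date> <time>--  <url>", so the URL is the third whitespace field.
urls_from_wget_log() {
  grep '^--' | awk '{print $3}' | sort
}

# strip_host: hypothetical helper that keeps only the URL path.
# Assumption: this sed expression is illustrative, not the post's own.
strip_host() {
  sed -E 's#^https?://[^/]+##'
}

# Made-up log excerpt in the shape wget writes to stderr.
sample_log='--2024-01-01 12:00:00--  http://example.com/path/filename.htm
Resolving example.com...
--2024-01-01 12:00:01--  http://example.com/about.htm'

# Print the sorted URL paths found in the sample log.
printf '%s\n' "$sample_log" | urls_from_wget_log | strip_host
```

With a real crawl, the sample log would be replaced by wget's stderr, as in the commands above.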

Command lines for the weirdest things

Just a collection of command line tweaks. These work on Linux, probably mostly on macOS too (and who knows, with things for Windows probably).
Basics
General command-line tips
Todo: find some other sources
Pipe tricks:
| more - show output paginated (use space to get to next page)
| less - similar to more, but with scroll up/down
| sort - sort the output lines alphabetically
| uniq - only unique lines (needs “sort” first)
| uniq -c - only unique lines + include the number of times each line was found (needs sort first)
| sort -nr - sort numerically (“n”) and reverse order (“r”, highest number first)
| wc -l - count number of lines
Searching for things (more grep options)
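The pipe tricks chain together nicely. As a small sketch, the following counts how often each line occurs and lists the most frequent first. The function name `count_lines` and the sample input are made up for the example.

```shell
#!/bin/sh
# count_lines: hypothetical helper. Reads lines on stdin and prints each
# distinct line with its count, most frequent first.
# uniq only collapses adjacent duplicates, hence the first sort.
count_lines() {
  sort | uniq -c | sort -nr
}

# Made-up input: one value repeated three times, one appearing once.
printf '%s\n' googlebot bingbot googlebot googlebot | count_lines
```

The exact column padding that `uniq -c` puts in front of the counts differs between implementations, so it is safer to parse the output with awk than to rely on fixed widths.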

Crawling all (most) of the web's robots.txt comments

Starting from this tweet, I hacked together a few-line robots.txt comment parser. I thought it was fun enough to drop here.
Crawling the web for all robots.txt file comments
    curl -s -N http://s3.amazonaws.com/alexa-static/top-1m.csv.zip \
      >top1m.zip && unzip -p top1m.zip >top1m.csv
    while read li; do
      d=$(echo $li|cut -f2 -d,)
      curl -Ls -m10 $d/robots.txt | grep "^#" | sed "s/^/$d: /" | tee -a allrobots.txt
    done < top1m.
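The comment-extraction step can be tried offline first. This is a minimal sketch, assuming a made-up robots.txt and a hypothetical helper named `robots_comments`; it applies the same grep and sed idea to stdin instead of a live fetch.

```shell
#!/bin/sh
# robots_comments: hypothetical helper. Reads a robots.txt on stdin and
# prints its comment lines, each prefixed with the domain passed as $1.
# Assumes the domain contains no "/" (it would clash with the sed delimiter).
robots_comments() {
  grep '^#' | sed "s/^/$1: /"
}

# Made-up robots.txt content with two comment lines.
printf '%s\n' '# hello crawlers' 'User-agent: *' 'Disallow: /private' '# bye' \
  | robots_comments example.com
```

Only lines that start with `#` are kept, so trailing comments after a directive would be missed by this approach.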

Using Curl to add rows to a Google Spreadsheet without using an API

Adding content to a Google Spreadsheet usually requires using the Spreadsheet API (archive.org), getting auth tokens, and tearing out 42 pieces of hair or more. If you just want to use Google Spreadsheets to log some information for you (append-only), a simple solution is to use a Google Form (archive.org) to submit the data. To do that, you just need to POST data using the field names, and you’re done. The data is stored in your spreadsheet, you even get a timestamp for free.

Confirm that you're using Analytics on all pages

Here’s something from my mailbox - someone wanted to know how he could crawl his site and confirm that all of his pages really have the Google Analytics (archive.org) tracking-code on them. WordPress users have it easy, there are plugins that handle it automatically. Sometimes it’s worth asking nicely :) - let me show you how I did it. As a bonus, I’ll also show how you can check the AdSense ID on your pages, if you’re worried that you copy/pasted it incorrectly.

How to use Google webmaster tools stats with Excel

Google’s webmaster tools (archive.org) has a neat feature that lets you download your query and click statistics (once you have verified ownership of your site). The data you can get from there is quite comprehensive, but hard to break down for use in Excel. As a fun exercise I put together a small Python script that takes the CSV file downloaded from your webmaster tools account and turns it into new CSV files for queries and for clicks (both with the position numbers as well).