Crawling all (most) of the web's robots.txt comments

Starting from this tweet … View tweet … I hacked together a few-lined robots.txt comment parser. I thought it was fun enough to drop here. Crawling the web for all robots.txt file comments curl -s -N \ > && unzip -p >top1m.csv while read li; do d=$(echo $li|cut -f2 -d,) curl -Ls -m10 $d/robots.txt | grep "^#" | sed "s/^/$d: /" | tee -a allrobots.txt done < top1m.

Staging Site indexing

So you got your staging site indexed? Happens to everyone. Here’s a rough guide on fixing it, and suggestions for preventing it. (thought I’d write this up somewhere) The fastest way to get the staging site removed from search is remove it via Search Console. For that, you need to verify ownership via Search Console [1] (ironically, this means you’ll likely have to make it accessible to search engines again, or figure out DNS verification, which isn’t that common but also not that hard).

robotted resources

I see a bunch of posts about the robotted resources message that we’re sending out. I haven’t had time to go through & review them all (so include URLs if you can :)), but I’ll spend some time double-checking the reports tomorrow. Looking back a lot of years, blocking CSS & JS is something that used to make sense when search engines weren’t that smart, and ended up indexing & ranking those files in search.


I noticed there’s a bit of confusion on how to tweak a complex robots.txt file (aka longer than two lines :)). We have awesome documentation (of course :)), but let me pick out some of the parts that are commonly asked about: Disallowing crawling doesn’t block indexing of the URLs. This is pretty widely known, but worth repeating. More-specific user-agent sections replace less-specific ones. If you have a section with “user-agent: *” and one with “user-agent: googlebot”, then Googlebot will only follow the Googlebot-specific section.