crawling

Bots that impersonate Googlebot

Anyone can act like a bot just by using the Googlebot user-agent in a request. Sometimes crawlers do that to see what other bots might see. Sometimes it’s to circumvent robots.txt directives that apply to them, but not to Googlebot. Sometimes people hope to get a glimpse of cloaking. Whatever the reason, these kinds of requests can be annoying since they make log file analysis much harder, which was the motivation for this excursion.
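One way to separate the real Googlebot from impersonators (not from the original post, just a sketch of the usual approach) is a reverse DNS lookup on the IP address from your logs, followed by a forward lookup to confirm it: genuine Googlebot requests come from IPs that resolve to googlebot.com (or google.com) hostnames. The IP below is only an example:

  # Reverse DNS lookup on the IP address from your access log
  host 66.249.66.1
  # 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

  # Forward lookup on that hostname; it should point back to the same IP
  host crawl-66-249-66-1.googlebot.com
  # crawl-66-249-66-1.googlebot.com has address 66.249.66.1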

Crawl a website to get a list of all URLs

Sometimes you just need a list of URLs for tracking. There exist loud & fancy tools to crawl your site, but you can also just use wget:

  wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort >urls.txt

or, if you just want the URL paths (instead of http://example.com/path/filename.htm just /path/filename.htm):

  wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort | sed 's/^.*\/\/[^\/]*//' >urls.txt

crawl budget & 404s

I have a large site and removed lots of irrelevant pages for good. Should I return 404 or 410? What’s better for my “crawl budget”? (more from the depths of my inbox) The 410 (“Gone”) HTTP result code is a clearer sign that these pages are gone for good, and generally Google will drop those pages from the index a tiny bit faster. However, 404 vs 410 doesn’t affect the recrawl rate: we’ll still occasionally check to see if these pages are still gone, especially when we spot a new link to them.
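If you’re not sure which of the two your server is actually sending for those removed pages (a quick sketch, not from the original post; the URL is a placeholder), you can check the status line directly:

  # Show only the status line for a removed page
  curl -sI https://example.com/removed-page | head -1
  # HTTP/1.1 410 Gone        – slightly clearer "gone for good" signal
  # HTTP/1.1 404 Not Found   – also fine; recrawling behaves the same either way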

307s

HTTPS & HSTS: 301, 302, or 307? If the combination of these letters & numbers mean anything to you, you might be curious to know why Chrome shows you a 307 redirect for HSTS pages. In the end, it’s pretty easy. After seeing the HTTPS URL with the HSTS header (for example, with any redirect from the HTTP version), Chrome will act like it’s seeing a 307 redirect the next time you try to access the HTTP page.

A search-engine guide to 301, 302, 307, & other redirects

It’s useful to understand the differences between the common kinds of redirects, so that you know where to use them (and can recognize when they’re used incorrectly). Luckily, when it comes to Google, we’re pretty tolerant of mistakes, so don’t worry too much :). In general, a redirect is between two pages, here called R & S (it also works for pages called https://example.com/filename.asp, or pretty much any URL). Very simplified: when you call up page R, it tells you that the content is at S, and browsers then show the content of S right away.
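To make the mechanics concrete (a sketch, not from the original post; the URLs are placeholders), here’s what that exchange looks like on the wire: R answers with a redirect status code and a Location header pointing at S, and the client then fetches S.

  # Request page R without following redirects
  curl -sI https://example.com/r
  # HTTP/1.1 301 Moved Permanently
  # Location: https://example.com/s

  # -L makes curl follow the Location header, like a browser would
  curl -sIL https://example.com/r | grep '^HTTP'
  # HTTP/1.1 301 Moved Permanently
  # HTTP/1.1 200 OK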

Soft-404s & your site

We call a URL a soft-404 when it’s essentially a placeholder for content that no longer exists, but doesn’t return a 404 status code. Using soft-404s instead of real 404s is a bad practice, and it makes things harder for our algorithms – and often confuses users too. We’ve been talking about soft-404s since “forever,” here’s a post from 2008: http://googlewebmastercentral.blogspot.com/2008/08/farewell-to-soft-404s.html for example. In 2010 we added information about soft-404s to Webmaster Tools (http://googlewebmastercentral.
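One quick way to check whether your site produces soft-404s (a sketch, not from the original post; the URL is a made-up placeholder) is to request a page that clearly shouldn’t exist and look at the status code it returns:

  # Print only the HTTP status code for a URL that definitely doesn't exist
  curl -s -o /dev/null -w '%{http_code}\n' https://example.com/this-page-should-not-exist-12345
  # A 200 here (with a "not found" page shown to users) is a classic soft-404;
  # the server should answer with 404 (or 410) instead.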

robotted resources

I see a bunch of posts about the robotted resources message that we’re sending out. I haven’t had time to go through & review them all (so include URLs if you can :)), but I’ll spend some time double-checking the reports tomorrow. Looking back over the years, blocking CSS & JS used to make sense when search engines weren’t that smart and ended up indexing & ranking those files in search.
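If you want to check whether your own robots.txt is what’s blocking those resources (a sketch, not from the original post; the URL and directory names are just common examples), grepping the file for rules that touch CSS & JS is a quick first pass:

  # Look for disallow rules that could block stylesheets, scripts, or asset directories
  curl -s https://example.com/robots.txt | grep -iE 'disallow:.*(\.css|\.js|/assets|/static|/wp-includes)'
  # Any matches are worth re-checking, so that Googlebot can render your pages properly.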

robots.txt

I noticed there’s a bit of confusion on how to tweak a complex robots.txt file (aka longer than two lines :)). We have awesome documentation (of course :)), but let me pick out some of the parts that are commonly asked about:
- Disallowing crawling doesn’t block indexing of the URLs. This is pretty widely known, but worth repeating.
- More-specific user-agent sections replace less-specific ones. If you have a section with “user-agent: *” and one with “user-agent: googlebot”, then Googlebot will only follow the Googlebot-specific section.
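As a small, made-up example of that last point (not from the original post): with a robots.txt like the one below, Googlebot only reads the googlebot group, so it ignores the /private/ rule entirely but still stays out of /tmp/; all other crawlers only read the * group and are blocked from both.

  user-agent: *
  disallow: /private/
  disallow: /tmp/

  user-agent: googlebot
  disallow: /tmp/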

429 or 503

Here’s one for fans of the HTTP protocol – should I use 429 or 503 when the server is overloaded? It used to be that we’d only see 503 as a temporary issue, but nowadays we treat them both about the same. We see both as a temporary issue, and tend to slow down crawling if we see a bunch of them. If they persist for longer and don’t look like temporary problems anymore, we tend to start dropping those URLs from our index (until we can recrawl them normally again).
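For illustration (a sketch, not from the original post; the URL and numbers are placeholders), an overloaded server could answer either way, optionally with a Retry-After header as a hint for when to come back:

  curl -sI https://example.com/some-page
  # HTTP/1.1 503 Service Unavailable    (or: HTTP/1.1 429 Too Many Requests)
  # Retry-After: 3600
  # Both read as "temporary problem, back off and try again later".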

503s

Dear webmasters, if something goes drastically wrong with your hosting provider, and you can’t host your website anymore, please return a “503 Service Unavailable” HTTP result code. Doing so helps search engines understand what’s up – they’re generally more than happy to give your site some time to catch up again. Returning an error page with “200 OK” will result in us indexing that error content instead (and if all of your pages return the same error page, then we may assume that these URLs are duplicates).
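If you want to double-check what your site sends during such an outage (a sketch, not from the original post; example.com is a placeholder), one request is enough:

  # While the site is down, look at the status line of the response
  curl -sI https://example.com/ | head -1
  # Good: HTTP/1.1 503 Service Unavailable   (search engines will retry later)
  # Bad:  HTTP/1.1 200 OK                    (the error page itself would get indexed)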