I noticed there’s a bit of confusion about how to tweak a complex robots.txt file (aka longer than two lines :)). We have awesome documentation (of course :)), but let me pick out some of the parts that are commonly asked about:

- Disallowing crawling doesn’t block indexing of the URLs. This is pretty widely known, but worth repeating.
- More-specific user-agent sections replace less-specific ones. If you have a section with “user-agent: *” and one with “user-agent: googlebot”, then Googlebot will only follow the Googlebot-specific section.
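To make the second point concrete, here’s a hypothetical robots.txt (the paths are made up for illustration):

```
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
```

With this file, Googlebot only follows its own section: it skips /drafts/ but will happily crawl /private/, because the “user-agent: *” section no longer applies to it. If you want Googlebot to skip both, repeat the rule inside the Googlebot section.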
For those of you using hreflang for international pages: Make sure any rel=canonical you specify matches one of the URLs you use for the hreflang pairs. If the specified canonical URL is not a part of the hreflang pairs, then the hreflang markup will be ignored. In this case, if you want to use the “?op=1” URLs as canonicals, then the hreflang URLs should refer to those URLs. Alternatively, if you want to use just “/page”, then the rel=canonical should refer to that too.
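As a sketch (the example.com URLs are placeholders), a consistent setup for the “?op=1” case would look like this: the canonical URL is itself one of the hreflang alternates.

```html
<link rel="canonical" href="https://example.com/page?op=1">
<link rel="alternate" hreflang="en" href="https://example.com/page?op=1">
<link rel="alternate" hreflang="de" href="https://example.com/de/page?op=1">
```

If the canonical instead pointed at https://example.com/page while the hreflang entries used the “?op=1” variants, the canonical wouldn’t match any member of the hreflang set, and the hreflang markup would be ignored.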
Sometimes it’s a hassle to track auth data for the Google Spreadsheet API. Here’s a quick hack using Google Forms to post data to a Spreadsheet (similar to the previous post that uses Curl). You can use it as a function in your code, or as a simple command-line tool. Gist (archive.org)

```python
#!/usr/bin/python
"""Posts to a Google Sheet using a Form"""

import re
import sys
import urllib
import urllib2


def get_field_ids(form_url):
    """Returns list of field IDs on the form."""
```
Adding content to a Google Spreadsheet usually requires using the Spreadsheet API (archive.org), getting auth tokens, and tearing out 42 pieces of hair or more. If you just want to use Google Spreadsheets to log some information for you (append-only), a simple solution is to use a Google Form (archive.org) to submit the data. To do that, you just need to POST data using the field names, and you’re done. The data is stored in your spreadsheet, and you even get a timestamp for free.
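A minimal sketch of that POST (in Python 3, unlike the Python 2 gist above). The form URL and the “entry.*” field IDs are placeholders — copy the real ones out of your own form’s HTML:

```python
# Append a row to a Google Sheet by POSTing to its linked Google Form.
# FORM_ID and the entry.* field IDs below are hypothetical.
from urllib.parse import urlencode
from urllib.request import urlopen


def build_post_data(fields):
    """Encode a {field_id: value} dict as a form POST body."""
    return urlencode(fields).encode("utf-8")


def log_to_sheet(form_post_url, fields):
    """Submit one row; the response body can be ignored."""
    urlopen(form_post_url, build_post_data(fields))


# Usage (hypothetical IDs):
# log_to_sheet("https://docs.google.com/forms/d/FORM_ID/formResponse",
#              {"entry.1000000": "event", "entry.1000001": "details"})
```

Since the form accepts a plain POST, anything that can send one (curl, a cron job, an error handler) can append rows this way.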
Sometimes you don’t need to host code, you just want to post it in a blog post. Google Code Prettify (archive.org) does this really well, either per post, or across the blog.

1. Copy the script tag. Here’s what you need to copy into either your template, or into your post:

<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script>

2. HTML-encode your code. There are a bunch of HTML encoders (archive.org) online; I haven’t found one that I’m really a fan of.
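If you’d rather skip the online encoders, Python’s standard library can do the escaping just as well (a quick local stand-in, nothing specific to Prettify):

```python
# Escape code so it can be pasted into an HTML blog post.
import html


def encode_for_html(code):
    """Escape &, <, > and quotes in a code snippet."""
    return html.escape(code)


print(encode_for_html('if x < 3: print("hi")'))
# -> if x &lt; 3: print(&quot;hi&quot;)
```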
Here’s one for fans of the hypertext HTTP protocol – should I use 429 or 503 when the server is overloaded? It used to be that we’d only see 503 as a temporary issue, but nowadays we treat them both about the same. We see both as a temporary issue, and tend to slow down crawling if we see a bunch of them. If they persist for longer and don’t look like temporary problems anymore, we tend to start dropping those URLs from our index (until we can recrawl them normally again).
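The server side of that can be sketched in a few lines; the boolean here is a placeholder for whatever load signal your server actually measures:

```python
# Sketch: pick a response for an overloaded server. 429 and 503 are
# treated about the same by crawlers, so either status works; a
# Retry-After header is a polite hint either way.
def overload_response(overloaded, status=503):
    """Return (status_code, headers) for an incoming request."""
    if overloaded:
        return status, {"Retry-After": "120"}
    return 200, {}
```

The important part is that these stay *temporary*: once the overload passes, normal 200 responses should come back, or the affected URLs risk being dropped until they’re recrawled.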
It feels like it’s time to reshare this again. There still is no inherent ranking advantage to using the new TLDs. They can perform well in search, just like any other TLD can perform well in search. They give you an opportunity to pick a name that better matches your web presence. If you see posts claiming that early data suggests they’re doing well, keep in mind that this is not due to any artificial advantage in search: you can make a fantastic website that performs well in search on any TLD.
I’ve been involved since we first started testing authorship markup and displaying it in search results. We’ve gotten lots of useful feedback from all kinds of webmasters and users, and we’ve tweaked, updated, and honed the recognition and display of authorship information. Unfortunately, we’ve also observed that this information isn’t as useful to our users as we’d hoped, and can even distract from those results. With this in mind, we’ve made the difficult decision to stop showing authorship in search results.
One person’s time savings (archive.org) result in another person’s cursing.