Mirroring a website for use on static hosting

If you’ve been following along, or probably not, since none of this is live yet, I’ve been moving some of my random old sites over to static hosting to simplify life. Static hosting doesn’t solve everything, and doesn’t protect your cheese, but it’s cheap & carefree (at least, until your hoster deprecates their static hosting). Finding a new place to host static files is pretty straightforward too.

I use Firebase static hosting for this site at the moment, so that’s what this post covers. YMMV. I’m sure there are easier, better, or even proper ways to do this, but I don’t care - this works for me, so I’m writing it down.

Note: any dynamic URLs (with ? in them) won’t work on static hosting. The “script” URL can show something, but it won’t be dependent on the ?-part. For example, if you have http://example.com/forum.php?t=123 then you can show something under http://example.com/forum.php, but not something different under http://example.com/forum.php?t=123. If you need content behind a “?”, you need either fancier static hosting (that allows pages with “?” in them), or a more traditional hoster after all.

My process in overview is:

  1. Download the full site’s source-code for safe-keeping. You could FTP in, or download a backup, whatever.
  2. Mirror the visible part of the website (this is the basis of your static hosting)
  3. Mirror the disconnected parts (some sites just don’t have good internal links)
  4. Merge & tweak
  5. Upload to Github for safe-keeping
  6. Upload to hoster
  7. Test :-)
  8. Set up new DNS, point registrar at it
  9. Tweak the settings
  10. Be surprised at how many steps it actually is, ugh.

Download, mirror, and dig for more

Downloading the old site depends on your hosting setup. I just download a backup .tar.gz file.

Mirroring the old site and converting the URLs is done with wget. There are lots of options, but this seems to work for me:

wget --mirror --convert-links --adjust-extension \
  --page-requisites --no-parent --restrict-file-names=windows \
  http://yourwebsite.com/

The options don’t make a 1:1 copy, so you may need to adjust the files afterwards. In particular, file extensions may change (--adjust-extension appends .html where needed), and default homepages may differ from your original site.

This creates a subdirectory called, surprisingly, yourwebsite.com.
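
Since anything behind a “?” won’t survive the move, it helps to see which mirrored files came from such URLs. Here is a minimal sketch, assuming the mirror was made with --restrict-file-names=windows as above, in which case wget writes “@” in place of “?” in local file names. The function name list_query_files is made up for this example, and files that legitimately have an “@” in their name will show up too.

```shell
# List mirrored files whose names came from query-string URLs.
# Assumption: wget ran with --restrict-file-names=windows, so a URL like
# /forum.php?t=123 was saved as forum.php@t=123.html
list_query_files() {
  mirror_dir=$1
  find "$mirror_dir" -type f -name '*@*' | sort
}
```

Run it as list_query_files yourwebsite.com and decide per file whether to keep one version under the plain script URL or drop it.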

To make sure I have everything, I check Google. A ‘site:-query’ will tell you +/- the URLs that are indexed, which gives you a sense of whether or not you found everything. It’s basically just ‘site:johnmu.com’ entered into Google (or whatever your domain name is). For example:

https://www.google.com/search?q=site:johnmu.com

If there are parts missing, just use the same wget command as before to get them. Make sure they end up in the same directory.
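
If the list of indexed URLs is long, comparing it by hand gets tedious. This is a rough sketch that checks a list of URL paths against the mirror directory. It assumes you’ve copied the paths into a text file yourself (one per line, like /about/ or /img/logo.png) and that paths ending in “/” were saved as index.html. The function name missing_from_mirror is hypothetical, and pages that wget renamed (for example by appending .html) will be reported as missing even though they’re there.

```shell
# Print every URL path from a list that has no matching file in the mirror.
# Assumptions: one path per line, each starting with "/"; paths ending
# in "/" are stored as index.html inside that directory.
missing_from_mirror() {
  list_file=$1
  mirror_dir=$2
  while IFS= read -r path || [ -n "$path" ]; do
    [ -n "$path" ] || continue            # skip blank lines
    case $path in
      */) local_file="$mirror_dir${path}index.html" ;;
      *)  local_file="$mirror_dir$path" ;;
    esac
    [ -e "$local_file" ] || printf '%s\n' "$path"
  done < "$list_file"
}
```

Usage would be something like missing_from_mirror urls.txt yourwebsite.com, then fetching whatever it prints with the wget command from above.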

If you deleted things (images? pages?) archive.org might have a copy, so it’s worth checking there.

Merge & mix

I open both the backup & the mirrored copy in a file explorer, and tweak away. Things often missed by mirroring include:

  • old versions of files not linked (often old images)
  • search engine verification tokens (usually not linked)
  • stupid files you should have deleted long ago but forgot
  • all those images you posted in forums and thought you lost
  • so many embarrassing things you hope nobody ever finds again. Delete delete delete.

If there are non-.html files that you used in the past (whatever.asp, whatever.php), then create a directory with the same name as the file, move the mirrored page into it, and rename that page to index.html. This makes /whatever.asp reachable via /whatever.asp/index.html, which most static hosting will resolve for you.
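
With more than a handful of these, a small helper saves some clicking. The sketch below is one way to do the file-to-directory shuffle; to_index_dir is a made-up name, and the optional second argument is there for the case where wget saved the page under a different name (say whatever.php.html) than the old URL.

```shell
# Turn a mirrored script page into <old-name>/index.html so the old URL resolves.
#   to_index_dir public/whatever.php
#   to_index_dir public/whatever.php.html public/whatever.php
to_index_dir() {
  src=$1              # the mirrored file
  dir=${2:-$1}        # the directory to create (defaults to the file's own name)
  [ -f "$src" ] || return 1
  tmp_name="$src.tmp.$$"
  mv "$src" "$tmp_name"              # free up the name for the directory
  mkdir -p "$dir"
  mv "$tmp_name" "$dir/index.html"
}
```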

If there are things you want to redirect, just replace the full HTML with a snippet like this (use your own URL, of course):

<html>
<script type="text/javascript">document.location="https://johnmu.com/#newpage";</script>
</html>
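
If you have a bunch of redirects, you could generate these stubs instead of pasting them. This is just a sketch: write_redirect is a hypothetical helper, and the meta refresh line is my own addition as a fallback for visitors without JavaScript, not part of the snippet above. It assumes the URL contains no double quotes.

```shell
# Overwrite a file with a small HTML stub that redirects to a new URL.
write_redirect() {
  target_file=$1
  new_url=$2
  cat > "$target_file" <<EOF
<html>
<head>
<meta http-equiv="refresh" content="0; url=$new_url">
<script type="text/javascript">document.location="$new_url";</script>
</head>
</html>
EOF
}
```

For example: write_redirect public/oldpage.html "https://johnmu.com/#newpage" (the file name is made up).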

As a final step, create a public folder and move all of the static content into it.

Set up on Github

You don’t have to put it on Github, but it’s a nice backup. Use a private repo if you like (I like).

Steps stolen from here:

  1. Create repo with no content (in the UI). Save the repo’s URL (eg, https://github.com/softplus/johannesmueller.com.git – this is the remote URL below)

  2. In the terminal, in the site’s folder, use these commands:

git init -b main # creates local repo
git add . # adds all files
git commit -m "First commit"
git remote add origin <REMOTE_URL> # sets the new remote
git remote -v # verifies the new remote URL
git push origin main # uploads

  3. Check the repo in the web UI

Now it’s on Github, but you haven’t set up the static hosting, so it’s only partially done.

Upload to Firebase hosting

Set up hosting & upload

I use the quick-start guide for Firebase hosting. Warning: depending on your usage, this might cost you money. It’s basically:

  1. Install & log in (only the first time) – see doc.

  2. Set up hosting:

firebase init hosting

Follow the steps there, create a new project (no periods or underscores, ‘-’ dashes are fine), confirm ‘public’ folder, say no to PWAs, no to Github, and no to overwriting your index.html.

  3. Upload:

firebase deploy --only hosting

Done!

Confirm it mostly works

Confirm that it works by visiting the hosting URL that the deploy command prints at the end.

Update Github for hosting & co:

git add .
git commit -m "hosting"
git push --set-upstream origin main

Crawl and compare

There’s more here, but you can essentially do this:

wget --mirror --delete-after --no-directories http://yourwebsite.com 2>&1 \
  | grep '^--' | awk '{print $3}' | sed 's/^.*\/\/[^\/]*\//\//' | sort -u >urls-live.txt

wget --mirror --delete-after --no-directories https://yourwebsite.web.app 2>&1 \
  | grep '^--' | awk '{print $3}' | sed 's/^.*\/\/[^\/]*\//\//' | sort -u >urls-staging.txt

diff urls-live.txt urls-staging.txt
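
The diff shows differences in both directions, but what usually matters is what’s on the live site and missing on staging. Here’s a small sketch for just that direction; missing_on_staging is a name I made up, and it assumes both files hold one path per line, as produced above. It doesn’t care whether the files are sorted.

```shell
# Print the paths that appear in the live list but not in the staging list.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file
missing_on_staging() {
  live_list=$1
  staging_list=$2
  grep -Fxv -f "$staging_list" "$live_list"
}
```

Call it as missing_on_staging urls-live.txt urls-staging.txt. Anything it prints is worth a look before flipping the DNS.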

Check that logging is active

I use Firebase Cloud Logging - which has a short setup-guide.

Usually it’s just a matter of clicking “next” a few times from your hosting control panel in Firebase.

Flip the DNS

I’m lazy, again, and just used the Google Cloud DNS setup. It’s probably overkill for my use, but it’s cheap and it’s all in the same place.

Get your server’s IP addresses

When setting up the DNS, you need to get the new server’s IP addresses. With Firebase hosting, I go to the main hosting console page and click on “add custom domain”, which sets things up (info).

It wants to verify the domain, which isn’t possible until we do the next step, but it also gives you the IP addresses.

Set up a new zone

From the main Cloud DNS page, you can go to the console to set up a new zone. It needs a billing account, since DNS is so processing-intense (just kidding, it’s probably cheaper to run than the hosting, but whatever). I put all my zones into the same project, which makes it easier for tracking & setup.

Transferring DNS zones is sometimes quirky, so I just set it up manually. You’ll need:

  • IP addresses of any subdomains hosted
  • Mail exchange (MX) information if you want to send / receive mail
  • Any funky DNS verification tokens you use

My process is:

  1. Get the IP addresses of the server
  2. Open the old DNS setup in a tab
  3. Create the new zone
  4. Copy individual entries
  5. Set up the new A entries for the server
  6. Add any verification tokens needed
  7. Include a fake CNAME (I use “gc.” and CNAME it to example.com) - a simple way to test if it’s live

Once you have the DNS zone set up, you can test to see if it works by explicitly querying for it in a terminal window:

dig yoursite.com @ns-cloud-b1.googledomains.com

Just pick one of the NS entries that was created and ask that server directly. Your NS server name will probably be different than mine.

If you see the “A” records for your site, it’s working. Woot. Celebrate. You can also query the test subdomain (the fake CNAME from the list above); it should show the CNAME entry.

During the testing phase, keep the TTL fairly low (I use the default 300 seconds = 5 minutes), so that you can tweak & check within a reasonable time. After testing, be sure to increase the TTL to the usual 1 day or so.
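
If eyeballing the dig output gets old, the check can be scripted. This is a sketch under a few assumptions: has_all_ips is a hypothetical helper, it expects the output of dig +short (one address per line) on stdin, and the IP addresses in the usage line are placeholders from the documentation range, not real Firebase addresses.

```shell
# Read "dig +short" output on stdin and check that every expected IP is in it.
# Returns non-zero and names the first address that is missing.
has_all_ips() {
  answer=$(cat)
  for ip in "$@"; do
    printf '%s\n' "$answer" | grep -Fxq "$ip" || {
      echo "missing A record: $ip" >&2
      return 1
    }
  done
}
```

Usage would look like: dig +short A yoursite.com @ns-cloud-b1.googledomains.com | has_all_ips 203.0.113.10 203.0.113.11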

Update your DNS settings

Now that your new DNS zone is up & running, and that your static site is working, it’s time to switch the DNS setting at the registrar. This will differ by registrar, so that’s left as an exercise for the reader. Haha.

If you’re using Google Cloud DNS, you’ll get 4 DNS servers. You theoretically just need 2 of them, but you might as well specify them all.

The update time will vary. You can check which DNS you currently get on your machine by using dig again. If you’re curious, you can first find an authoritative DNS server for the TLD by using dig com (if it’s a .com domain). Then, specifically query that DNS server for your domain name by using dig yourdomain.com @whatever-tld-server.net.

For example:

$ dig com

(...)
;; AUTHORITY SECTION:
com.			899	IN	SOA	a.gtld-servers.net. nstld.verisign-grs.com. 1618662625 1800 900 604800 86400
(...)

$ dig johannesmueller.com @a.gtld-servers.net

(...)
;; AUTHORITY SECTION:
johannesmueller.com.	172800	IN	NS	ns-cloud-b3.googledomains.com.
johannesmueller.com.	172800	IN	NS	ns-cloud-b1.googledomains.com.
johannesmueller.com.	172800	IN	NS	ns-cloud-b2.googledomains.com.
johannesmueller.com.	172800	IN	NS	ns-cloud-b4.googledomains.com.

;; ADDITIONAL SECTION:
ns-cloud-b3.googledomains.com. 172800 IN AAAA	2001:4860:4802:36::6b
ns-cloud-b3.googledomains.com. 172800 IN A	216.239.36.107
ns-cloud-b1.googledomains.com. 172800 IN AAAA	2001:4860:4802:32::6b
(...)

That initial push usually happens fairly quickly, but the update all the way to you will take a while. You can double-check in a browser by just trying the test-subdomain and seeing if it works (or do more dig’ing :-)).

(DNS not live)

(DNS is live)

If you’re under time pressure to try it out, you can force this by editing your local hosts file. It’s annoying and a source of much trouble, so I try to avoid it.

Tweak

Once things are up & running, it’s a good idea to tweak the settings everywhere. In particular, think about:

  • Set DNS TTL for all entries (we’re not using dynamic CDNs, so setting things to 1 hour - 1 day is probably ok; YMMV)
  • Set caching for hosted objects (Lighthouse will let you know)
  • Set appropriate security headers (Lighthouse can help you find them)
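
For a quick check without Lighthouse, you can look at the response headers directly. The sketch below reads headers on stdin and prints which of the names you pass in are absent; missing_headers is a made-up name, and it only checks that a header exists, not whether its value makes sense.

```shell
# Read HTTP response headers on stdin (e.g. from "curl -sI <url>") and
# print each requested header name that does not appear.
# Header names are matched case-insensitively.
missing_headers() {
  response=$(cat)
  for name in "$@"; do
    printf '%s\n' "$response" | grep -iq "^$name:" || printf '%s\n' "$name"
  done
}
```

For example: curl -sI https://yourwebsite.web.app/ | missing_headers Cache-Control X-Content-Type-Options X-Frame-Options Content-Security-Policy Referrer-Policy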

Here’s how I’ve set up my firebase.json for these sites:

{
  "hosting": {
    "public": "public",
    "ignore": [
      "firebase.json",
      "**/.*",
      "**/node_modules/**"
    ],
    "headers": [ {
      "source": "**/*.@(eot|otf|ttf|ttc|woff|font.css|woff2)",
      "headers": [ {
        "key": "Access-Control-Allow-Origin",
        "value": "*"
      } ]
    }, {
      "source": "**/*.@(jpg|jpeg|gif|png|webp|ttf|woff|woff2|ico|zip)",
      "headers": [ {
        "key": "Cache-Control",
        "value": "max-age=604800"
      } ]
    }, {
      "source": "**/*.@(css|js|json)",
      "headers": [ {
        "key": "Cache-Control",
        "value": "max-age=604800"
      } ]
    }, {
      "source": "404.html",
      "headers": [ {
        "key": "Cache-Control",
        "value": "max-age=604800"
      } ]
    },
    {
      "source": "**",
      "headers": [ {
        "key": "X-Content-Type-Options",
        "value": "nosniff"
      }, {
        "key": "X-Frame-Options",
        "value": "SAMEORIGIN"
      }, {
        "key": "X-XSS-Protection",
        "value": "1; mode=block"
      }, {
        "key": "Content-Security-Policy",
        "value": "default-src 'self'; child-src 'none'; script-src https://cdnjs.cloudflare.com 'unsafe-inline'; style-src 'self' 'unsafe-inline'"
      }, {
        "key": "Referrer-Policy", 
        "value": "origin-when-cross-origin"
      }
      ]
    } ]
  }
}

(Gist)

Most of the items are cached for a week (604800 seconds), so you could fold them together. Maybe I will, maybe I won’t. In particular, the font files are unlikely to change for a really long time (and if they do change, it won’t break things if users see the old font first).

The CSP (Content-Security-Policy) includes cdnjs.cloudflare.com, where some pages currently get their jQuery files from. Adjust to include any external sources that you use. A simple way to tell is to check your pages in a browser and look at the dev-console for errors.
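
To get a head start before opening the dev-console, you can grep the static files for external hosts. This is a rough sketch with some blind spots: external_hosts is a hypothetical name, it only looks at double-quoted src= attributes in .html files (so scripts, images and iframes), and it won’t notice stylesheets pulled in via link href or anything loaded from within CSS or JavaScript.

```shell
# List the external hosts referenced by src="..." attributes in the
# .html files under a directory. Matches http://, https:// and // URLs.
external_hosts() {
  site_dir=$1
  find "$site_dir" -type f -name '*.html' \
    -exec grep -hoE 'src="(https?:)?//[^/"]+' {} + \
    | sed 's|^.*//||' \
    | sort -u
}
```

Running external_hosts public gives a list of host names to consider for the policy.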

Done!

Give it a week, check the stats, and then delete the old hosting. Move on. Life’s too short to hold on to unnecessary duplication.

There’s always more, of course. But not in this post.
