Converting ancient content into Markdown

So you want to get that ancient blog post or article that you put up and convert it into Markdown? This is what has worked for me. YMMV.

Find the content

Crawl and mirror the site to get everything

This is kinda easy, and gets you a local copy of all of the content that’s linked. A raw mirror isn’t easy to view in a browser, though, which is what some of the options below help with. This uses wget. Having wget makes downloading content much easier too, so if you don’t have that installed, then go grab it.
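
On a Mac with Homebrew, or on Debian/Ubuntu, that’s a one-liner (adjust for whatever package manager you actually use):

# macOS, via Homebrew
brew install wget

# Debian / Ubuntu
sudo apt-get install wget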

wget --mirror --convert-links --adjust-extension --page-requisites \
  http://yourancientsite.com/

The command line options here do:

  • --mirror: follow the links recursively
  • --convert-links: converts the links in the downloaded HTML files to point at your local copies, so that you can view the pages locally
  • --adjust-extension: when an HTML page is downloaded, save it as a .html file. Some sites use funky parameters or other extensions; this keeps the files viewable locally.
  • --page-requisites: also grabs things like CSS, images, etc., so that you can view the whole page locally
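
If the old server is slow or fragile, wget can also be told to pause between requests. Something like this should work (the --wait/--random-wait flags are just a suggestion, not part of the original recipe):

wget --mirror --convert-links --adjust-extension --page-requisites \
  --wait=1 --random-wait http://yourancientsite.com/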

Here’s the full wget manual.

You can then zip up the files that you get and now you have an archive. Obviously, large sites = lots of content. This is particularly useful for working offline or when your ancient site is ancient slow.
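
By default wget drops the mirror into a directory named after the host, so archiving it is something like this (the zip file name is just a placeholder):

# the mirror lives in a directory named after the site
zip -r ancient-site-archive.zip yourancientsite.com/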

Crawl to get just the URLs

I like to use Screaming Frog, a tool that crawls your site (mostly since my own crawler is kinda defunct now). For small sites, or for people who have a license for the tool already, that might be fine. For everything else, of course there are command lines. #PoorMansScreamingFrog

Using wget again:

wget --mirror --delete-after --no-directories http://yoursite.com \
   2>&1 | grep '^--' | awk '{print $3}' | sort >urls.txt

The options here:

  • --mirror: as before
  • --delete-after: don’t keep the downloaded files (we just want the URLs)
  • --no-directories: don’t create the directory structure (again, we just want URLs)

The funky stuff afterwards just filters wget’s output: grep keeps the request lines, awk pulls the URL out of each one, and sort sorts the resulting list, which is saved as “urls.txt”.
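
If you only want the pages and not the images, CSS, and so on, one more filter on urls.txt does the trick. A rough sketch (tweak the extension list to taste):

# drop obvious asset URLs, keep the pages
grep -Ev '\.(css|js|png|jpe?g|gif|ico)(\?|$)' urls.txt > page-urls.txt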

Search for traces

A bunch of the early sites I made weren’t actually linked together well, so I couldn’t just crawl or click through them to get it all. Haha. Sometimes the content is still indexed in Google, though, so you can try a [site:yoursite.com] search.
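
The queries are nothing fancy; the second one just assumes you remember a phrase you wrote:

site:yoursite.com
site:yoursite.com "a phrase you remember writing"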

Convert into Markdown

Unfortunately you can’t just convert a whole HTML page into Markdown, since it usually includes various cruft like headers, sidebars, footers, etc.
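
That said, if a page is fairly plain, or once you’ve trimmed it down to just the article body, a command-line converter like pandoc can do the conversion in bulk. This isn’t what I used here, just a sketch (it assumes pandoc is installed and the mirror lives in yourancientsite.com/):

# convert every .html file in the mirror into a .md file next to it
find yourancientsite.com -name '*.html' | while read -r page; do
  pandoc -f html -t gfm "$page" -o "${page%.html}.md"
done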

A simple solution for individual pages is to use a browser plugin that lets you select the part of a page you care about and convert just that. I used MarkDownload for Chrome, but I’m sure there are others.

Warning: since these extensions request access to all content on all pages, they can steal all your data without warning. They can do this within seconds of being installed, so just installing them briefly and removing them again won’t work. Instead, create a separate browser profile where you’re not logged in, and use that.

Comments / questions

There's currently no commenting functionality here. If you'd like to comment, please use Twitter and @me there. Thanks!
