wget

Mirroring a website for use on static hosting

If you’ve been following along (or probably not, since none of this is live yet), I’ve been moving some of my random old sites over to static hosting to simplify life. Static hosting doesn’t solve everything, and doesn’t protect your cheese, but it’s cheap & carefree (at least, until your host deprecates their static hosting). Finding new places to put static sites is pretty straightforward too. I use Firebase static hosting for this site at the moment, and that’s what this post covers.

Converting ancient content into markdown

So you want to get that ancient blog post or article that you put up and convert it into Markdown? This is what has worked for me. YMMV. Find the content: crawl and mirror the site to get everything. This is kinda easy, and makes it possible for you to get all of the content that’s linked locally. It doesn’t make viewing it in a browser easy though. This uses wget.
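A minimal sketch of the mirroring step, assuming a POSIX shell and GNU wget. The function name `mirror_site` is made up for illustration, and `--no-parent`, `--page-requisites`, `--adjust-extension` and `--convert-links` are additions here rather than options confirmed by the post; the last two are aimed at the browser-viewing problem mentioned above.

```shell
# Hypothetical helper, not from the original post: mirror one site into the
# current directory so the pages can be converted later.
mirror_site() {
  # --mirror            recursive download with timestamping, unlimited depth
  # --no-parent         never climb above the starting directory
  # --page-requisites   also fetch the images/CSS/JS the pages need
  # --adjust-extension  append .html to HTML pages that lack the suffix
  # --convert-links     rewrite links so they point at the local copies
  wget --mirror --no-parent --page-requisites \
       --adjust-extension --convert-links "$1"
}
```

Something like `mirror_site http://your.website.com/` should leave a `your.website.com/` directory behind. Drop the last two options if you want the files exactly as the server sent them.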

Crawl a website to get a list of all URLs

Sometimes you just need a list of URLs for tracking. There exist loud & fancy tools to crawl your site, but you can also just use wget.

    wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort >urls.txt

or, if you just want the URL paths (instead of http://example.com/path/filename.htm just /path/filename.htm):

    wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort | sed 's/^.
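The second command is cut off in this excerpt, so here is a sketch of the same idea rather than the original one-liner. It assumes wget's usual log format, where each request line starts with `--` and carries the URL in the third field. The function name `urls_to_paths` and the `sed` expression are my own guesses at one way to strip the scheme and host.

```shell
# Hypothetical helper: read wget's log on stdin, print unique URL paths.
urls_to_paths() {
  # keep only the request lines, take the URL field,
  # drop "scheme://host", then sort and de-duplicate
  grep '^--' \
    | awk '{print $3}' \
    | sed 's#^[a-zA-Z]*://[^/]*##' \
    | sort -u
}
```

Usage would look like `wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | urls_to_paths >paths.txt`. A URL with no path at all (no trailing slash) comes out as an empty line.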

Command lines for the weirdest things

Just a collection of command line tweaks. These work on Linux, probably mostly on MacOS too (and who knows, with things for Windows probably).

Basics
General command-line tips
Todo: find some other sources

Pipe tricks:
| more - show output paginated (use space to get to next page)
| less - similar to more, but with scroll up/down
| sort - sort the output lines alphabetically
| uniq - only unique lines (needs “sort” first)
| uniq -c - only unique lines + include the number of times each line was found (needs sort first)
| sort -nr - sort numerically (“n”) and reverse order (“r”, highest number first)
| wc -l - count number of lines

Searching for things (more grep options)
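The pipe tricks above chain together nicely. Here is a small sketch that combines `sort`, `uniq -c` and `sort -nr` into a "most frequent lines first" counter. The function name `top_counts` is hypothetical.

```shell
# Hypothetical helper: read lines on stdin, print "count line" pairs with
# the most frequent line first.
top_counts() {
  # uniq only collapses adjacent duplicates, hence the first sort
  sort | uniq -c | sort -nr
}
```

For example, `awk '{print $1}' access.log | top_counts | less` would page through the first column of a log file ordered by how often each value shows up (`access.log` is just a placeholder name).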

Hackers stealing your PageRank

The last time I wrote about a hacked site, it was using a redirect that sent some users to a different site. This kind of hack is pretty common (even though it’s usually not as complex as mentioned in that post); it leverages the sad fact (archive.org) that users are often easy to trick and not browsing with protection (or a current browser (archive.org)). A different angle of attack is to redirect only search engine crawlers to a different site.

Confirm that you're using Analytics on all pages

Here’s something from my mailbox - someone wanted to know how he could crawl his site and confirm that all of his pages really have the Google Analytics (archive.org) tracking-code on them. WordPress users have it easy, there are plugins that handle it automatically. Sometimes it’s worth asking nicely :) - let me show you how I did it. As a bonus, I’ll also show how you can check the AdSense ID on your pages, if you’re worried that you copy/pasted it incorrectly.
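The excerpt stops before the actual method, so this is not necessarily how the post does it. As one possible sketch, assuming you already have a local mirror of the site (for instance from wget), you could list every HTML file that does not contain your ID. The function name `pages_missing_id` and the example ID are placeholders.

```shell
# Hypothetical helper: list HTML files under a directory that do NOT contain
# the given ID string (for example a tracking or AdSense ID).
# usage: pages_missing_id DIR ID
pages_missing_id() {
  find "$1" -type f \( -name '*.html' -o -name '*.htm' \) |
    while IFS= read -r page; do
      # -q: quiet, -F: treat the ID as a fixed string, not a regex
      grep -qF -- "$2" "$page" || printf '%s\n' "$page"
    done | sort
}
```

Running `pages_missing_id ./your.website.com 'UA-XXXXXXX-X'` with your own ID should print nothing if every page carries the code. File names containing newlines are not handled.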

The website hack you'd never find

Warning: do not try the URLs here unless your system is locked down properly. I suggest using a “virtual machine” (I use VMware) to test things like this. The hack itself is complicated, the system is simple - skip the complicated part if you’re in a hurry. It all started with a posting (archive.org) like this: When I do a google search for [Jonathan Wentworth Associates] the first result is: Jonathan Wentworth Associates, LTD Welcome to Jonathan Wentworth Associates, a respected resource for world-class orchestral soloists, conductors, opera, chamber music, chamber orchestras, .

Check your web pages for hacks and unauthorized changes

Websites have become popular targets for hackers, who either try to add elements that automatically download “malware” (viruses, etc) or try to add hidden links (SEO hacking) to other websites. Quite often, these kinds of changes are not recognized by the webmaster or website owner. You could wait until a visitor complains to you or you receive a mail from Google for spreading malware (or having hidden links to “bad places”), but that is slow, unreliable and usually too late.
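One way to catch changes yourself, sketched under a few assumptions: you keep a local copy of the pages (a wget mirror would do), you have GNU coreutils' `sha256sum` (on macOS, `shasum -a 256` is the usual stand-in), and your file names contain no spaces. The function names are made up, and this is not necessarily the approach the post describes.

```shell
# Hypothetical helpers, assuming GNU coreutils' sha256sum.

# Print "hash  ./path" for every file under DIR, in a stable order.
snapshot_dir() {
  ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# Print paths that were added, removed, or modified since the baseline.
# usage: changed_files BASELINE_FILE DIR
changed_files() {
  # A line that occurs only once across baseline + fresh snapshot is a change.
  snapshot_dir "$2" | sort - "$1" | uniq -u | awk '{print $2}' | sort -u
}
```

Take a baseline once with `snapshot_dir ./site > baseline.txt` (keep the baseline file outside the directory), then run `changed_files baseline.txt ./site` after each fresh mirror. No output means nothing changed.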