Crawl a website to get a list of all URLs

Sometimes you just need a list of URLs for tracking. There are loud & fancy tools that will crawl your site, but you can also just use wget.

wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort >urls.txt
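This works because wget logs every request on a line that starts with two dashes and a timestamp, with the URL as the third whitespace-separated field; that's what the grep and awk pick out. You can see it on a hand-written sample log line (the timestamp and URL below are made up for illustration):

echo '--2024-01-01 12:00:00--  http://your.website.com/about.html' | grep '^--' | awk '{print $3}'
# prints: http://your.website.com/about.html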

Or, if you just want the URL paths (/path/filename.htm instead of http://example.com/path/filename.htm):

wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort | sed 's/^.*\/\/[^\/]*\//\//' >urls.txt
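The sed expression deletes everything from the start of the line through the first / after the scheme's //, leaving just the path. You can sanity-check it on a single URL (the URL here is just an example):

echo 'http://example.com/path/filename.htm' | sed 's/^.*\/\/[^\/]*\//\//'
# prints: /path/filename.htm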


Collecting a list of path/file URLs makes it easier to compare a staging site with the live site. For example, you could use this to see the difference in crawlable URLs:

wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort | sed 's/^.*\/\/[^\/]*\//\//' >urls-live.txt

wget --mirror --delete-after --no-directories http://staging.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort | sed 's/^.*\/\/[^\/]*\//\//' >urls-staging.txt

diff urls-live.txt urls-staging.txt
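Since the same pipeline gets typed twice, you could also wrap it in a small shell function; a minimal sketch, where crawl_paths is a made-up name:

# Hypothetical helper: crawl a site and print its sorted URL paths.
crawl_paths() {
  wget --mirror --delete-after --no-directories "$1" 2>&1 \
    | grep '^--' | awk '{print $3}' | sort \
    | sed 's/^.*\/\/[^\/]*\//\//'
}

crawl_paths http://your.website.com >urls-live.txt
crawl_paths http://staging.website.com >urls-staging.txt
diff urls-live.txt urls-staging.txt

In the diff output, lines prefixed with < exist only on the live site, and lines prefixed with > exist only on staging.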
