Command lines for the weirdest things

Just a collection of command-line tweaks. These work on Linux, and mostly on macOS too (on Windows they should work under WSL).


Basics

General command-line tips

Todo: find some other sources

Pipe tricks:

| more          - show output paginated (press space for the next page)
| less          - like more, but with scrolling up/down
| sort          - sort the output lines alphabetically
| uniq          - only unique lines (needs sorted input, so "sort" first)
| uniq -c       - only unique lines, prefixed with how often each line was found (needs "sort" first)
| sort -nr      - sort numerically ("n") in reverse order ("r", highest number first)
| wc -l         - count the number of lines
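As a quick demo of how these combine (printf here just generates sample input):

```shell
# Count duplicate lines, most frequent first
printf 'apple\nbanana\napple\n' | sort | uniq -c | sort -nr
```

This prints "2 apple" first, then "1 banana".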

Searching for things (more grep options)

| grep "something"      - show only lines that contain "something"
| grep -v "something"   - show only lines that don't contain "something"
| grep -E "so..me?t.+"  - use an extended regex to search
| grep -i "something"   - case-insensitive match
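For example, in an extended regex a dot matches exactly one character (sample input generated with printf):

```shell
# 'c.t' matches "cat" and "cut", but not "coat" (two characters between c and t)
printf 'cat\ncut\ncoat\n' | grep -E 'c.t'
```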

Sampling

| head          - show only first 10 lines
| head -n 20    - show only first 20 lines
| tail -n 20    - show only last 20 lines
| shuf -n 100   - Get 100 random entries from the input

Getting parts of URLs

| grep -oP "[a-z]+://[^/]*/"            - extract just the protocol & hostname from a URL
| grep -oP "[a-z]+://[^/]*/[^/]*/"      - extract just the hostname & protocol + first path part from a URL
| sed 's/^.*\/\/[^\/]*\//\//'           - extract just relative part of URL
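A quick check of what these return for a sample URL:

```shell
# Protocol & hostname
echo "https://example.com/blog/post-1/" | grep -oP "[a-z]+://[^/]*/"
# → https://example.com/

# Relative part only
echo "https://example.com/blog/post-1/" | sed 's/^.*\/\/[^\/]*\//\//'
# → /blog/post-1/
```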

Getting parts of inputs

| awk '{print $3}'      - Given a space-delimited list, just show item #3
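For example:

```shell
echo "alpha beta gamma delta" | awk '{print $3}'
# → gamma
```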

Math

| awk '{sum+=$1;n++} END {print sum/n}'             - print average of all numbers
| awk '{sum+=$1;n++} END {print sum/n; print n}'    - print average of all numbers + number of entries aggregated
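For example, averaging the numbers 1 through 10 (note that this divides by zero if the input is empty):

```shell
seq 1 10 | awk '{sum+=$1;n++} END {print sum/n}'
# → 5.5
```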

And … saving

> outfile.txt      - save the output in “outfile.txt”

Some combinations

Given a list of items, what are the top 5 most common entries and their counts?

cat input.txt | sort | uniq -c | sort -nr | head -n 5

How many 'Googlebot' entries were there in a log file? (This can be simplified to grep {something} {logfile}, but by starting with cat it's easier to test on just a part of the file first, eg by adding head in between.)

cat sept.log | grep Googlebot | wc -l
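To illustrate the head trick on a tiny made-up log (three lines, generated with printf; sept.log is not needed):

```shell
# Only the first two lines are scanned; one of them mentions Googlebot
printf 'a Googlebot x\nb Other y\nc Googlebot z\n' | head -n 2 | grep Googlebot | wc -l
# → 1
```

Once the pipeline looks right on a sample, drop the head and run it over the full file.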

Which requests in my log file from Googlebot resulted in a non-301 redirect? (I just grep for anything that contains “30x” surrounded by spaces, which is usually close enough)

cat sept.log | grep Googlebot | grep -E " 30. " | grep -v " 301 " | more

Top 10 URLs requested in my log file which resulted in non-301 redirect when Googlebot came (Again, grepping for “301” surrounded by spaces)

cat sept.log | grep Googlebot | grep -E " 30. " | grep -v " 301 " | awk '{print $7}' | sort | uniq -c | sort -nr | head -n 10

(This assumes Apache-style logging, with the 7th space-separated entry in the log file being the requested relative URL)

Top 10 smartphone Googlebot IP addresses (based on the useragent)

cat sept.log | grep "Googlebot" | grep "Mobile" | awk '{print $1}' | sort | uniq -c | sort -nr | head -n10

Top sizes returned for your homepage (is it kinda stable or fluctuating?) (Once again, grepping for the HTTP result code surrounded by spaces)

cat sept.log | grep "GET / " | grep " 200 " | awk '{print $10}' | sort | uniq -c | sort -nr | head -n10

Top URLs crawled by Googlebot vs requested by others

cat sept.log | grep Googlebot | grep " 200 " | awk '{print $7}' | sort | uniq -c | sort -nr | head -n 10
cat sept.log | grep -v Googlebot | grep " 200 " | awk '{print $7}' | sort | uniq -c | sort -nr | head -n 10

Top IPs requesting /wp-login – tsk tsk, hackers to block

cat sept.log | grep "/wp-login" | awk '{print $1}' | sort | uniq -c | sort -nr | head

Overall average response size & total “200” requests

cat sept.log | grep " 200 " | awk '{print $10}' | awk '{sum+=$1;n++} END {print sum/n; print n}'
57877.8
324438

Top 10 self-disclosed bots

cat sept.log | awk -F\" '{print $6}' | grep -i bot | sort | uniq -c | sort -nr | head -n10

Top referring domains

cat sept.log | awk -F\" '{print $4}' | grep -oP 'https?://[^/]*/?'  | sort | uniq -c | sort -nr | head -n10

Fake Googlebot IP addresses (in top 100 Googlebot request IPs)

cat sept.log | grep "Googlebot" | awk '{print $1}' | sort | uniq -c | sort -nr | head -n100 | awk '{print $2}' >ips

while read li; do nslookup $li | grep name | grep -v "googlebot.com.$" | sed -e "s/^/$li is bad: /" ; done <ips

46.229.173.68 is bad: 68.173.229.46.in-addr.arpa	name = bot.semrush.com.
46.229.173.66 is bad: 66.173.229.46.in-addr.arpa	name = bot.semrush.com.
54.149.84.83 is bad: 83.84.149.54.in-addr.arpa	name = ec2-54-149-84-83.us-west-2.compute.amazonaws.com.
46.229.173.67 is bad: 67.173.229.46.in-addr.arpa	name = bot.semrush.com.

Apache server logs common segments

Sample log entries

cat sept.log | shuf -n 5

45.139.48.28 - - [12/Sep/2020:11:31:14 -0400] "GET /firefox-multiplied/ HTTP/1.1" 200 26621 "https://valid-cc.com/" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
129.204.119.80 - - [28/Sep/2020:20:36:39 -0400] "HEAD /search/www.yumiaowb.cn HTTP/1.1" 200 - "-" "\"Mozilla/5.0 (Linux; U; Android 9; zh-CN; STF-AL10 Build/HUAWEISTF-AL10) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 UCBrowser/12.7.6.1056 Mobile Safari/537.36\""
168.62.172.115 - - [30/Sep/2020:22:34:33 -0400] "GET /+?rel=author HTTP/1.1" 301 270 "http://johnmu.com/+?rel=author" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
35.195.135.67 - - [01/Sep/2020:03:08:15 -0400] "GET /wp-login.php HTTP/1.1" 200 5002 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
173.231.59.195 - - [15/Sep/2020:23:34:36 -0400] "GET /twitter-indexing-peculiarities/ HTTP/1.1" 200 28408 "http://johnmu.com/" "Mozilla/5.0 (compatible; Adsbot/3.1)"

Segments for awk '{print $x}'

  • $1 - IP address
  • $4 - Datestamp (plus the opening [])
  • $6 - Request type (HEAD, GET, POST, etc) (plus opening ")
  • $7 - relative URL
  • $9 - HTTP response code
  • $10 - Response size
  • $11 - HTTP referrer (with quotes)

More parts with awk

  • awk -F\" '{print $4}' - HTTP referrer (without quotes)
  • awk -F\" '{print $6}' - Full user-agent
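A quick sanity check with a made-up log line (the fields assume Apache-style logging, as in the samples above):

```shell
line='1.2.3.4 - - [01/Sep/2020:00:00:00 -0400] "GET / HTTP/1.1" 200 123 "https://ref.example/" "SomeBot/1.0"'
echo "$line" | awk -F\" '{print $4}'   # HTTP referrer  → https://ref.example/
echo "$line" | awk -F\" '{print $6}'   # full user-agent → SomeBot/1.0
```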

Alternately with cut

cut -d"{char_delimiter}" -f{column_number}

… where column_number can be a single column or a comma-delimited list of columns. For instance: cut -d";" -f1,5
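For example, picking the first and fifth fields from a semicolon-delimited line:

```shell
echo "ip;date;method;url;status" | cut -d";" -f1,5
# → ip;status
```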

Crawling

Fetch the contents of a URL ("-s" = silent: don't show the progress meter or errors)

curl -s http://example.com/page

Specify a user-agent for curl

curl -A SomeUserAgent http://a.com/b

Crawl a site, generate a text file with all full URLs:

wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort >urls.txt

Crawl a site, just keep relative URLs:

wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort | sed 's/^.*\/\/[^\/]*\//\//' >urls.txt

Crawl a site, generate a list of URLs and HTTP status codes

wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | egrep '(^(--)|\s[0-9]{3,3}\s)' | awk '{print $3"\t"$6}' | egrep '^http|.*[0-9]{3,3}$' | xargs -n2 -d'\n'

Number of URLs in a sitemap file (or number of images)

curl -s http://example.com/sitemap.xml | sed "s/>/>\n/g" | grep "<loc" | wc -l
14
curl -s http://example.com/sitemap.xml | sed "s/>/>\n/g" | grep "<image:loc" | wc -l
0

(The sed part adds a newline after every closing XML bracket – some sitemaps don't use line breaks, which makes counting harder, so I just add them by default.)
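To see the effect on a made-up single-line sitemap fragment (without the sed, both <loc> tags would sit on one line and wc -l would count 1; the \n in the replacement needs GNU sed):

```shell
printf '<loc>https://example.com/a</loc><loc>https://example.com/b</loc>\n' |
  sed "s/>/>\n/g" | grep "<loc" | wc -l
# → 2
```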

All URLs in a sitemap file (assuming the <loc> entries have the URL on the same line)

curl -s http://example.com/sitemap.xml | grep loc | awk -F '>' '{print $2}' | awk -F '<' '{print $1}'
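A quick check with a made-up two-URL sitemap fragment (each <loc> on its own line, as the pipeline assumes):

```shell
printf '<url>\n  <loc>https://example.com/a</loc>\n  <loc>https://example.com/b</loc>\n</url>\n' |
  grep loc | awk -F '>' '{print $2}' | awk -F '<' '{print $1}'
# → https://example.com/a
#   https://example.com/b
```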

Extracting all URLs from text sitemaps linked in a sitemap index file

curl -s SITEMAPINDEXURL.XML | grep -oP "http.*txt" >si.txt
while read li; do curl -s $li | tee -a allurls.txt ; done <si.txt

Updates

2021-05-21 - Tweaked based on Quentin_Adt’s tweets


