A set of command-line Windows website tools

Posted on 30 August 2007 at 22:51 UTC, filed under Tricks, disclaimer

If you have to do things over and over again, it's a good idea to use a tool to make the work easier. Windows is somewhat limited (very limited, compared to Linux) when it comes to batch-file scripting, and wget can only do so much right out of the box, so I sat down and wrote a few command-line tools to help me with some of the website checks I like to do.

The tools I included in this set can do the following:

  • Check the result codes for a URL (and follow in the case of a redirect) – or for a list of URLs
  • Create a list of the links found on a URL (or just particular ones)
  • Create a list of the links and anchor texts found on a URL (or just particular ones)
  • Create a simple keyword analysis of the indexable content on a URL


You can get the download here (requires the Windows .NET runtime v1.1):

WebResult

This tool accesses a URL and shows the result code that was returned. If the status is a redirect, it will display the redirection location and optionally follow it to check the final result code. It may be used with a list of URLs. The output is tab-delimited.

Usage:
WebResult [options] (URL|urllist.txt)
Options:
--referer|-r [referrer] (default: none)
--user-agent|-u [user-agent] (default: "WebResult")
--follow-redirect|-f (default: off)
--headers|-h (displays the full response headers)
--verbose|-v

Example:
Check for correct canonical redirect:
WebResult http://johnmu.com/
WebResult http://www.johnmu.com/
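The redirect-following behaviour described above can be sketched in a few lines of Python. This is an illustrative rough equivalent, not the tool's actual source; the canned responses stand in for a real HTTP client, using the johnmu.com URLs from the example:

```python
# Sketch of WebResult's behaviour: report the status code for a URL and,
# if it is a redirect, optionally follow the Location header to the end.
def check_url(url, fetch, follow_redirects=False, max_hops=10):
    """Return tab-delimited rows of (url, status, location), WebResult-style.

    `fetch` is any callable returning (status_code, location_or_None);
    in real use it would wrap an HTTP request.
    """
    rows = []
    for _ in range(max_hops):
        status, location = fetch(url)
        rows.append(f"{url}\t{status}\t{location or ''}")
        if follow_redirects and status in (301, 302, 303, 307, 308) and location:
            url = location  # follow the redirect chain
        else:
            break
    return rows

# A canned fetcher standing in for a real HTTP client:
responses = {
    "http://johnmu.com/": (301, "http://www.johnmu.com/"),
    "http://www.johnmu.com/": (200, None),
}

rows = check_url("http://johnmu.com/", responses.get, follow_redirects=True)
for row in rows:
    print(row)
```

With the canned responses this prints the 301 hop followed by the final 200, one tab-delimited row per request.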

WebLinks

This tool lists the links that are found on a URL. Note that it has an integrated HTML/XHTML parser – if the code on the page is not fully compliant, the parser may not recognize all links (it is fairly fault-tolerant, though).

This tool can use a cached copy of the URL (saved by this tool or one of the others) to save bandwidth. The cached versions are stored in the user's temp folder.

You can choose to list only outbound links or only in-site links (to help simplify the output). Additionally, links with the HTML microformat rel="nofollow" may be marked as such. The output is in alphabetical order.

Usage:
WebLinks [options] (URL|urllist.txt)
Options:
--referer [referrer] (default: none)
--user-agent [user-agent] (default: "WebLinks")
--insite-only|-i (default: both in + out)
--outbound-only|-o (default: both in + out)
--ignore-nofollow|-n (default: off)
--cache|-c (default: off)
--verbose|-v (default: off)

Example:
Check the outbound links on a site.
WebLinks -o http://johnmu.com/
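The core of what WebLinks does – parse the page, collect the links, classify them as in-site or outbound, note rel="nofollow", and sort – can be sketched in Python with the standard-library HTML parser. Illustrative only, not the tool's actual source; the sample HTML and base URL are made up for the demonstration:

```python
# Sketch of WebLinks: collect <a href> links, classify in-site vs. outbound
# by comparing host names, flag rel="nofollow", and sort the output.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.links = []  # (absolute url, insite?, nofollow?)

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        href = a.get("href")
        if not href:
            return
        url = urljoin(self.base, href)  # resolve relative links
        insite = urlparse(url).netloc == urlparse(self.base).netloc
        nofollow = "nofollow" in (a.get("rel") or "")
        self.links.append((url, insite, nofollow))

html = ('<a href="/about">About</a>'
        '<a href="http://example.org/" rel="nofollow">Example</a>')
c = LinkCollector("http://johnmu.com/")
c.feed(html)
outbound = sorted(u for u, insite, _ in c.links if not insite)
print(outbound)
```

A real-world version would also need the fault tolerance mentioned above, since few pages are fully compliant HTML.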

WebAnchors

This tool lists the links and anchor text as found on a URL. It uses the same HTML/XHTML parser as WebLinks. It can be used to find certain links (based on the URL, domain name, URL-snippets, or even parts of the anchor text). If the anchor for a link is an image, it will use the appropriate ALT-text, etc.

Usage:
WebAnchors [options] (URL|urllist.txt)
Options:
--referer|-r [referrer] (default: none)
--user-agent|-u [user-agent] (default: "WebLinks")
--find-url|-f http://URL
--find-domain|-d DOMAIN.TLD
--find-anchor|-a TEXT
--find-url-snippet|-s TEXT
--url-only|-o (default: show anchor text as well)
--skip-nofollow|-n (default: off)
--cache|-c (default: off)
--verbose|-v (default: off)

Example:
Check the links with "Google" in the anchor text.
WebAnchors -a "Google" http://johnmu.com/
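The link-and-anchor pairing WebAnchors performs – including falling back to an image's ALT text when the anchor is an image – can be sketched like this. Again an illustrative rough equivalent, not the tool's source; the sample HTML is invented:

```python
# Sketch of WebAnchors: pair each link with its anchor text, using the
# ALT text when the anchor content is an image, then filter by a string.
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []   # (href, anchor text)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            self._href, self._text = a["href"], []
        elif tag == "img" and self._href and a.get("alt"):
            self._text.append(a["alt"])  # image anchor: use its ALT text

    def handle_data(self, data):
        if self._href:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None

html = ('<a href="http://www.google.com/">Google Search</a>'
        '<a href="/home"><img src="logo.png" alt="Home"></a>')
c = AnchorCollector()
c.feed(html)
matches = [(u, t) for u, t in c.pairs if "Google" in t]
print(matches)
```

With the sample HTML, the filter keeps only the Google link with its anchor text, while the image anchor is recorded under its ALT text "Home".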

WebKeywords

This tool does a simple keyword analysis of the indexable content of a URL. It also uses the above HTML/XHTML parser to extract the indexable text. It can count single-word keywords or multi-word phrases. The output is tab-delimited for re-use.

Usage:
WebKeywords [options] (URL|urllist.txt)
Options:
--referer|-r [referrer] (default: none)
--user-agent|-u [user-agent] (default: "WebLinks")
--verbose|-v (default: off)
--words|-w [NUM] (phrases with number of words, default: 1)
--ignore-numbers|-n (default: off)
--cache|-c (cache web page, default: off)

Example:
Extract 3-word keyphrases from a page:
WebKeywords -w 3 http://johnmu.com/
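The phrase counting itself is simple: slide a window of N words over the extracted text and tally each phrase. A minimal Python sketch, assuming the indexable text has already been extracted (the real tool does that with its HTML parser); the sample text is made up:

```python
# Sketch of WebKeywords' counting step: tokenize the indexable text,
# build N-word phrases, and tally them, most frequent first.
from collections import Counter
import re

def keyword_counts(text, words=1, ignore_numbers=False):
    """Return (phrase, count) pairs, most frequent first."""
    tokens = re.findall(r"[\w'-]+", text.lower())
    if ignore_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    phrases = (" ".join(tokens[i:i + words])
               for i in range(len(tokens) - words + 1))
    return Counter(phrases).most_common()

text = "command line tools for command line website checks"
counts = keyword_counts(text, words=2)
for phrase, count in counts:
    print(f"{phrase}\t{count}")  # tab-delimited, like the tool's output
```

With `words=2` on the sample text, "command line" tops the list with a count of 2.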

Combined usage of these tools

Find common keyphrases on sites linked from a page (uses a temporary file to store the URLs):

webanchors -c -o -a "Google" http://johnmu.com >temp.txt
webkeywords -c -w 3 temp.txt

Check result codes of all URLs linked from a page:

weblinks -c http://johnmu.com >temp.txt
webresult temp.txt >links.tsv

Compare result codes for multiple accesses:

echo. >results.tsv
for /L %i IN (1,1,100) DO webresult http://johnmu.com/ >>results.tsv

or, more elaborately, to test for a hack that is triggered by the referrer (all on one line):

for /L %i IN (1,1,100) DO webresult -u "Mozilla/5.0 (Windows; U) Gecko/20070725 Firefox/2.0.0.6" -r http://www.google.com/search?q=johnmu http://johnmu.com/ >>results.tsv

I’d love to hear about your usage of these tools :) .


There are 11 comments to this post.
  1. The download link appears to be broken. WebResult looks like just what I was looking for, so I’d really like to try it out.

  2. Oh no it’s gone! Still giving this one away? I’d love to get my hands on it.

  3. Link fixed. Thanks :)

  4. Cool little app John. I have a question for you.

    Let’s say I have a site and want to load a list of a dozen or so urls into WebAnchors to see which page of my site they’re linking to and with what anchor text. Is there a way to get the results to prepend the originating site info in my results.txt file?

    For instance, if I use WebAnchors -d MySite.com urls.txt>results.txt I get output like:

    http://www.MySite,com/somepage.php Anchor Text

    What I’m trying to do is tie these back to the originating site to get output something like

    http://www.OriginatingSite,com/linkingpage.html http://www.MySite,com/somepage.php Anchor Text

    Is there a way to append or prepend this originating site data via a trigger I'm not seeing?

  5. That’s a good idea, Randy. I’ll see what I can do :-)

  6. Woot! It’s nice to have a good idea every now and then. ;-)

  7. Did originating site data prove to be something that cannot be done for now John? Or did you simply get too busy with real work to tinker lately like I have?

    Just wanted to know if I should keep checking back on this post.

    Randy

  8. Hi John, this might be a dumb question but would you ever consider providing the source for these tools?

  9. Nice tools, useful as starting point to realize an application.

  10. Hi

    Your tool is great, but it does not handle https URIs.
    webresult --headers https://opera.intranatixis.com
    http://https://opera.intranatixis.com

    It could be great if it could :-)

  11. Hi,

    How can I use ‘WebResult’ for HTTPS sites?

    Cheers
    Andre’
