Crawl a website to get a list of all URLs

Sometimes you just need a list of URLs for tracking. There are loud & fancy tools to crawl your site, but you can also just use wget: wget --mirror --delete-after --no-directories 2>&1 | grep '^--' | awk '{print $3}' | sort >urls.txt or, if you just want the URL paths (just /path/filename.htm instead of the full URL): wget --mirror --delete-after --no-directories 2>&1 | grep '^--' | awk '{print $3}' | sort | sed 's/^.
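A minimal sketch of the post-processing half of that pipeline, wrapped in two small shell functions so it can be tried on a saved wget log without crawling anything. The function names extract_urls and strip_host, the example.com address, and the use of sort -u (to drop repeated URLs) are assumptions for illustration and not part of the original command. It also assumes wget log lines of the form "--<date> <time>--  <URL>", where the URL is the third field; other wget versions may lay the line out differently.

```shell
#!/bin/sh

# Keep only wget's request lines (they start with "--") and print the
# third whitespace-separated field, which is assumed to be the URL.
extract_urls() {
    grep '^--' | awk '{print $3}' | sort -u
}

# Reduce absolute URLs to their path by dropping the scheme and host.
# (A stand-in for the sed step; the original expression is not reproduced here.)
strip_host() {
    sed -E 's#^[a-z]+://[^/]+##'
}

# Hypothetical usage; example.com is a placeholder:
# wget --mirror --delete-after --no-directories http://example.com/ 2>&1 \
#     | extract_urls | strip_host > paths.txt
```

Keeping the two steps separate means you can stop after extract_urls when you want full URLs, or add strip_host when you only want paths.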

MSN/Live and Yahoo! join Google for Sitemaps XML support

Now it’s official: the top three search engines support the sitemaps format. Great going, Vanessa and the sitemaps team!! You’ve done great work since summer 2005; it’s come a long way. A new standard after a bit more than a year, congratulations! Google: Search engines united. MSN/Live: Microsoft, Google, Yahoo! Unite to Support Sitemaps. Yahoo!: Yahoo, Google and Microsoft join forces (really!!) behind Sitemaps. So, .

Google and their Sitemap

The Google Sitemap system just turned one year old! The “Big Daddy” infrastructure and the “Crawl Caching Proxy” look like they were made to be a perfect match for Google Sitemaps (but it is more likely the other way around). In theory, Google Sitemaps can tell the Google crawlers more about a website, even without having to crawl it. The attributes in a sitemap file (such as the last-modified date and change frequency) can be used to help the proxy determine when actual accesses are necessary, keeping bandwidth use on all sides to a minimum.

Results from our Sitemaps study: Site D

Site D is a normal website, with a little startup funding in the form of deep links from several external sites. It does not use Google Sitemaps, nor anything else special. There were 4 links, one to each of the 4 levels, in different parts of the site. The site structure is strictly top-down, with links from each parent to about 10 children and a link to the main URL. There are no cross-links and no links from the children to the parent (just to the main URL).

Getting indexed by Google with Google Sitemaps - in what time

How long do you think it takes until a new webpage gets indexed by Google? You might say it depends, and you’d be right. But you can help get your webpages indexed better. One approach is to use Google Sitemaps and give Google the URLs to add. People say it takes very long until new webpages appear in the SERPs. This article describes an example of adding a new article to enarion.

Results from our Sitemaps study: Site B

Site B is a mixture of Site A (Google Sitemaps, no links) and Site C (Adsense, GoogleBar, no Sitemaps, no links). Site B uses Google Sitemaps along with Adsense blocks, and is visited regularly by a virtual visitor using Microsoft Internet Explorer with the GoogleBar plug-in. Seeing that neither Site A nor Site C was indexed properly by Google, we can only assume that Site B will also not be indexed.

Results from our Sitemaps study: Site A

People who are new to the web and want to start a website usually just put it online and hope that visitors come. With Google Sitemaps the webmaster has a way to let Google know about his site and to try to help Google find all of the pages. I’ll just go through the other sites in the order we had them: Site A now, next Site B, then Site D (we already covered Site C) and finally Site E.

First result of our sitemaps study

Today we’re going to look at one of our sites and see some of the first results from the test sites Tobias Kluge and I started. We’re going to look at our Site C, which was set up with the same general content as the other sites, and promoted to Google using only Adsense and a simulated user clicking on the site in Internet Explorer with the GoogleBar installed.

To (Google) sitemap or not to sitemap, that is the question

There are lots of ways to get indexed by Google. Using Google Sitemaps is only one way - the way that seems to be a bit trendy at the moment. “In the beginning” (June / July 2005), when Google had first introduced Google Sitemaps, it was a sure-fire way to get indexed within hours. It really worked. I bet it not only worked for us, but for lots of spammer sites, so Google had to button it down a bit.