Google Sitemaps How to Part 1: What is It?

Google invented the Google Sitemap Protocol (GSP) to help webmasters and publishers get their sites indexed more accurately by the biggest search engine on the planet. This article is part of a series of articles intended to show you how to create Google XML Sitemaps (some would call it an RSS Sitemap) for free.

What exactly is Google Sitemaps? According to Google:

Google Sitemaps is an experiment in web crawling. Using Sitemaps to inform and direct our crawlers, we hope to expand our coverage of the web and speed up the discovery and addition of pages to our index. By placing a Sitemap-formatted file on your web server, you enable our crawlers to find out what pages are present and which have recently changed, and to crawl your site accordingly.

If the Google spider already visits your web site on a regular basis, you can ignore this article. But if your site is content rich, with quality content or dynamic web pages that change often, and Google fails to crawl them fully, you may want to try Google Sitemaps.

Google Sitemaps uses the Sitemap Protocol, an XML-based format for summarizing information for Google’s web crawler. In this XML file, you can add information such as the last modified date of each page and its approximate change frequency.
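
To make the XML format more concrete, here is a minimal Python sketch that builds such a file with a last modified date and a change frequency for each page. The example.com URLs, the dates, and the build_sitemap helper are made up for illustration. The namespace is the one from the later sitemaps.org version of the protocol, which is an assumption on my part, so check the protocol documentation for the value Google expects.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace taken from the sitemaps.org version of the protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return a Sitemap XML string for a list of page dictionaries.

    Each dictionary needs a "loc" key and may carry the optional
    "lastmod" and "changefreq" keys.
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        # Optional hints: last modified date and approximate change frequency.
        for key in ("lastmod", "changefreq"):
            if key in page:
                ET.SubElement(url, key).text = page[key]
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, used only to show the shape of the input.
pages = [
    {"loc": "http://www.example.com/", "lastmod": "2005-07-01", "changefreq": "daily"},
    {"loc": "http://www.example.com/about.html", "changefreq": "monthly"},
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Only the page address is required in this sketch. The date and frequency are hints for the crawler, not commands.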

Many marketers hype up the technology by saying this method ensures the Google spider crawls your web pages fully and quickly. They are simply wrong. It’s true that Google created this for a purpose, but as stated in the Frequently Asked Questions (FAQ), Google doesn’t by any means guarantee that all of your pages will be crawled once you put them into your Sitemap.

Of course, it is absolutely worth the effort to try this. It’s been working for some publishers and it might work for you too. Moreover, there’s nothing to lose, since both personal and commercial use are free anyway.

Basically there are three steps required to create and maintain Google Sitemaps:

  1. Creating a Sitemap in a supported format
  2. Submitting the Sitemap file to Google
  3. Updating the Sitemap file when your site changes

Besides the Sitemap Protocol XML file format, Google supports three other formats:

  • OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) - This is an application-independent interoperability framework based on metadata harvesting. If your site doesn’t use this protocol, you can forget about it.
  • Syndication feed - If your site uses weblog software like WordPress or MovableType for content management, Google accepts RSS (Really Simple Syndication) 2.0 and Atom 0.3 feeds. The drawback of this format is that you can list only a few URLs, namely the most recent web pages (posts).
  • Text file - List one URL per line and save the list as a plain text file. This format is acceptable, although not recommended. Google seems to give lower priority to text-based Sitemaps than to XML-formatted Sitemaps. You can generate a proper XML Sitemap from a text sitemap file by using Sitemap Generator.
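
As a rough illustration of the text file format, the Python sketch below reads a list with one URL per line and wraps each URL in a bare XML entry. This is not how Sitemap Generator works internally. The text_sitemap_to_xml function and the example.com URLs are hypothetical, and the namespace is assumed from the later sitemaps.org version of the protocol.

```python
from xml.sax.saxutils import escape

# Assumption: namespace taken from the sitemaps.org version of the protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def text_sitemap_to_xml(text):
    """Convert a one-URL-per-line text sitemap into an XML Sitemap string."""
    # Skip blank lines and strip stray whitespace around each URL.
    urls = [line.strip() for line in text.splitlines() if line.strip()]
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<urlset xmlns="{SITEMAP_NS}">',
    ]
    for url in urls:
        # escape() turns characters such as & into XML entities.
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines)


# Hypothetical text sitemap, one URL per line.
text_sitemap = """http://www.example.com/

http://www.example.com/page?a=1&b=2
"""

print(text_sitemap_to_xml(text_sitemap))
```

The escaping step matters for dynamic pages, because an unescaped & in a URL would make the XML file invalid.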

That’s it for now. I hope you now have a basic understanding of this new technology. In the next part, I’ll walk you through the process of manually creating a valid XML Sitemap file using a text editor. If you have a small site with fewer than a hundred web pages, this comes in handy, with no need to install any program or script on your web server.

Google Sitemaps (BETA) FAQ.
Google Sitemap Protocol.

Wayback Machine Sued for Copyright Infringement

The New York Times is reporting that the Internet Archive is being sued. Many of us use its Wayback Machine for many reasons. Everyone can see how a web site evolves over time.

After checking with this free service, I also learned that my Marketing Loop domain was an expired domain previously owned by other marketers.

The Internet Archive was created in 1996 as the institutional memory of the online world, storing snapshots of ever-changing Web sites and collecting other multimedia artifacts. Now the nonprofit archive is on the defensive in a legal case that represents a strange turn in the debate over copyrights in the digital age.

One law firm realized that it is also used as a legal tool to turn up old web pages that can be used in legal cases. To prevent this from happening, they decided to bring the Wayback Machine into a lawsuit.

This lawsuit has an incredible number of implications; to name a few:

  1. The fun aspect of bringing up old versions of Web sites
  2. A sure-fire way to prove copyright infringement
  3. Many search engines have “caching” functionality

Via Search Engine Journal, read the original article at New York Times.

Small Business and Blog Growth Increase Domain Registration

According to Netcraft, June 2005 saw the second-largest monthly increase in the history of hostname registrations, at 2.76 million. The larger gain was a 3.3 million hostname increase in March 2003, which ended months of stagnation and kicked off 30 consecutive months of positive growth for the Web.

Factors in the dramatic growth include:

  • Increasing use of the Internet by small businesses as web sites and online storefronts become more affordable.
  • The explosive growth of weblogs, a growing number of which are purchasing domains for branding purposes.
  • Speculation in the market for domain names, buoyed by rising resale prices and the ability to generate revenue via pay-per-click advertising on parked domains.
  • Strong sales of online advertising, especially keyword-based contextual ads that support business models for both domain parking and commercial weblogs.

Pay special attention to the last point. Content publishers jump on the opportunity to generate revenue through content by leveraging keyword-based contextual ads like Google AdSense.

Read Netcraft’s July 2005 Web Server Survey.

Google Faces Legal Action Over Click Fraud

Google is facing legal action over allegations that it failed to protect users of its AdSense advertising technology from click fraud, Simon Aughton reported for PC Pro UK.

Click Defense, a retailer of online marketing tools, claims that it has lost at least $5mn to click fraud, an industry term for the practice of deliberately clicking on Web advertisements to artificially inflate the figure for the number of times an ad is clicked. Because Google charges advertisers on a per-click basis, the practice can prove very costly. In some cases companies have even employed people or computers to click on rivals’ ads.

Advertisers pay a set amount, which can in extreme cases be as high as $95 per click. Their ads are run alongside standard Google search results, when the search keywords match those that the advertiser has allocated to their ad.

Google relies almost exclusively on ad revenue and rejected Click Defense’s claim.

Google and its main competitor won’t disclose the extent of click fraud, but it is estimated to be 20% of total ad clicks. Clearly click fraud poses a major concern for both advertisers and publishers, and of course for Google.

Read the full news at PC Pro.

MSN Switches to RankNet System, Adds New Search Commands

MSN has introduced RankNet, a technology based on neural nets, into its ranking system, and claims it focuses on improving the relevance of web results.

Danny pointed out an MSN research paper that Gary found about the neural net technology.

As explained in MSN Search’s Weblog:

The ranker we released in February served us well, but had some flaws that we weren’t happy about. In collaboration with Chris Burges and other friends from Microsoft Research, we now have a brand new ranker. The new ranker has improved our relevance and perhaps most importantly gives us a platform we think we can move forward on quicker than before.

At the same time, MSN also added new commands to help users and SEOs search:

  • inanchor: - search all the text in anchors.
  • filetype: - search by filetype
  • inurl: - find pages that have text within the URL
  • intitle: - search the title of sites
  • linkdomain: - replaces link: and does a pretty good job of finding all incoming links to the site
  • contains: - find pages with links to documents of a particular type

The SEW Forums have an interesting discussion about results from this new technology and what people are seeing.

Via Search Engine Roundtable, link to MSN Search’s weblog entry.