Google Mini Remove URL from index

One of my ASP.NET ecommerce applications uses URL rewriting for product pages. For example:

Item Sku Number: 473-151
Product Description: Bright Products, Black box converter
Website URL: http://www.mydomain.com/Products/473-151- Bright Products, Black box converter

Note: The text after the SKU is irrelevant. The ASP.NET engine disregards it, and only the 473-151 is used to find the page.
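
The SKU-only lookup can be sketched as follows. This is a minimal illustration, not the site's actual rewriting code: the module name SkuParser, the function ExtractSku and the assumption that a SKU is always digits, a hyphen, then more digits are all hypothetical.

    Imports System
    Imports System.Text.RegularExpressions

    Module SkuParser
        ' Assumed SKU shape: digits, a hyphen, digits (e.g. 473-151).
        ' Anything after the SKU in the path is ignored.
        Private ReadOnly SkuPattern As New Regex("/Products/(\d+-\d+)", RegexOptions.IgnoreCase)

        ' Returns the SKU found in the path, or Nothing if there is none.
        Function ExtractSku(ByVal path As String) As String
            Dim m As Match = SkuPattern.Match(path)
            If m.Success Then Return m.Groups(1).Value
            Return Nothing
        End Function

        Sub Main()
            ' Both descriptions resolve to the same SKU, so both URLs find the same page.
            Console.WriteLine(ExtractSku("/Products/473-151- Bright Products, Black box converter"))
            Console.WriteLine(ExtractSku("/Products/473-151- Mighty Products, Black box converter"))
        End Sub
    End Module

Because the description plays no part in the lookup, any number of different URLs can lead to the same product page, which is what the rest of this post has to deal with.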

Google Mini

We use a Google Mini to index the site and provide search results to site users. Because a product page has, over time, been reachable through several different URLs, and to help keep the page count in the box's results down, we use the canonical link tag in the page header to declare the definitive URL for the page.

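Canonical tags are supported by all the search engines of importance. The tag looks like this: <link rel="canonical" href="http://www.mydomain.com/Products/473-151- Bright Products, Black box converter" />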

Change of description

Recently a supplier complained that the product description in the URL for an item was wrong, although the description on the page itself was correct. Investigation showed that the item description had been changed because the supplier had rebranded, see below.

Item Sku Number: 473-151
Product Description: Mighty Products, Black box converter
Website URL: http://www.mydomain.com/Products/473-151- Bright Products, Black box converter

This meant that when the item was searched for in the Google Mini, it found “Black box converter” but returned the incorrect URL shown above. The URL should have been as follows:

Website URL: http://www.mydomain.com/Products/473-151- Mighty Products, Black box converter

What's wrong?

So what is wrong? It turns out that the Google Mini still has the old URL in its index, and that the page is very persistent about staying there. The box happily recrawls it each night.

It seems “the index” is a list of pages the Google Mini has found at some time in the past. A page can have been “unlinked” from the site, with no inbound links pointing to it, and it will still persist in the index and therefore in the results.

The only ways to remove a page from the Google Mini index are described in the document Administering Crawl for Web and File Share Content: Introduction, which states that a document is removed when:

  • The license limit is exceeded
  • The crawl pattern is changed
  • The robots.txt file is changed
  • Document is not found (404)

These are the only ways that a page will be removed. In this scenario the old URL still returns a valid page, because it contains the same item SKU number, so it keeps being indexed under the wrong URL, potentially forever!

It is also worth noting that if your problem is with re-indexing the content of a page, rather than its URL, you should check the “Last-Modified” header that the web server returns in the response. This is particularly an issue with dynamic pages; static pages are normally handled correctly, because the header comes from the last modified date of the file on the site's file system. You can study the headers a page returns by using a developer tool bar (now built into IE8).
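
If a tool bar is not to hand, the same header can be read with a few lines of code. The console program below is only a sketch: the module name LastModifiedCheck is made up, and the URL is the example product address used in this post, so substitute a real page before running it.

    Imports System
    Imports System.Net

    Module LastModifiedCheck
        Sub Main()
            ' Placeholder URL based on the example in this post; replace with a real page.
            Dim request As HttpWebRequest = CType(WebRequest.Create("http://www.mydomain.com/Products/473-151"), HttpWebRequest)
            Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
                ' The header lookup returns Nothing when the server did not send the header.
                Dim lastModified As String = response.Headers("Last-Modified")
                If lastModified Is Nothing Then
                    Console.WriteLine("No Last-Modified header returned")
                Else
                    Console.WriteLine("Last-Modified: " & lastModified)
                End If
            End Using
        End Sub
    End Module

A dynamic page that reports no header is worth a closer look if the box seems slow to pick up content changes.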

Solution attempt 1

Aha, I thought, I know how to tackle this. The old URL no longer exists, as it has been superseded by the new one, so the ASP.NET site should be issuing a Response.Status of “301 Moved Permanently”. That should force the Google Mini to index the new page, register its URL and, presumably, drop the old URL from the index.

A couple of lines and the problem was solved, I thought.


    If Not officialUriForPage.PathAndQuery.EndsWith(Request.RawUrl) Then
      Response.Clear()
      Response.Status = "301 Moved Permanently"
      Response.AddHeader("Location", utility.GetPublicProductURL( _
                Me.ProductDetails.ProductId, Me.ProductDetails.ItemDescription))
      Response.End()
    End If

So now the old URL issues a “301 Moved Permanently” response to both the browser and the Google Mini, and the box should go and index the new page and drop the old URL. However, it doesn't work that way.

Solution attempt 2

After the overnight index, solution 1 turned out to be a failure. Reading the documentation again, it turns out the Google Mini is being helpful and returning both URLs, the new one and the moved one, for any search that has a hit inside the content of the new URL. It seems that the four methods of removal noted earlier really are the only ways to remove a page from the index.

Action

I could put the URL I wanted to remove from the index into the “Don’t Crawl URLS” box of the crawl pattern definition in the Google Crawl admin pages. After 15 minutes to six hours, this would cause the Google Mini to re-examine the index, realise the page should no longer be there and remove it. This would happen under the criterion “The crawl pattern is changed”, item two in the list of removal conditions above. I could then remove the don’t crawl URL from the Google box again, so that I don’t forget it and accidentally block a future replacement URL. This would work for a few pages, but we have about 15,000 products online, so we need something better.

Instead I went for the last option in the list, “If the search appliance receives a 404 (Document not found) error from the Web server when attempting to fetch a document, the document is removed from the index.”.

Hence I changed the code sample above to send the request to our generic 404 not found page rather than issuing the moved redirect. Check that the 404 page responds with a 404 status code in the header, or the Google Mini will not see the 404 status. However, I only want this to happen for the Google Mini, not for end users. An end user just wants to be redirected to the right URL; a 404 not found is rude and would lose sales, as users would assume the item no longer exists. Luckily the Google box sends a configurable user_agent variable in its requests, so we can behave differently towards it.

    'Read the user agent once; it can be missing on some requests
    Dim userAgent As String = Request.UserAgent
    If userAgent IsNot Nothing AndAlso userAgent.Contains("gsa-crawler") Then
        'Not found for Google Mini
        Response.Clear()
        Response.Status = "404 Not Found"
        'Location is not acted on for a 404; the status code is what the crawler sees
        Response.AddHeader("Location", "/ErrorPages/404.aspx")
        Response.End()
    Else
        'Perm redirect
        Response.Clear()
        Response.Status = "301 Moved Permanently"
        Response.AddHeader("Location", common.utility.GetPublicProductURL( _
                    Me.ProductDetails.ProductId, Me.ProductDetails.ItemDescription))
        Response.End()
    End If
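
Rather than waiting for the overnight crawl, the behaviour can be checked by requesting the old URL twice, once pretending to be the crawler and once as an ordinary browser. The console program below is a sketch under a few assumptions: the module and function names are made up, “gsa-crawler” stands in for whatever user agent string the box is configured to send, and the URL is the example address from this post.

    Imports System
    Imports System.Net

    Module CrawlerStatusCheck
        ' Returns the HTTP status code the given user agent receives,
        ' without following any redirect.
        Function GetStatusCode(ByVal url As String, ByVal userAgent As String) As Integer
            Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
            request.UserAgent = userAgent
            request.AllowAutoRedirect = False
            Try
                Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
                    Return CInt(response.StatusCode)
                End Using
            Catch ex As WebException
                ' 4xx and 5xx replies surface as a WebException that still carries the response.
                Dim errorResponse As HttpWebResponse = TryCast(ex.Response, HttpWebResponse)
                If errorResponse Is Nothing Then Throw
                Dim code As Integer = CInt(errorResponse.StatusCode)
                errorResponse.Close()
                Return code
            End Try
        End Function

        Sub Main()
            ' Placeholder: the superseded URL from the example above.
            Dim oldUrl As String = "http://www.mydomain.com/Products/473-151- Bright Products, Black box converter"
            Console.WriteLine("As crawler: " & GetStatusCode(oldUrl, "gsa-crawler").ToString())
            Console.WriteLine("As browser: " & GetStatusCode(oldUrl, "Mozilla/5.0").ToString())
        End Sub
    End Module

If the page code is doing its job, the first request should report a 404 and the second a 301. Turning off automatic redirects matters here, because otherwise the browser-style request would follow the 301 and report the status of the new page instead.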

I hope the problem is now resolved.