Google Mini Remove URL from index

by Tim 27. August 2009 17:37

One of my ASP.NET ecommerce applications uses URL rewriting for product pages. For example:

Item Sku Number: 473-151
Product Description: Bright Products, Black box converter
Website URL: http://www.mydomain.com/Products/473-151- Bright Products, Black box converter

Note: The text after the SKU item number is irrelevant. The ASP.NET engine disregards it; only the 473-151 is used to find the page.
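As a rough illustration of why the description text does not matter, here is a minimal sketch of pulling the SKU out of a request path. It is not the site's actual rewriting code: the ExtractSku name and the pattern (digits, a dash, digits, directly after /Products/) are assumptions based on the example SKU above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SkuFromPathSketch
    ' Assumed pattern: "/Products/" followed by a SKU of digits-dash-digits.
    Private ReadOnly SkuPattern As New Regex("^/Products/(\d+-\d+)", RegexOptions.IgnoreCase)

    ' Returns the SKU at the start of the product path, or Nothing if there is none.
    Function ExtractSku(ByVal path As String) As String
        If String.IsNullOrEmpty(path) Then Return Nothing
        Dim m As Match = SkuPattern.Match(path)
        If m.Success Then Return m.Groups(1).Value
        Return Nothing
    End Function

    Sub Main()
        ' Two different descriptions, same SKU, so both URLs find the same product page.
        Console.WriteLine(ExtractSku("/Products/473-151- Bright Products, Black box converter"))
        Console.WriteLine(ExtractSku("/Products/473-151- Mighty Products, Black box converter"))
    End Sub
End Module
```

Because everything after the SKU is ignored, any number of different URLs can resolve to the same product.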

Google Mini

We use a Google Mini to index the site and provide search results to site users. In the past a product page could be reached by a number of different routes, each with a different URL. To help keep the page count down in the results from the box, we use the canonical link tag in the page header to state the definitive URL for the page.

Canonical link tags are supported by all the major search engines. The tag looks like this:
<link rel="canonical" href="http://www.mydomain.com/Products/473-151- Bright Products, Black box converter" />
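The href value needs to be quoted, and spaces and commas in the description are safer when escaped. Below is a minimal sketch of building the tag. BuildCanonicalLink and its parameters are hypothetical names, not the site's real helper, and System.Net.WebUtility assumes .NET 4 or later.

```vbnet
Imports System
Imports System.Net

Module CanonicalLinkSketch
    ' Builds a canonical link tag for a product page.
    ' baseUrl, sku and description are hypothetical parameters for illustration.
    Function BuildCanonicalLink(ByVal baseUrl As String, ByVal sku As String, ByVal description As String) As String
        ' Escape the path segment so spaces and commas are safe inside a URL.
        Dim segment As String = Uri.EscapeDataString(sku & "- " & description)
        Dim href As String = baseUrl.TrimEnd("/"c) & "/Products/" & segment
        ' HTML-encode the value and wrap it in quotes.
        Return "<link rel=""canonical"" href=""" & WebUtility.HtmlEncode(href) & """ />"
    End Function

    Sub Main()
        Console.WriteLine(BuildCanonicalLink("http://www.mydomain.com", "473-151", "Bright Products, Black box converter"))
    End Sub
End Module
```

Whether the escaped form round-trips through your own URL rewriting is something to check before relying on it.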

Change of description

Recently a supplier complained that the product description in the URL for an item was wrong, although it was correct on the page. After investigation we found that the item description had been changed because the supplier had rebranded, see below.

Item Sku Number: 473-151
Product Description: Mighty Products, Black box converter
Website URL: http://www.mydomain.com/Products/473-151- Bright Products, Black box converter

This meant that when the item was searched for in the Google Mini, it found “Black box converter” but showed the incorrect URL above. It should have shown the following URL:

Website URL: http://www.mydomain.com/Products/473-151- Mighty Products, Black box converter

What's wrong?

So what is wrong? It turns out that the Google Mini still has the old URL in its index, and that a page is very persistent at staying there. The box happily crawls it each night.

It seems “the index” is a list of pages the Google Mini has found at some time in the past. In fact a page can now have been “unlinked” from the site, having no inbound links to it, but it will still persist in the index and thus results.

The only ways a page is removed from the Google Mini index are listed in the document Administering Crawl for Web and File Share Content: Introduction, which states that a document is removed when:

  • The license limit is exceeded
  • The crawl pattern is changed
  • The robots.txt file is changed
  • Document is not found (404)

These are the only ways that a page will be removed. In this scenario the old URL still returns a valid page, because it carries the same item SKU number, so it keeps being indexed under the wrong URL, potentially forever!

It is also worth noting that if your problem is with re-indexing the content of a page rather than its URL, you should check the “Last-Modified” header returned by the web server in the page response. This is particularly an issue for dynamic pages; static pages are normally dealt with appropriately using the last modified date of the file on the site's file system. You can study the headers for a page using a developer toolbar (now built into IE8).
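For a dynamic page you have to supply that date yourself. The sketch below only shows the value the header should carry; ToLastModifiedValue and the sample date are made up for illustration. In an ASP.NET page the usual route is Response.Cache.SetLastModified, noted in the comment.

```vbnet
Imports System
Imports System.Globalization

Module LastModifiedSketch
    ' Formats a timestamp as an HTTP date (RFC 1123, always GMT).
    ' The "R" format does not convert to UTC by itself, hence ToUniversalTime.
    Function ToLastModifiedValue(ByVal changed As DateTime) As String
        Return changed.ToUniversalTime().ToString("R", CultureInfo.InvariantCulture)
    End Function

    Sub Main()
        ' Hypothetical "last changed" time of a product record.
        Dim changed As New DateTime(2009, 8, 27, 17, 37, 0, DateTimeKind.Utc)
        Console.WriteLine("Last-Modified: " & ToLastModifiedValue(changed))
        ' Inside an ASP.NET page the equivalent would be:
        '   Response.Cache.SetLastModified(changed)
    End Sub
End Module
```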

Solution attempt 1

Aha, I thought, I know how to tackle this. The old URL no longer exists, as it has been superseded by the new one, so the ASP.NET site should issue a Response.Status = “301 Moved Permanently”. That should prompt the Google Mini to index the new page, register that URL and presumably drop the old URL from the index.

A couple of lines of code and the problem was solved, I thought.

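'Redirect when the requested URL is not the current official URL for the product
If Not officialUriForPage.PathAndQuery.EndsWith(Request.RawUrl) Then
  Response.Clear()
  Response.Status = "301 Moved Permanently"
  'Location carries the URL built from the current item description
  Response.AddHeader("Location", utility.GetPublicProductURL( _
            Me.ProductDetails.ProductId, Me.ProductDetails.ItemDescription))
  Response.End()
End If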

 

So now the old URL issues a “301 Moved Permanently” response to the browser and to the Google Mini, which should then index the new page and drop the old URL. However, it doesn't work that way.

Solution attempt 2

After the overnight index, solution 1 turned out to be a failure. Reading the documentation again, it turns out the Google Mini is being helpful and returning both URLs, the new one and the moved one, for any search that has a hit inside the new URL's content. It seems that the four conditions for removal noted earlier really are the only ways to remove a page from the index.

Action

I could put the URL I wanted to remove from the index into the “Don’t Crawl URLS” box of the crawl pattern definition in the Google Crawl admin pages. After 15 minutes to six hours, the Google Mini would re-examine the index, realise this page should no longer be there and remove it. This falls under “The crawl pattern is changed”, item two in the list of conditions for removal given earlier. I would then have to remove the don’t crawl URL from the Google box again, so that I don’t forget it and accidentally block a future replacement URL. This would work for a few pages, but we have about 15,000 products online, so I need something better.

Instead I went for the last option in the list, “If the search appliance receives a 404 (Document not found) error from the Web server when attempting to fetch a document, the document is removed from the index.”.

Hence I changed the code sample above to send the request to our generic 404 not found page rather than issuing the moved redirect. Check that the 404 page really responds with a 404 status code in the header, or the Google Mini will not see the 404 status. However, I only want this to happen for the Google Mini, not for end users. An end user just wants to be redirected to the right URL; a 404 not found is rude and would lose sales, as users would assume the item no longer exists. Luckily the Google box sends a configurable User-Agent header in its requests, so we can respond to it differently.

'Read the user agent once; it is Nothing when the header is missing
Dim userAgent As String = System.Web.HttpContext.Current.Request. _
         ServerVariables("HTTP_USER_AGENT")
If userAgent IsNot Nothing AndAlso userAgent.Contains("gsa-crawler") Then
    'Not found for Google Mini, so the old URL drops out of the index
    Response.Clear()
    Response.Status = "404 Not Found"
    'Clients do not follow Location on a 404; the status code is what the Mini acts on
    Response.AddHeader("Location", "/ErrorPages/404.aspx")
    Response.End()
Else
    'Permanent redirect for everyone else
    Response.Clear()
    Response.Status = "301 Moved Permanently"
    Response.AddHeader("Location", common.utility.GetPublicProductURL( _
                Me.ProductDetails.ProductId, Me.ProductDetails.ItemDescription))
    Response.End()
End If
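The decision itself can be pulled out into a small function that is easy to test away from the page. This is only a sketch of the idea: ChooseStatus and isOfficialUrl are names invented here, and the "gsa-crawler" match assumes the appliance's user agent has been left at a value containing that text.

```vbnet
Imports System

Module ResponseChoiceSketch
    ' Returns the HTTP status code to send for a product page request.
    ' isOfficialUrl is True when the requested URL already matches the current product URL.
    Function ChooseStatus(ByVal userAgent As String, ByVal isOfficialUrl As Boolean) As Integer
        If isOfficialUrl Then Return 200
        Dim isCrawler As Boolean = userAgent IsNot Nothing AndAlso _
            userAgent.IndexOf("gsa-crawler", StringComparison.OrdinalIgnoreCase) >= 0
        ' Stale URL: 404 removes it from the Mini's index, 301 keeps shoppers happy.
        If isCrawler Then Return 404
        Return 301
    End Function

    Sub Main()
        Console.WriteLine(ChooseStatus("gsa-crawler", False))   ' crawler on a stale URL
        Console.WriteLine(ChooseStatus("Mozilla/5.0", False))   ' browser on a stale URL
        Console.WriteLine(ChooseStatus("gsa-crawler", True))    ' crawler on the current URL
    End Sub
End Module
```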

I hope the problem is now resolved.


Tags:

ASP.NET | Google Mini

Google Mini excluding ASP.NET page fragments

by Tim 17. February 2009 11:55

You have configured your Google Mini and got it integrated with your site. What you find now is that your results are being skewed by irrelevant content on your site. This is what I’ve just found.

Exclude unwanted page sections

The result set was upset by the “customers who bought this also bought…” section and by the site page header and footer. This turned out to be very simple to resolve. There are HTML comment tags that can be used to stop parts of the page from being indexed. They are defined in the document Excluding Unwanted Text from the Index.
Here are the examples pulled from that documentation for brevity:

<!--googleoff: anchor--><A href=sharks_rugby.html>shark </A> <!--googleon: anchor-->
<!--googleoff: snippet-->Come to the fair!<!--googleon: snippet-->
<!--googleoff: all-->Come to the fair!<!--googleon: all-->

You surround the control or section of the page that you do not want to take part in the results with one of the HTML comment tag pairs shown above. This does not affect the rendering of your page, but it does mean something to the Google search appliance.

index: the words between the tags are ignored by Google; they are treated as if they don’t occur on the page at all.

anchor: the text of an HTML anchor tag linking to another page will not cause that destination page to appear as a result because of the link on this page.

snippet: the search result will not use the text between the tags in the auto-generated snippet that is included in the results.

all: turns on all the attributes. Text between the tags is not indexed, not associated with a linked-to page, and not used for a snippet.
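If the same wrapping has to be applied in several controls, a tiny helper keeps the tag pairs matched. This is just a sketch; WrapGoogleOff is a name invented here, and it does nothing more than string concatenation around markup you already have.

```vbnet
Imports System

Module GoogleOffSketch
    ' Wraps markup in a matching googleoff/googleon comment pair.
    ' scope must be one of the four attributes described above.
    Function WrapGoogleOff(ByVal markup As String, ByVal scope As String) As String
        Select Case scope
            Case "index", "anchor", "snippet", "all"
                Return "<!--googleoff: " & scope & "-->" & markup & "<!--googleon: " & scope & "-->"
            Case Else
                Throw New ArgumentException("Unknown googleoff attribute: " & scope)
        End Select
    End Function

    Sub Main()
        Console.WriteLine(WrapGoogleOff("<div>Customers who bought this also bought...</div>", "index"))
    End Sub
End Module
```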

To solve my problem googleoff was applied to:

  • “Customers who bought this bought” control reference
  • Product category breadcrumb on the product pages
  • Master page header and footers

As a result, a search for “contact us” no longer returns every page in the site, as it used to because the site master pages link to it from every page. The snippets in the search results are also much more relevant, which gives much richer results.

Take care not to exclude too much of your content from Google, as you can’t predict what someone will search for on your site or why. Excluding too much content may stop them finding what they need.

Check the documentation for the other controls you have available to control the indexing of pages (the crawl).

Tags:

Google Mini

Google Mini images in results ASP.NET

by Tim 17. February 2009 11:31

Search for our ASP.NET site is provided by a Google Mini. We chose it for speed of set up, for users' familiarity with the result sets it generates, and because Microsoft Search had not been released when the decision was made. The Google Mini is a 1U rack mount server supplied by Google. It provides a web browser administrator interface and lets you integrate it into your site in a few different ways. We fire requests at it and get an XML results file back, which we then manipulate to produce a good search experience. We use GSALib, a port of the gsa-japi Java library.

Now that we have it up and running, it is time to start tinkering a bit to try and get better results. The first challenge was getting product image thumbnails shown next to the search results from the Google Mini. This was actually very simple to achieve. The Mini understands meta tags in your HTML. If you put the following into the <head></head> section of the product pages:
<meta name="tw-prodimg" content="48350_01.jpg" />
Then when the Mini indexes your pages, this tag, together with any others present, will be stored as a collection against that page’s result. If you wish to consume the meta tags, you must explicitly ask for them to be included in the XML results from the search appliance.

Dim objQuery As New GSA.Query
Dim collections(0) As String
collections(0) = _searchSiteCollection
objQuery.setSiteCollections(collections)
objQuery.setFrontend(_searchFrontEnd)
objQuery.setOutputFormat(GSALib.Constants.Output.XML_NO_DTD)
objQuery.setOutputEncoding(Constants.Encoding.UTF8)
objQuery.setAccess(Constants.Access.PUBLIC)
objQuery.setScrollAhead(CInt(_searchStartPageIndex))
objQuery.setMaxResults(_searchPageSize)
objQuery.setFilter(_searchFilter)
'Set FetchMetaFields to "*" to get all meta tags associated with each page result
Dim o As String() = {"*"}
objQuery.setFetchMetaFields(o)

 

As you can see, setFetchMetaFields has been passed “*”, which means return all the meta tags for each page result. If you prefer, you can pass a string array of just the meta tags you are interested in, to reduce the number of tags returned. You can also filter result sets by meta tag, but that is not part of this discussion.

Now the results will include <MT> XML nodes that contain the meta tags. These are exposed through the GSA library as a string collection hanging off the search page result they are associated with. We can now show the image on the page using data binding in our repeater control, thus:

<asp:HyperLink ID="sImgLnk" NavigateUrl='<%# Server.HtmlDecode(Eval("Url")) %>'
  runat="server" meta:resourcekey="HypImageResource2"></asp:HyperLink>

Protected Sub repeaterProductResults_ItemDataBound(ByVal sender As Object, ByVal e As System.Web.UI.WebControls.RepeaterItemEventArgs) Handles repeaterProductResults.ItemDataBound
    If (e.Item.ItemType = ListItemType.Item) OrElse (e.Item.ItemType = ListItemType.AlternatingItem) Then
        Dim SearchResult As GSA.Result = CType(e.Item.DataItem, GSA.Result)
        Dim imageLink As HyperLink = DirectCast(e.Item.FindControl("sImgLnk"), HyperLink)
        Dim thumbsPath As String = ConfigurationManager.AppSettings("PathToProductThumbs")
        'The key must match the name of the meta tag in the product page head
        If SearchResult.Metas.Contains("tw-prodimg") Then
            imageLink.ImageUrl = thumbsPath & SearchResult.Metas("tw-prodimg")
        Else
            'Fall back to a placeholder when the page has no image meta tag
            imageLink.ImageUrl = thumbsPath & "noimage.gif"
        End If
    End If
End Sub
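
If you are not using the GSA library, the same values can be read straight from the results XML. The sketch below assumes each result is an R element whose meta tags appear as MT child elements with N (name) and V (value) attributes, which is how I understand the appliance's XML output. The sample document is hand-written and trimmed, and ReadFirstResultMetas is a made-up name.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml.Linq

Module MetaTagSketch
    ' Reads the MT name/value pairs of the first result (R element) in a results document.
    ' Assumes MT elements carry N (name) and V (value) attributes.
    Function ReadFirstResultMetas(ByVal resultsXml As String) As Dictionary(Of String, String)
        Dim metas As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase)
        Dim doc As XDocument = XDocument.Parse(resultsXml)
        For Each result As XElement In doc.Descendants("R")
            For Each mt As XElement In result.Elements("MT")
                metas(mt.Attribute("N").Value) = mt.Attribute("V").Value
            Next
            Exit For ' first result only
        Next
        Return metas
    End Function

    Sub Main()
        ' Hand-written sample, trimmed to the elements used here.
        Dim sample As String = "<GSP><RES><R N=""1""><U>http://www.mydomain.com/Products/473-151</U>" & _
            "<MT N=""tw-prodimg"" V=""48350_01.jpg""/></R></RES></GSP>"
        Dim metas As Dictionary(Of String, String) = ReadFirstResultMetas(sample)
        Console.WriteLine(metas("tw-prodimg"))
    End Sub
End Module
```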

 

Thus you now have item images against the items. You can obviously expand this so that all your pages could have a “searchimage” meta tag so that news items or other content could all have individual thumbs.


Tags:

Google Mini



Disclaimer
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

© Copyright 2010 Dynamic Code Blocks, Tim Wappat