Jan 23 2011

Our first post of the New Year looks at the topic of measuring SEO indexation. Inspired by Rob’s post on Distilled, and a wish to revisit some of my previous work on the topic, I thought it might be interesting to share a method of collecting data to build a clearer understanding of page level indexation.

Seoul Design Expo

Image credit: Justin De La Ornellas

Hopefully by the end of this post you’ll have a few new methods to collect site index data for your own SEO studies.

Why do SEOs need an understanding of the principles of indexation?

How hard is your website working for you? Which pages and content groups yield the most benefit, traffic wise? Are there any weak spots, groups of pages that don’t seem to be working well? How can you make changes to navigation, architecture and sometimes page layout to improve a website’s overall search engine visibility or long tail traffic performance? These are questions that should occur to an SEO on a regular basis – but coming to a reliable answer is not always straightforward.

Seeking answers to indexation and site architecture related questions is a worthy cause, but achieving a meaningful answer is a significant hurdle to overcome. All of the (excellent) resources on this topic tend to approach indexation from the perspective of analytics data, or content grouped together inside sitemaps. What about individual pages, though? I use the term page level indexation, because I’m seeking a granular, page level answer to my indexation questions.

What data sources can tell us how a website has been indexed by Google?

For me, there are a number of approximate indicators that a page (or page group) is indexed – for example, reporting on pages that receive at least one entry from Google via Analytics. You might wish to take a look at the “URLs in Web Index” report in the sitemaps section of Webmaster Tools. Savvy webmasters and SEOs may even use multiple sitemaps to get clearer page group level insight.

Logfiles will tell you a lot about GoogleBot visits, but not indexation, so where else can we look for inspiration?

Number of pages included in Web index according to Google

The number of pages included in the web index can be expressed as a percentage, but exactly which URLs are included?

Collecting indexed pages data using the “cache” query

As a method to complement your existing approach, you might find this methodology quite interesting. The outcome will be a page by page URL list for your site, where Google cache data, SEOmoz custom crawl and XENU data will give you a cracking starting point to diagnose your indexation problems. These steps involve having a Mozenda account, although you could do the same (or similar) by building your own crawler or using 80Legs.

Collect Google cache data via a proxy

Fundamentally we’re going to be executing a series of Google cache queries via a proxy. With Mozenda you have to have a method of distributing queries to Google via a proxy, and even then it gets unreliable quickly if you overcook the requests. If you use a simple PHP proxy and go very, very slowly, you’ll probably be alright.

Get a URL list

For this, you’ll need a list of all of the URLs your site can generate. The easiest way to get to this list is to extract all of the URLs from your XML sitemap(s) or ask your developer. Remember that, if you crawl your site with XENU you might miss orphaned pages.
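If you'd rather script the sitemap step than copy URLs by hand, a minimal Python sketch might look like the one below. It assumes your sitemap follows the standard sitemaps.org 0.9 format. The function name `urls_from_sitemap` and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> value found in a sitemap (or sitemap index)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, for illustration only.
example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/category/seo/</loc></url>
</urlset>"""

for url in urls_from_sitemap(example):
    print(url)
```

Because the function collects every `<loc>` element, running it on a sitemap index returns the child sitemap addresses, which you would then fetch and parse in turn.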

Build a Mozenda agent scraper

Your crawler needs to execute the Google cache query and should be configured to capture the URL and cache date.

http://webcache.googleusercontent.com/search?sourceid=chrome&ie=UTF-8&q=cache:[URL]

Would result in:

cache date from Google

If no results are found, your agent needs to be able to record the alternative result. When you’re happy with the agent you’ve built, upload and run the agent. Execute this very slowly (proxy in image is a publicly available service – proceed with caution).
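For anyone building their own crawler instead of a Mozenda agent, here is a rough Python sketch of the two pieces described above: building the cache query for each URL, and recording a row whether or not a cached copy was found. The query template comes from this post. The percent-encoding of the target URL, the function names and the row layout are my own assumptions.

```python
from urllib.parse import quote

# Query template from the post; [URL] is swapped for each page in the list.
CACHE_TEMPLATE = (
    "http://webcache.googleusercontent.com/search"
    "?sourceid=chrome&ie=UTF-8&q=cache:{url}"
)


def cache_query_url(page_url):
    """Build the cache query for one page (encoding the URL is an assumption)."""
    return CACHE_TEMPLATE.format(url=quote(page_url, safe=""))


def result_row(page_url, cache_date):
    """Return one output row; pass cache_date=None when no cached copy was found."""
    return {
        "url": page_url,
        "cached": cache_date is not None,
        "cache_date": cache_date or "",
    }


print(cache_query_url("http://www.example.com/category/seo/"))
print(result_row("http://www.example.com/category/seo/", None))
```

Writing a row for the negative result as well means the finished list still covers every URL you started with, which matters later when you go looking for pages that are missing from the cache.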

running

Mozenda running my cache agent:

cache data being collected

Combine your new data with a few other sources

While your cache scraper is running, think about where else you could gain insight through combining your data sources. Let’s not forget we’re trying to locate pages that are not indexing. Some of the data points you could include may be:

- Click depth of URL from home page
- Internal links out from page
- Internal links in to page
- Meta robots
- X-Robots in server header
- Status code response

All of these data points can be gathered from two sources – Xenu’s Link Sleuth and the SEOmoz Custom Crawl tool. Xenu needs little introduction, but few know that click depth and internal links in and out of a page are part of the available data. SEOmoz’s Custom Crawl is awesome, and includes data on the server header response, contents of the X-Robots tag, meta title and rel canonical target.

Custom crawl

Having a list of all URLs on your site, with a definitive answer on click depth, number of internal links and the Google Cache status is a very interesting piece of data to have, but (of course), it can be extended even further.
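Once the sources sit side by side, the first check is simply whether a page's cache status agrees with what its status code and robots directives say should happen. Here is a hedged Python sketch of that check. The field names (`status`, `meta_robots`, `x_robots`, `cached`) are hypothetical, so rename them to match your own export.

```python
def should_be_indexed(row):
    """True when nothing in the crawl data tells Google to skip the page."""
    directives = (row["meta_robots"] + "," + row["x_robots"]).lower()
    return row["status"] == 200 and "noindex" not in directives


def indexation_problems(rows):
    """Yield (url, problem) where the cache status contradicts the page setup."""
    for row in rows:
        if should_be_indexed(row) and not row["cached"]:
            yield row["url"], "indexable but not cached"
        elif not should_be_indexed(row) and row["cached"]:
            yield row["url"], "cached despite noindex or non-200 status"


# Made-up rows, for illustration only.
example_rows = [
    {"url": "/post-a/", "status": 200, "meta_robots": "", "x_robots": "", "cached": False},
    {"url": "/page/2/", "status": 200, "meta_robots": "noindex,follow", "x_robots": "", "cached": True},
    {"url": "/post-b/", "status": 200, "meta_robots": "", "x_robots": "", "cached": True},
]

for url, problem in indexation_problems(example_rows):
    print(url, "->", problem)
```

Click depth and internal link counts are deliberately left out of the check. They are better used afterwards, to explain why a flagged page might be struggling.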

If you’re looking for a larger crawl of your site, but the same data, Adam from SEOmoz has pointed out you can get 10,000 pages + (depending on your membership level) crawled and exported from the SEOmoz Pro Account:

seomoz pro

You can find this data via the “Crawl Diagnostics” tab in your campaign dashboard. Thanks Adam!

Content grouping

Most websites have a relatively simple approach to content types via their URL formation. This blog, for example, uses “/category/” in the URL to indicate the category content type. Paginated URLs might appear as “/page/*/”. If you’re a retail site, perhaps your product pages contain “/product/”.

By using an Excel query to group your content types, you’ll have the ability to get a sense of overall indexation in an area of your site, without having to group the sitemaps together. Try something like:

=NOT(ISERROR(SEARCH("[URL CHUNK]",Table3[[#This Row],[URL]],1)))

Where “[URL CHUNK]” could be “/page/”, “/products” or whatever. The outcome is “TRUE” if your URL belongs to a recognised group, and “FALSE” if it doesn’t.
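If your URL list lives outside Excel, the same test is easy to reproduce. The Python sketch below mirrors the formula with a case-insensitive substring check (Excel's SEARCH ignores case too), then assigns each URL to the first group it matches. The group names, their order and the example URLs are my own choices. Unlike SEARCH, this version does not treat `*` or `?` as wildcards.

```python
def in_group(url, chunk):
    """Mirror the Excel formula: True when chunk appears anywhere in the URL."""
    return chunk.lower() in url.lower()


# Checked in order, so a paginated category page is counted as "paginated".
GROUPS = [
    ("paginated", "/page/"),
    ("category", "/category/"),
    ("product", "/product/"),
]


def content_group(url):
    """Return the name of the first matching group, or 'other'."""
    for name, chunk in GROUPS:
        if in_group(url, chunk):
            return name
    return "other"


for url in (
    "http://www.example.com/category/seo/",
    "http://www.example.com/category/seo/page/2/",
    "http://www.example.com/about/",
):
    print(content_group(url), url)
```

Returning a single label per URL, instead of one TRUE/FALSE column per group, makes it simpler to count cached and uncached pages by group later on.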

Entries via Google to URL

With a simple VLOOKUP, you can combine traffic numbers by URL in your indexation data. This might help highlight pages that *should* have a little traffic from Google, but don’t – or at least you’ll have another point of reference for your investigations.
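Outside a spreadsheet, the VLOOKUP step amounts to a dictionary lookup keyed on URL. Here is a minimal Python sketch, assuming your indexation rows are dictionaries with a `url` key and that your landing page export has already been read into a URL-to-entries mapping. Both structures, and the `google_entries` column name, are hypothetical.

```python
def add_entries(rows, entries_by_url):
    """Attach Google entries to each row; URLs with no recorded traffic get 0."""
    return [
        dict(row, google_entries=entries_by_url.get(row["url"], 0))
        for row in rows
    ]


# Made-up figures, for illustration only.
rows = [
    {"url": "/category/seo/", "cached": True},
    {"url": "/tag/indexation/", "cached": True},
]
entries = {"/category/seo/": 42}

for row in add_entries(rows, entries):
    print(row)
```

Where VLOOKUP returns #N/A for a URL missing from the traffic export, this version records a zero. That is usually what you want here, since cached pages with zero entries are exactly the ones worth a closer look.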

Landing Pages

The end result

Here’s a screenshot of the example data I built while writing this post. You’ll see all of the data I’ve mentioned in this post, along with a number of “content groups” I found most relevant to my blog. There are some properly configured duplicated pages with SEOgadget which, I can confidently report, are not cached, nor are they generating traffic. My data tells me that the paginated URLs on the homepage, category and tag pages are properly set to noindex but that those pesky comment pages (where a blog post has more than a certain level of comments, we paginate them) are misbehaving (they should be set to noindex). Time to roll my sleeves up.

Indexation Data

I hope you can see from this screenshot how you might benefit from combining data into a single point to identify, diagnose and fix indexation issues on your site. Of course there are other data sources out there, and we’ve not touched on the visual aspect of representing this data, which I’m saving for another post.

In the meantime, I’d really like to hear your thoughts, particularly on the data you might choose to help diagnose your architecture and indexation issues.

Page Level Search Engine Indexation [Data & Collection Methodology] is one of our latest posts from: SEOgadget.co.uk.

May 30 2010

A few weeks ago I got quite interested in measuring true indexation levels and potential value metric thresholds that may or may not affect the likelihood of a page being indexed in Google. After collecting some initial data I realised what a huge undertaking that passing interest actually was.

indexation?

Image by: Nancy Wombat

There’s no conclusion in this post, at least not yet. This is more a discussion of what I’ve observed so far while collecting and working with data that could answer some of the questions I have. The tweet below says it all – I think I’ve had as much fun getting the data as I have working with it.

twitter conversation

What is indexation?

I became interested in collecting data that could help me understand indexation levels on a website. Actually defining the meaning of indexation, though, is an important first step. I’m of the opinion that “indexation” means the number of pages from a website that are included in Google’s index. “Indexation” shouldn’t mean “rank”, because other factors (authority metrics, relevance) play a role in any given URL ranking for a specific query in a search engine. A page can be indexed, but it might not rank in a position for a query that any normal search engine user (non-SEO) would ever see.

This idea raises the question – is indexation the number of pages that receive one or more entries from a search engine over a given period of time? Analytics data is only one source of information on the performance of any given URL and I’ve led myself to the conclusion that analytics numbers only become powerful when combined with other data sources.

Combining data sources for an overall impression of indexation in Google

In a quest to construct a better impression of indexation on my example site, I set about on a data collection mission. First, I’ll describe what data I’ve been collecting:

  • All URLs on a domain
  • All URLs that have an internal link (Google Webmaster Tools)
  • The response (positive or negative) to a Google cache: query for each URL
  • Analytics entries to each of the URLs
  • MozRank for each URL
  • PageRank for each URL

Methods to collect the data (for the non developer)

Getting your hands on a snapshot of all URLs on a site is relatively easy with a tool like Xenu’s Link Sleuth. Just make sure that URLs don’t time out during the crawl, and if they do, recrawl those URLs. If you have a site of, say, fewer than 3000 pages you could give the Custom Crawl prototype a try at SEOmoz.

Google Webmaster Tools data can be very useful, particularly the internal links report. The data on all URLs with at least one internal link tells us that Google has discovered the URL with an internal link. A fair assumption would be that the URLs listed in this report have also been crawled. That’s the assumption I make in my data, but I’m always really pleased to hear whether you think this is correct.

To gather cache data from Google, I opted to recruit the new kid on the SEO tools block, Mozenda. In principle, you’re using Mozenda to scrape Google cached pages, recording the cache date, URL, cache time and taking note of what I call a fail safe. A fail safe in a Mozenda crawl is an item of text you’ll only find on a positive result for a cache query. For example “This is Google’s cache” only appears in text if the query result for a cached page is positive. I use a fail safe because I noticed the crawl agent was missing some data on occasional crawl cycles.

It’s really easy to construct an agent to do this kind of thing, and I suspect using 80Legs is quite simple too.
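To make the fail safe idea concrete, here is a small Python sketch of the check an agent could run on the text of each result page. The fail safe phrase is the one quoted above. The wording of the snapshot line used to pull out the date ("as it appeared on ... GMT") is my assumption about the page text, so adjust the pattern to whatever your results actually contain.

```python
import re

FAIL_SAFE = "This is Google's cache"  # phrase quoted in the post
# Assumed wording of the snapshot line; change this if the page reads differently.
SNAPSHOT_RE = re.compile(r"as it appeared on (.+? GMT)")


def parse_cache_page(text):
    """Return (cached, cache_date); cache_date is None when it cannot be read."""
    normalised = text.replace("\u2019", "'")  # tolerate a typographic apostrophe
    if FAIL_SAFE not in normalised:
        return False, None
    match = SNAPSHOT_RE.search(normalised)
    return True, (match.group(1) if match else None)


# Hypothetical page text, for illustration only.
sample = (
    "This is Google's cache of http://www.example.com/. It is a snapshot "
    "of the page as it appeared on 23 Jan 2011 10:20:30 GMT."
)
print(parse_cache_page(sample))
print(parse_cache_page("No cached copy here"))
```

A result of `(True, None)` is the case the fail safe exists to catch: the page was a positive cache result, but the date was not captured, so that URL is worth re-running.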

Mozenda and 80Legs Logo

A quick note on 3rd party crawlers

If you’re going to crawl Google to scrape their data, execute the agent via your own proxy. PHP proxies are really easy to deploy. Go easy on crawl rate too – with new capabilities for SEO data collection comes data greed, executing too many requests at once and at too fast a rate. If you’re doing this, you’re ultimately risking your own ability to collect data at all. If data scrapers are working from a handful of IP addresses, I’m quite sure they’ll be blocked from making requests by the big guys like Google, Amazon, et al, eventually.
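If you are writing your own crawler, going easy on crawl rate can be as simple as a generator that hands out one URL at a time and pauses in between. The sketch below is one way to do it in Python. The default delay and jitter values are arbitrary placeholders of mine, not recommended settings.

```python
import random
import time


def paced(items, min_delay=20.0, jitter=10.0, sleep=time.sleep):
    """Yield items one at a time, pausing between them (never before the first).

    Each pause lasts min_delay seconds plus a random extra of up to jitter
    seconds, so requests do not arrive at a perfectly regular interval.
    """
    for index, item in enumerate(items):
        if index:
            sleep(min_delay + random.uniform(0, jitter))
        yield item
```

You would wrap your URL list with it, for example `for url in paced(url_list): ...`, and make a single request per iteration. The `sleep` argument is only there so the pause can be swapped out when testing.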

If you want to do a serious site crawl, say 100,000 page load requests or more, expect to spend something in the region of a total of $249 for the bandwidth and $399 for the registration.

Back to the data collection

Analytics data plays a role in my data set, using the &limit= query string to ensure that all of the landing page data from “Traffic Sources > Search Engines > Google > Landing Page” is neatly extracted in as few CSV exports as possible.

MozRank can be scraped quite easily using Mozenda via the free SEOmoz API (or if you’re a developer, a quick PHP script should be quite easy). I captured PageRank in a similar manner.

A sample of the results so far

Here’s a sample chart of the data showing a selection of subfolder metrics:

chart

In this chart I’m looking at taxonomy subfolders such as category and tag based content. The chart shows the number of cached pages in each subfolder, the number of pages in the subfolder that have PageRank and a count of URLs that received one or more entries from Google organic search. The folders above are likely to attract few if any external links, and generate many URLs through the sheer number of tags assigned and large levels of paginated navigation. From an indexation point of view it feels like this type of URL is a great starting point to observe quirky or interesting indexation behaviour.

I found it quite fascinating that many pages in the tags subdirectory are cached, but proportionally fewer have PageRank and drive any traffic. Tag pages are not like normal web pages, in that there are many pages which are all slightly different by one or two words on each page. Regardless of a lack of diversity, you’d expect (or hope) that they’d be capable of generating more long tail traffic than they actually do. In reality (I’ve seen this many many times) default tag page templates tend to drive little traffic in real world applications.

The chart above makes more sense when you add the total number of URLs in each subfolder, although I apologise in advance that the colour scheme changes!

chart 2

An indexation ratio

Is a measure of indexation best described by a ratio? What role can a quality indicator play in this ratio? Certainly my initial thoughts are to take a folder by folder approach looking at the number of URLs vs indexed (cached) URLs. This is where I think analytics data can really play a role, in helping understand how “employed” all content is in any specific area of a site. I’m going to be thinking about this more in the near future.
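As a starting point for the folder by folder approach, here is a rough Python sketch that computes cached URLs divided by total URLs for each top level subfolder. Defining the ratio this way, and grouping by the first path segment, are my working assumptions and not a settled definition of indexation.

```python
from collections import defaultdict
from urllib.parse import urlparse


def top_folder(url):
    """Return the first path segment as a folder, e.g. '/tag/' for '/tag/seo/'."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # A single segment without a trailing slash is treated as a file in the root.
    if not segments or (len(segments) == 1 and not path.endswith("/")):
        return "/"
    return "/" + segments[0] + "/"


def indexation_ratios(pages):
    """Map each top level folder to cached URLs / total URLs.

    pages is an iterable of (url, is_cached) pairs.
    """
    totals = defaultdict(int)
    cached = defaultdict(int)
    for url, is_cached in pages:
        folder = top_folder(url)
        totals[folder] += 1
        cached[folder] += bool(is_cached)
    return {folder: cached[folder] / totals[folder] for folder in totals}


# Made-up data, for illustration only.
pages = [
    ("http://www.example.com/tag/seo/", True),
    ("http://www.example.com/tag/ppc/", False),
    ("http://www.example.com/category/seo/", True),
]
print(indexation_ratios(pages))
```

The same structure would work for an "employed" ratio: swap the cached flag for whether the URL received one or more entries from Google.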

What’s next?

There are some questions I’d like to continue to attempt to answer, most notably, is there a quality threshold below which the likelihood of a URL being “indexed” is much lower? The early data just tells me that I need to collect more data, studying a larger site with a higher level of indexation issues. Certainly since the May Day update, I have a general sense that regardless of relevance, a page without the right quality signals might struggle to rank or stay visible in the main index. Getting a complete picture of those quality signals is very hard, particularly with the lack of granularity in PageRank values, and completeness in crawl with 3rd party link analysis tools.

I’d love to hear comments or suggestions on how you think indexation should be measured and, based on the data sources I mentioned above, how you would report site (or subfolder) indexation levels. My work here is far from complete and I’d be delighted to hear from anyone who has thoughts on the topic.

Measuring Indexation Levels in Site Architecture is one of our latest posts from: SEOgadget.co.uk, UK SEO consultants helping people and organisations succeed in search.