Natural Language Processing|April 11, 2009 11:39 pm

Introducing Pseugle

Actually, there is no such thing as Pseugle; it’s just a convenient name, derived from pseudo-Google, for a proposed process for evaluating ranking factors using simple statistical correlation. The idea is to compare a generated index (pseudo-Google) against the Google index (or that of any search engine) to see how closely the two agree. We’ll define a ranking factor as some attribute, either on-page or off, that is believed to contribute to a higher ranking in a search engine. Could be inbound links, inbound anchor text, PageRank, theme density, keyword density, meta-content, phases of the moon, whatever floats your boat.

Typical ranking factor research is done by owning many websites: one records where the pages rank, adjusts them for the ranking factor, waits a while, then checks the rankings again. There are many problems with this method, but the largest are sample size and time. So many unknown variables could affect your rank as time elapses that, even with thousands of pages, something that seems conclusive is likely indistinguishable from noise. Analyzing only your own pages is insufficient as well .. you’d also need to analyze all of the pages you were up against, both at the time of the initial ranking and later, to make sure they didn’t make the same ranking factor modification that you did (or remove it). To give an even reasonably statistically significant answer you’d probably need to check a few thousand terms and at least the top 100 pages for each term. And of course, with that data set, it’s no longer a matter of simple ‘up’ or ‘down’; reaching a conclusion is now quite a bit more complicated. One person might have great results while another suffers using exactly the same ranking factor, because the testing methodology was insufficient.

To more accurately test any given ranking factor (or combination of ranking factors) without even running your own website, I propose creating and correlating a Pseugle Index using the following methodology:

1. Fetch a set of pages from a search engine with their rankings
2. Rank those pages using your proposed ranking factor(s)
3. Check the statistical correlation between the two to see if it’s positive, negative, or irrelevant

That’s really all there is to it. If you have a high positive or negative correlation, congratulations! Your ranking factor probably means something. The best part is that you don’t need to wait around. If it correlates positively, sites that use your ranking factor(s) will already be weighted appropriately.
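The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the tooling behind this post: the page names and inbound link counts are made up, `pseugle_ranks` is a hypothetical helper, and the correlation is assumed to be Pearson’s r computed on rank positions (which is the same thing as Spearman’s rank correlation when there are no ties).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def pseugle_ranks(urls, score):
    """Rank urls 1..n by a scoring function, highest score first.

    Ties are broken by input order; a real analysis would want a
    deliberate tie-handling rule.
    """
    ordered = sorted(urls, key=score, reverse=True)
    position = {url: i + 1 for i, url in enumerate(ordered)}
    return [position[url] for url in urls]


# Step 1: pages in the order the search engine returned them (made-up data).
engine_results = ["a.example", "b.example", "c.example", "d.example", "e.example"]
engine_ranks = list(range(1, len(engine_results) + 1))

# Step 2: rank the same pages by a proposed factor (made-up inbound link counts).
inbound_links = {
    "a.example": 120,
    "b.example": 300,
    "c.example": 45,
    "d.example": 60,
    "e.example": 10,
}
my_ranks = pseugle_ranks(engine_results, inbound_links.get)

# Step 3: correlate the two rankings.
r = pearson_r(engine_ranks, my_ranks)
print(my_ranks, round(r, 3))  # [2, 1, 4, 3, 5] 0.8 for this toy data
```

A single r from five pages means nothing on its own; the point is only to show the shape of the calculation that gets repeated for every term.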

The catch. There is always a catch. In order to properly perform this analysis, you’ll need to run it on a decent sample set. Let’s work an example using the top 50 pages for each of 500 terms.

First you have to fetch the rankings for each term. Rankings are different for 50 at a time vs 10 at a time (we’ll discuss this another time), so fetch them 10 at a time: that’s 5 queries per term, or 2500 queries (5*500) to your search engine of choice. If your ranking factor is something that requires you to fetch each page, that’s an additional query for each page, or 25000 (50*500) queries. That brings us to 27500 queries just to analyze the factors present on the pages themselves. Want to test PageRank? That’s another 25000 queries. Want to check domain age? That’s an upper bound of another 25000 queries with your tool of choice. Want to check inbound links using some other tool? Another 25000 queries. Yah .. it gets expensive, but that doesn’t mean it can’t be done.
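The arithmetic above is easy to redo for other sample sizes. Here is a small sketch; `query_budget` and its parameter names are made up for illustration, and it assumes every per-page factor costs exactly one query per page (for domain age that is an upper bound, as noted above).

```python
from math import ceil


def query_budget(terms, pages_per_term, results_per_query=10, per_page_factors=1):
    """Return (ranking queries, per-page queries, total) for one analysis run."""
    ranking_queries = terms * ceil(pages_per_term / results_per_query)
    page_queries = terms * pages_per_term * per_page_factors
    return ranking_queries, page_queries, ranking_queries + page_queries


# On-page factors only: 500 terms, top 50 pages each, 10 results per query.
on_page_only = query_budget(terms=500, pages_per_term=50)
print(on_page_only)  # (2500, 25000, 27500)

# Page fetch plus PageRank, domain age, and inbound links: four lookups per page.
everything = query_budget(terms=500, pages_per_term=50, per_page_factors=4)
print(everything)  # (2500, 100000, 102500)
```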

This is all fine in theory, but does this actually work? Let’s find out!

First we’ll create a scatter plot of two data sets which are ranked randomly. If the method is sound, the average r should be 0 and there should be very little variance. Each data point represents a single r, and each r is the correlation between two sets of 100 pieces of data.

[Figure: Random Pseugle Index Correlation]

As you can see .. this is a pretty tightly packed group that hovers around irrelevant, and it’s easy to see that our average is r = 0.
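The random baseline is simple to reproduce. The sketch below is an assumption about how such a test could be set up, not the script behind the plot: it shuffles two lists of 100 rank positions, correlates them, and repeats. The trial count (500) and the seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def random_trial(rng, size=100):
    """Correlate two independent random rankings of `size` items."""
    first = list(range(1, size + 1))
    second = list(range(1, size + 1))
    rng.shuffle(first)
    rng.shuffle(second)
    return pearson_r(first, second)


rng = random.Random(2009)  # fixed seed so the run is repeatable
rs = [random_trial(rng) for _ in range(500)]  # one r per simulated term

mean_r = mean(rs)
spread = stdev(rs)
print(f"mean r = {mean_r:.3f}, standard deviation = {spread:.3f}")
```

The spread of the random r values depends on how long the ranked lists are: longer lists give a tighter cluster around zero. That is worth keeping in mind when deciding how many pages per term to fetch.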

Next we’ll create a scatter plot of two data sets which are not so random. For the first data set, we’ll use the rankings in an unnamed search engine and for the second, we’ll haphazardly assemble a few commonly accepted ranking factors. (I’m not going to tell you which, because the point of this isn’t to recommend certain factors; it’s to show you how to test them more thoroughly.)

[Figure: Common Ranking Factor Correlation]

This one is all over the map, which means that my haphazard selection of ranking factors wasn’t that great, though the correlation was slightly positive (r = 0.096). This means my model is slightly better at predicting rankings than random, but not by a whole lot. Let’s try adding and tweaking a few ranking factors to see what happens.

[Figure: Tweaked Common Ranking Factors]

This one also has a fair amount of spread, but as you can see it’s a little more consistent, and our average is now r=0.163 with a smaller standard deviation, quite an improvement over what was there before. The idea is that the more accurate your Pseugle algorithm becomes, the more consistent and closer to 1 your r will be. If you could pull that off then you’d know, with high probability and accuracy, what you would have to do to achieve an arbitrary ranking. That is the real benefit of doing this kind of analysis .. if you have a good Pseugle, you can make predictions and approximate what you’d have to do (in other words, spend) to rank for something. Wouldn’t that be something?
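For anyone who wants to try the tweaking step, here is one way it could be set up. Everything in it is hypothetical: the factor names, the weights, and the page data are invented, and scoring pages by a weighted sum is just one simple way of combining factors, not a description of the model used for the plots above.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks_from_scores(scores):
    """Rank 1 goes to the highest score; ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def term_correlation(pages, weights):
    """pages: factor values per page, listed in search-engine order (rank 1 first)."""
    scores = [sum(weights[name] * page[name] for name in weights) for page in pages]
    engine_ranks = list(range(1, len(pages) + 1))
    return pearson_r(engine_ranks, ranks_from_scores(scores))


# Hypothetical weights for two made-up factors.
weights = {"links": 1.0, "keyword_density": 0.5}

# Made-up factor values for the top 4 pages of two terms.
results_by_term = {
    "term one": [
        {"links": 9, "keyword_density": 2},
        {"links": 7, "keyword_density": 4},
        {"links": 3, "keyword_density": 1},
        {"links": 4, "keyword_density": 0},
    ],
    "term two": [
        {"links": 2, "keyword_density": 2},
        {"links": 8, "keyword_density": 0},
        {"links": 5, "keyword_density": 2},
        {"links": 1, "keyword_density": 0},
    ],
}

rs = [term_correlation(pages, weights) for pages in results_by_term.values()]
mean_r = mean(rs)
spread = stdev(rs)
print(f"mean r = {mean_r:.3f}, standard deviation = {spread:.3f}")
```

Changing the weights and rerunning gives a new average r and standard deviation to compare against the previous run, which is all “tweaking” amounts to here.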

Let me wrap this up by reiterating that this is not an endorsement of any particular set of ranking factors or weights, just a proposed idea about how to go about testing them as accurately and comprehensively as possible. Think you know all of the important ranking factors and how relatively important each one is? Put it to the test!

- Kelley
