Over the last year, there have been few discussions that I’ve been forced to observe with greater disinterest than the one regarding LSI. “Does Google use LSI? Are there patents for LSI? What did they buy from Applied Semantics? Can I outrank people by ‘chasing LSI’[sic]?” Watching this completely irrelevant discourse regarding LSI reminds me of one of my favorite quotes.
“This isn’t right. This isn’t even wrong.” -Wolfgang Pauli
Latent Semantic Analysis (LSA), called Latent Semantic Indexing (LSI) when it is applied to information retrieval, is an algorithm. An idea. A tool. It has many applications for which it is ideally suited, and some for which it is not. Some of those applications include, but are not limited to (from Wikipedia):
- Compare the documents in the concept space (data clustering, document classification)
- Find similar documents across languages, after analyzing a base set of translated documents (cross language retrieval)
- Find relations between terms (synonymy and polysemy)
- Given a query of terms, translate it into the concept space, and find matching documents (information retrieval)
Obviously, if you are a search engine, the information retrieval application would be the most interesting. If you have a huge number of research abstracts that you are trying to categorize for research purposes, you’d be most interested in document classification. Cross language retrieval would be right up your alley if you are tracking the history and evolution of Indo-European languages using their dated written histories. If you are trying to discover all of the ways that a particular concept is thought of and referenced, you’d focus on synonymy and polysemy.
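To make "concept space" a little less abstract, here is a minimal sketch of the information retrieval case in plain Python. Everything in it is an assumption made for illustration: the five toy documents, the choice of two concepts, and the helper names (`top_singular_triplets`, `to_concept_space`) are all hypothetical. LSA is normally computed with a library SVD routine; to keep the sketch dependency-free, a simple power iteration on the document Gram matrix is swapped in to find the top singular vectors. The query is folded in the textbook way (project onto the term vectors, divide by the singular values) and compared to the documents by cosine similarity.

```python
import math

# Toy corpus (an assumption for illustration, not real data).
docs = [
    "car engine repair",
    "automobile engine repair",
    "automobile car dealer",
    "banana fruit smoothie",
    "fruit banana bread",
]

vocab = sorted({w for d in docs for w in d.split()})
# Term-document count matrix: one row per term, one column per document.
A = [[d.split().count(t) for d in docs] for t in vocab]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def normalize(x):
    n = math.sqrt(dot(x, x))
    return [a / n for a in x]


def top_singular_triplets(A, k, iters=500):
    """Top-k (sigma, u, v) of A via power iteration with deflation on A^T A."""
    n_docs = len(A[0])
    cols = [[row[j] for row in A] for j in range(n_docs)]
    G = [[dot(ci, cj) for cj in cols] for ci in cols]  # Gram matrix A^T A
    triplets = []
    for _concept in range(k):
        v = normalize([1.0] * n_docs)
        for _step in range(iters):
            v = normalize([dot(row, v) for row in G])
        lam = dot(v, [dot(row, v) for row in G])  # Rayleigh quotient = sigma^2
        sigma = math.sqrt(lam)
        u = [dot(row, v) / sigma for row in A]  # left vector: u = A v / sigma
        triplets.append((sigma, u, v))
        # Deflate so the next pass converges to the next-largest component.
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(n_docs)]
             for i in range(n_docs)]
    return triplets


def to_concept_space(query, triplets):
    """Fold a query into the concept space: (u . q) / sigma per concept."""
    q = [query.split().count(t) for t in vocab]
    return [dot(u, q) / sigma for sigma, u, _ in triplets]


def cosine(x, y):
    nx, ny = math.sqrt(dot(x, x)), math.sqrt(dot(y, y))
    return dot(x, y) / (nx * ny) if nx and ny else 0.0


triplets = top_singular_triplets(A, k=2)
# Each document is represented by its coordinates on the k concepts.
doc_vectors = [[v[j] for _, _, v in triplets] for j in range(len(docs))]
query_vector = to_concept_space("car", triplets)
scores = [cosine(query_vector, dv) for dv in doc_vectors]

for doc, score in sorted(zip(docs, scores), key=lambda p: -p[1]):
    print(f"{score:+.3f}  {doc}")
```

On this toy corpus, "automobile engine repair" scores about as high as the documents that actually contain the word "car", while the two fruit documents score near zero. That is the synonymy effect from the list above showing up in a retrieval setting. A real corpus is nowhere near this tidy, so treat the numbers as an illustration of the mechanics, not as a benchmark.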
The false controversy appears to be related to two of the four examples above: information retrieval, and synonymy and polysemy. For some reason, people seem to feel that because a search engine does not use LSI for its information retrieval, it has no place in the construction of a website. That makes about as much sense as saying “Google doesn’t use MacBooks in their data centers so I shouldn’t buy one.” When it comes to building websites, the question of whether or not any search engine uses LSI for results is not only wrong, it isn’t even valid … they are different applications. Let me say that again, louder.
LATENT SEMANTIC ANALYSIS HAS MANY DIFFERENT APPLICATIONS. WHETHER OR NOT ANY SEARCH ENGINE USES LATENT SEMANTIC ANALYSIS FOR ANYTHING HAS NO BEARING WHATSOEVER ON WHETHER OR NOT IT’S USEFUL FOR BUILDING A WEBSITE.
If you remember only one thing from this article, let it be that. There is nothing to debunk, there is nothing to disprove, there is no question here, there isn’t even a conversation to be had. The fact that there has been one anyway speaks volumes about the lack of rigor and clarity in the industry.
So if everybody is asking the wrong question about LSI, what is the right question?
- Kelley








