Friday, July 18, 2008

The basics of Text Mining

The basic needs of any text mining effort are:
  • Text processing (formatting, cleaning, decoding, encoding, etc.)
  • Determination of collocations (words that carry more meaning together than apart; 'United States of America', for example, is a single concept and should be grouped together).
  • Determination of parts of speech/role.
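
To make the first two steps concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not how any of the vendors mentioned in this post actually do it: the function names `clean` and `collocations`, the sample text, and the choice of pointwise mutual information (PMI) as the collocation score are all my own assumptions. Part-of-speech tagging is left out because it needs a trained tagger.

```python
import html
import math
import re
from collections import Counter

def clean(text):
    """Decode HTML entities, strip tags, lowercase, and split into word tokens."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z']+", text.lower())

def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by PMI, highest first (hypothetical helper)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (a, b), n in bigrams.items():
        if n < min_count:
            continue  # pairs seen only once are too noisy to score
        # PMI = log2( P(a,b) / (P(a) * P(b)) ), with counts over `total` tokens
        scores[(a, b)] = math.log2(n * total / (unigrams[a] * unigrams[b]))
    return sorted(scores, key=scores.get, reverse=True)

# Made-up sample text with markup and an encoded ampersand.
sample = ("The United States signed the treaty. <b>United States</b> "
          "officials said the treaty &amp; the accord stand.")
ranked = collocations(clean(sample))
print(ranked)
```

Pairs whose words appear together more often than their individual frequencies would predict rise to the top, which is the property that lets a system treat a phrase as one concept.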

You can tell how advanced a system is by what it does with the text before analysis. Digital Reasoning, Attensity and others do all three before analysis. This is a key factor in "exhaustive extraction" and in the creation of advanced structures like associative networks. Without understanding the semantic structure, how can one determine the actual meaning of the elements?

I've been getting more and more frustrated with search engines. You have to jump through so many hoops to do the kinds of searches I've gotten used to with semantics-aware engines. "Remains" is both a predicate and an entity, depending on how it is used. To find exactly what I am looking for, I have to put in the term and look at what comes back. When I see that it is bringing back a lot of incorrect cases, I help it disambiguate by adding negative examples. So if I search on "cold" and get a lot of results about low temperature when I was really interested in the respiratory illness, I add "-temperature". That is not bad, unless some article I want uses temperature in the sense of fever ('running a temperature'), in which case I have just filtered it out.
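
The "cold" problem is easy to reproduce. The sketch below is a toy keyword matcher, not how any real search engine is implemented; the function `keyword_search` and the three documents are invented for illustration.

```python
import re

def keyword_search(docs, term, exclude=()):
    """Return docs that contain `term` and none of the `exclude` words."""
    excluded = set(exclude)
    hits = []
    for doc in docs:
        words = set(re.findall(r"[a-z]+", doc.lower()))
        if term in words and not (words & excluded):
            hits.append(doc)
    return hits

# Made-up documents: one about weather, two about the illness.
docs = [
    "A cold front will drop the temperature overnight.",
    "Rest and fluids help most people get over a cold.",
    "She caught a cold and was running a temperature by evening.",
]

everything = keyword_search(docs, "cold")
filtered = keyword_search(docs, "cold", exclude=("temperature",))
print(len(everything), len(filtered))
```

The negative keyword removes the weather document, but it also removes the third document, which is about the illness. A matcher that only sees words has no way to tell the two senses of "temperature" apart.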

When you are evaluating text mining solutions, it is important to make sure they can provide this kind of sense-level disambiguation. Google is a 1990s technology. Simple as it is, it does give access to a lot of information, but its simplicity also makes it a lot of work when we are examining large volumes of data with subtle search requirements. In fact, the more important it is to find "The Document" rather than general documents related to the search topic, the more vital a semantic layer becomes.

Consider this: if a vendor is trying to sell you a solution to determine "customer voice" (or, as I've heard it called elsewhere, sentiment analysis), ask them how it knows when "sucks" carries a negative connotation in vacuum cleaner reviews. They may laugh and say that is a minor example, but the fact is that if they aren't dealing with it, they also aren't dealing with a lot of other factors. Keywords are no longer a useful technology. Semantic understanding is required for subtle detection. Anyone who tells you differently hasn't a leg to stand on.
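
Here is the vacuum cleaner question as a toy sketch. Everything in it is an assumption for illustration: the word list, the function names, and especially the single hand-written rule ("sucks" followed by "up" describes suction), which merely stands in for the real part-of-speech and role analysis a semantic system would perform.

```python
import re

# Hypothetical keyword lexicon of "negative" words.
NEGATIVE_WORDS = {"sucks", "terrible", "broken"}

def keyword_sentiment(review):
    """Flag a review as negative if any lexicon word appears anywhere."""
    words = re.findall(r"[a-z]+", review.lower())
    return "negative" if any(w in NEGATIVE_WORDS for w in words) else "not negative"

def context_aware_sentiment(review):
    """Same lexicon, but skip 'sucks' when the next word is 'up' (toy rule)."""
    words = re.findall(r"[a-z]+", review.lower())
    for i, word in enumerate(words):
        if word not in NEGATIVE_WORDS:
            continue
        if word == "sucks" and i + 1 < len(words) and words[i + 1] == "up":
            continue  # 'sucks up' describes what the vacuum does, not a complaint
        return "negative"
    return "not negative"

praise = "This vacuum sucks up pet hair in one pass."
complaint = "This vacuum sucks, it broke in a week."
for review in (praise, complaint):
    print(keyword_sentiment(review), "|", context_aware_sentiment(review))
```

The keyword version marks both reviews negative. Even one crude contextual rule separates them, and a system that models how the word is actually used will catch far more cases than any rule list.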
