Friday, July 18, 2008

The basics of Text Mining

The basic needs of any text mining effort are:
  • Text processing (formatting, cleaning, decoding, encoding, etc.)
  • Determination of collocations (words that carry more meaning together than apart; 'United States of America', for example, is a single concept and should be grouped together.)
  • Determination of parts of speech/role.

You can tell how advanced a system is by what it does with the text before analysis. Digital Reasoning, Attensity, and others do all three before analysis. It is a key factor in "exhaustive extraction" and in the creation of advanced structures like associative networks. Without understanding the semantic structure, how can one determine the actual meaning of the elements?
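To make that concrete, here is a rough sketch of those three steps in Python using NLTK, which is just one toolkit you could use; the file name is a placeholder and the standard NLTK data packages are assumed to be installed.

```python
# Rough sketch of the three preprocessing steps; NLTK is one option, not the only one.
import nltk
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

raw = open("input.txt", "rb").read()
text = raw.decode("utf-8", errors="replace")       # 1. clean up / normalize the encoding

tokens = nltk.word_tokenize(text)

# 2. collocations: word pairs that occur together far more often than chance
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                        # ignore pairs seen fewer than 3 times
print(finder.nbest(BigramAssocMeasures.pmi, 20))   # e.g. ('United', 'States') style pairs

# 3. parts of speech / roles
print(nltk.pos_tag(tokens)[:10])
```

A real pipeline would do much more at each step, but even this little sketch shows the ordering that matters: the collocation and part-of-speech passes happen before any analysis is attempted.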

I've been getting more and more frustrated with search engines. You have to jump through so many hoops to do the kinds of searches I've gotten used to with semantics-aware engines. "Remains" is both a predicate and an entity, depending on how it is used. To find exactly what I am looking for, I have to put in the term and look at what comes back. When I see it is bringing back a lot of incorrect cases, I then help it disambiguate the responses by adding negative examples. So if I searched on "cold" and got a lot of responses about low temperature when I was really more interested in the respiratory blockage, I would add "-temperature", which is not bad unless some article I might want actually uses temperature in the sense of fever ('running a temperature'), in which case I would have just filtered it out.
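Here is a toy illustration (made-up documents, plain Python) of how that kind of negative-keyword filtering throws the baby out with the bathwater:

```python
# Toy example: naive keyword search plus a negative term over-filters.
docs = [
    "Patient presented with a cold and nasal congestion.",
    "Record cold temperatures hit the region this week.",
    "The child had a cold and was running a temperature of 101.",
]

query, excluded = "cold", "temperature"
hits = [d for d in docs if query in d.lower() and excluded not in d.lower()]

print(hits)  # only the first doc survives; the third (which we wanted) is gone
```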

When you are evaluating text mining solutions it is important to make sure they can provide this kind of functionality. Google is a '90s technology. Simple as it is, it does give access to a lot of information, but its simplicity also makes it a lot of work when we are examining data at large scale with subtle requirements in our search. In fact, the more important it is to find "The Document" rather than general documents related to the search topic, the more vital a semantic layer becomes.

Consider this: if a vendor is trying to sell you a solution to determine "customer voice", or as I've heard it referred to elsewhere, sentiment analysis, ask them how it knows whether "sucks" carries a negative connotation in vacuum cleaner reviews. They may laugh and say that is a minor example, but the fact is, if they aren't dealing with it they also aren't dealing with a lot of other factors. Keywords are no longer a useful technology. Semantic understanding is required for subtle detection. Anyone who tells you differently doesn't have a leg to stand on.
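If you want to see just how shallow the keyword approach is, here is a toy scorer; the word lists and review texts are invented for the example.

```python
# Toy keyword sentiment scorer, to show the failure mode described above.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"sucks", "terrible", "awful"}

def keyword_sentiment(review):
    words = review.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(keyword_sentiment("this vacuum really sucks up pet hair"))  # -1: wrongly negative
print(keyword_sentiment("i love how well it cleans"))             # +1: positive
```

A semantics-aware system has to notice that "sucks up pet hair" is the vacuum doing its job, not the customer complaining.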

Thursday, July 17, 2008

Arrrg. Working with other people's data

The biggest pain with text analytics and data mining is working with other people's data. Invariably it is all garbage: this file is ASCII, that file is UTF-8, this other file has some weird code page, that file is 7-bit. It can make you pull your hair out. This happens with data from every corporation and even the Federal Government. I was just working with the TraxIntel DB and was having trouble with the analysis missing what I thought was a lot of important information. Of course it was something as simple as files not being formatted the way I was expecting. That should have been expected, since TI pulls in data from a lot of different sources. The lesson is: pre-process all of your input and make sure it is formatted exactly the way you want.
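One way to handle it (a sketch only; the candidate encodings are a guess at what a mixed corpus might contain) is to normalize everything to Unicode before the analyzers ever see it:

```python
# Sketch: try the likely encodings in order; latin-1 accepts any byte sequence,
# so it acts as the catch-all (possibly mis-mapping characters, but never crashing).
def to_unicode(raw_bytes):
    for enc in ("utf-8", "cp1252"):
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw_bytes.decode("latin-1")
```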

I should have known better, because when I was working with the public collection of Enron emails they had all sorts of encodings in them that are unique to email but that aren't text you'd want to analyze. The simple solution was to find plain-text versions of the messages. The purist answer, of course, is to create a filter for the various potential file formats and fix the input before it hits the semantic analyzers. What a concept.
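For the email case specifically, Python's standard email module can pull out just the plain-text parts; something along these lines (a sketch, not the filter we actually used) would do it:

```python
# Sketch: strip email/MIME structure so only plain text reaches the analyzers.
import email

def plain_text_body(raw_message):
    msg = email.message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)   # undoes base64/quoted-printable
            if payload:
                parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)
```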

Anyway, the point of this post is that sometimes when you are debugging this stuff it isn't the high-level models and complex software that are broken. "Garbage in means garbage out." A lesson as old as the hills.

Tuesday, July 15, 2008

Artificial = Synthetic

From my perspective, the terms "artificial" and "synthetic" convey the same engineering meaning relative to machine learning. The intelligence of a computer system is derived from a human interpretation, a model, of how one believes the learning process evolves. Since this process is bound by this constraint, any derived knowledge can be no better than the underlying algorithm that supports the process. To this end, the application has synthesized the function of the human brain's cognitive capabilities as modeled by the algorithm. The term "artificial" or "synthetic" simply implies that the learning was done by machine.

Furthermore, I argue that Eliza, or any system that implements a similar fundamental concept, is an example of how computers can be programmed to mimic the behavior of a human. I think it is very unfair to claim that Eliza is an AI system. Eliza was simply a program to dupe naive people into believing that the computer was performing an "initial psychiatric interview" with them. It did not offer any form of learning, but rather searched for key words that could be rephrased to answer or create new questions that the user would deem plausible. I view Eliza as a teaser for the capabilities that machine learning could offer in the future. Interestingly, I always considered Eliza to be a condemnation of psychotherapy. Based upon comments raised by my blog colleague in our offline discussions, I now recognize the adverse impact it has had on the perception of AI.

As an engineer, I believe that a fair example of an AI system is a linear/non-linear control mechanism. Such systems are used in various applications and, using real-time data, adapt system performance to ensure that functionality and stability are maintained. A simple example of such a system is a one-step-ahead controller. Examples of these systems can be found aboard ships, in machine-based automation, etc.
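As a toy illustration (the plant model and numbers are invented for the example), a one-step-ahead controller simply inverts a model of the system to pick the input that should hit the target on the next step:

```python
# Toy one-step-ahead controller for the scalar model y[k+1] = a*y[k] + b*u[k].
a, b = 0.9, 0.5              # made-up plant parameters
setpoint = 10.0
y = 0.0

for k in range(5):
    u = (setpoint - a * y) / b   # choose the input that puts y on the setpoint next step
    y = a * y + b * u            # the "plant" responds
    print(k, round(u, 2), round(y, 2))
```

Real controllers also estimate the model parameters from live data, which is where the adaptation comes in.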

From many discussions that I have had with my academic peers in sociology and psychology regarding machine learning, I recognize the ongoing philosophical debate that surrounds it (i.e. artificial vs. synthetic). The bottom line is that until the cognitive function of the human brain is mapped, machine intelligence will remain an interpretative representation of this function. Philosophically, I believe that Artificial Intelligence is Synthetic Intelligence.

Monday, July 14, 2008

Artificial vs. Synthetic

Recently around the office there was some debate on the use of artificial intelligence vs. synthetic intelligence. Artificial intelligence implies that the intelligence is not genuine. Synthetic intelligence suggests the intelligence is not just an imitation but is a genuine form of intelligence that was specifically created.

There is an all-too-short article on the subject at Wikipedia. I wish it were quite a bit longer. It's a fresh and healthy debate. While it is philosophical, the whole nature of what we are building is based upon philosophy.

I think for now I am on the synthetic intelligence end of things. The systems I'm seeing are able to perform intellectual tasks that I would do. They don't do them in the same way I would, but they produce similar if not better results, and they certainly do it faster than I do.

Friday, July 11, 2008

Business TN hack job on Digital Reasoning

So I got forwarded this article from Business TN about the company I work for. At first the terrible journalism standards in it made me angry; I used to run a blog that followed the awful and sensational journalism in the Natalee Holloway case. Then I had to sit back and laugh.

My favorite line is an uncited quote about the CEO's mother (who is a board member) "calling all the shots." OK, let's deconstruct this. First, the poor woman has not been well for many months and has been in treatment. While she has recovered now, she pretty much hasn't called any shots in the last year. Second, let's say for a second she was quite influential. What exactly is the problem? Is it that she is a woman? Is it that she is a mother? Really? Come on, how old school is that? It really goes a long way toward proving Tim's original assertion that venture capital in Middle Tennessee is not really capable of evaluating a good thing when they see it and is driving good business ideas out of this state.

Frankly, if Tim moves the shop I would follow. DRS has great technology that has been proven on the field of battle; it needs to be in the hands of civilians now. While we will eventually develop something out of what we are doing with the government, that is a slow process. I have to shrug my shoulders. If that uncredited person actually did say the problem was Tim's mother, then there really is no helping the local VC crowd, as they are hopelessly short-sighted and frankly misogynistic. It is just as well they weren't named, because any female entrepreneurs in the state who read their comments would instantly cross them off the list of people they would trust.

The fact that the quotes go uncited, however, makes me question the entire article. I wonder how much of it is really just the invention and belief of the author. Now, you would be right to ask where I get the right to criticize him when I blog. Well, for one, I know bad journalism when I see it. Second, I have left the comments on this blog open, so if you want to call BS on something I wrote I would gladly engage. There are a lot of things in this blog that are opinion and not fact, and sometimes I am not as careful to distinguish between them in my writing as I should be.

I'd be glad to hear from anyone in the NLP field who also finds that the VCs in their area don't get what they do and don't invest in innovative startups. I am sure it is not just Middle Tennessee. When you map the locations of the top 30% of data mining/text analytics companies, they are, not surprisingly, centered on a short list of hubs.

Thursday, July 10, 2008

Measurement in NLP Development

From the blog post at Digital Reasoning Systems.

I blogged about a cooperative effort between development and QoA that I participated in at my day job at Digital Reasoning. I bring it up here on this blog because it is related to NLP, in that an NLP system was developed with the standard measure used to gauge its success.
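That post doesn't spell out which measure we used, but for extraction-style NLP work the standard ones are precision, recall, and F1; here is a minimal sketch of how they come out of raw counts (the numbers are invented):

```python
# Minimal precision/recall/F1 from raw counts (the counts here are made up).
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1(tp=85, fp=10, fn=20))
```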

In talking to Dr. Kaufman (my co-author on this blog) about it, she said that there is a name for that kind of development: the Spiral Design Process. If you have not already visited the DRS blog you should check it out. It is gaining steam, with several authors contributing to it. I get to work with some really interesting people, so it is nice to see my articles up with theirs.

I think one of the most important lessons is that you should measure early and often. When I was a kid back in New England, my grandfather was a general contractor. During high school summers he would occasionally hire me to help do finishing work on homes he was building or renovating. He had two pieces of advice for me. One of them was "measure twice, cut once."

Oh, and the other was: hold the hammer by the lower part of the handle for a $5-an-hour wage, or up near the head for $0.50 an hour.