Thursday, July 9, 2009

Thoughts about Morphological Analysis

This morning I was thinking about the process of morphological analysis. In natural language processing the goal of this analysis is to classify each token into a type of morpheme; that classification is metadata for the token that aids in determining its actual part of speech. Advanced parsers offer noun-grouping and occasionally a priori entity identification.

The question came to my mind, how am I doing this process in my head? I believe that the process is ongoing, token by token. I don't read the entire sentence then perform a careful analysis and then conclude the meaning. I am getting the meaning as I read, token by token (sort of). Try this experiment:

  • First, slowly read a sentence one word at a time. If you can, cover the sentence up so that you can't see what is coming and reveal each token one at a time.

  • Then read another sentence you have not seen yet from the middle and guess the meaning. Again try to cover portions of it.

What I've found is that with the first method, things make sense. I don't have any trouble understanding the sentence even if I read very slowly. With the second method I have a lot of trouble guessing the meaning of many tokens and end up reading back to the beginning of the current clause. My hunch is that a random access approach to morphology is therefore suspect, and that understanding a sentence in fact requires treating it as a stream, where the decision process starts with the first token and ends with the last. The logical order of the text is the key to understanding it. If you come across the word POLISH, its meaning (noun (surface cleaner), verb (applying surface cleaner), proper noun (Polish Sausage)) can be discerned from the words preceding it. A statistical analysis of how often POLISH is used as a verb or a noun may be helpful but is not a great starting place, as that bias is meaningless if the context suggests otherwise. So why introduce bias at all in the first place?

I consider the problem similar to Texas Hold'em. You get a hand with two cards. Is it a winning hand? You can't know for sure at the start of the round, even with a pair of Aces. Cards are revealed over time and these help you determine the absolute strength of your hand. As we read a sentence there is an expected flow. This expected flow goes beyond diction. It is beyond spelling. It allows us to understand neologisms.

Take for example the LOL Cats. The captions are perfect examples of sentences that break the rules of language so profoundly that they make parents and English teachers cringe. Shakespeare himself spins in his grave. In spite of these critics there are many fans who view the pictures and laugh at the words; a joke having been communicated to them quite clearly. The tokens have chaotic properties given that there are no rules for their formation. They are impressionistic and often derived from slang to start with. A statistical analysis of tokens like "haz" and "mezzzzzzd" or "caturday" likely will produce just about nothing useful. (Though if you, dear reader, have a process that can make sense of tokens like "caturday" using statistical methods I would LOVE to know about it and blog on that achievement.)

Let's take a look at an example where there are many words not familiar to English speakers.

Ne proviso al vi trezorojn sur la tero

In the above clause we have an extreme example of neologism. It is, in fact, written in Esperanto. It is well formed and logical and yet I cannot read it. I can sound out the words and guess at what they may relate to in English but I can't be sure. Is "ne" equivalent to "no"? Let's take an example that has no obvious equivalent in a Romance language.

ichi go ichi e

The quote is in Japanese and means "one meeting, one chance." The reason I can't figure these out is because there is nothing grounding my reading. I have to know something structurally about the sentence that will clue me into what it means and what each token represents. So, even though there are words that are unfamiliar to me in the LOL Cats captions there are familiar anchors upon which I can begin to divine their meaning. Further, the words, while stylized, do have a connection back to the language in that often they sound like a word that would fit into the sentence at that point. So what we are seeing is not a random collection of tokens but a hidden order or pattern. The reason why is that language is about transmitting concepts and relationships. Objects and actions and their relationship are what we are communicating no matter the language or how abused the syntax is. There are people who have a skill at naturally understanding non-standard use of language. Often language rules are broken severely in music.

I often bring up Rap music because it is where the rules of language are most severely tested and where the most interesting analysis is for me (even though I don't particularly like the music or the message). Frankly, if computer software gets to the point of decrypting Rap (aka "hip hop") that will be a major milestone in my humble opinion. The structure is often dictated by a need for vocal rhythm, as this kind of music is sung not to a pitch but to a beat provided by the background music. Taken out of the context of the rhythm, rap lyrics are difficult to analyze. Often what is needed from rap lyrics is a rapid projection of sentence fragments that are in sequence, and that sequence itself gives the relationship between the concepts and thus the ultimate meaning. Here is a humorous "translation" of a rap song that has been floating around the Internet for a long time. The lyrics are from the artist known as The Notorious B.I.G., from his song "One More Chance."

First things first, I poppa, freaks all the honeys
Dummies - playboy bunnies, those wantin’ money

And the suggested translation...

As a general rule, I perform lewd acts with women of all kinds, including but not limited to those with limited intellect, nude magazine models, and prostitutes.

While I cannot condone the treatment of women by The Notorious B.I.G. in his music, it does give us an interesting problem for analysis. The point here is that the mechanics are more important than the actual meaning. The analysis transcends the creative use of language and the invention of new grammatical structures and neologisms such as "poppa" and the colloquialism "wantin'" along with the fragmentary clauses. The analysis depends upon an expected structure with enough grounding back into the original language to provide familiarity that aids in the decryption. Upon first hearing the song one might not know what "freaks" means in the context of this sentence but structurally we are hoping for a verb! True understanding of the sentence requires cultural knowledge. Mechanical understanding, however, does not, and that is what we are expecting to use natural language processing for.

Going back to my original concept we can look at a word like "freaks" and see statistically it is most likely a noun-plural. Most morphological analysis stops right there. WordNet will tell you it means "addict, monster or to lose one's nerve" which is not what the Big Poppa was trying to say. WordNet is not down with the rap!

Without knowing what the cultural usage means I do see that Poppa --> Freaks --> (honeys, dummies, playboy bunnies, those wantin' money). The relationship between subject(s) and object through the verb are more important than the actual meaning of the words and go beyond what I would call Standard English. From a computer science perspective I see five entities related through one action. Four of the entities are grouped and related by being set members.
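To make the "five entities related through one action" idea concrete, here is a minimal sketch of how that relationship might be stored. The `Relation` class and its field names are my own invention for illustration and do not come from any particular NLP library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One action linking a subject to a set of grouped objects (hypothetical structure)."""
    subject: str
    action: str
    objects: frozenset


# Poppa --> Freaks --> (honeys, dummies, playboy bunnies, those wantin' money)
relation = Relation(
    subject="Poppa",
    action="freaks",
    objects=frozenset({"honeys", "dummies", "playboy bunnies", "those wantin' money"}),
)


def entity_count(rel):
    """Count the subject plus every member of the object set."""
    return 1 + len(rel.objects)
```

Nothing in this structure needs to know what "freaks" means. It only records who is related to whom through which action, which is the mechanical understanding I am after.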

My conclusion from this train of thought is that analysis, if on a token by token basis, must not be some random analysis of each token but one that starts at the beginning of the sentence and heavily determines the nature of the current token by its neighbors, which were also determined this way.

I believe the first word in a sentence is a special case. Going back to the Texas Hold'em analogy, this word is your initial hand. You really don't have a lot of information. The better constructed the sentence and the greater your understanding of the language, the more likely you are to understand the use of the word. If the first word is "Polish", which is naturally capitalized at the start of a sentence, we have no idea which sense it may be (and even if statistically it is most likely to be a verb, how does that help us?). If the next word is "the", however, we have nailed it as a verb. This is because "the" is one of those grounding terms. If the third word is "xyzzy" we have nothing a priori to tell us what it is. However, because of the previous word we are certain it is either a noun or the start of a noun group. This analysis hinges on prior discovery in the context. If it was in the middle of a noun group, looking at the left and right window might also produce no further useful evidence if the other terms are equally unique and not previously categorized. Starting from the beginning of the sentence, however, does give us a clue. We already have a verb so we are looking for a subject.
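The "Polish the xyzzy" walk-through can be sketched as a tiny left-to-right tagger. This is only an illustration under my own assumptions: the `GROUNDING` table, the tag names and the `tag_stream` function are hypothetical, and the single rule for resolving the first word is far cruder than anything a real system would use.

```python
# Hypothetical "grounding" words whose role is assumed to be known in advance.
GROUNDING = {"the": "determiner", "a": "determiner", "an": "determiner"}


def tag_stream(tokens):
    """Tag tokens in reading order; each decision uses only what came before it."""
    tags = []
    for token in tokens:
        word = token.lower()
        prev = tags[-1] if tags else None
        if word in GROUNDING:
            tag = GROUNDING[word]
            # A determiner right after an unresolved first word settles it as a verb.
            if len(tags) == 1 and tags[0] == "unknown":
                tags[0] = "verb"
        elif prev == "determiner":
            # Whatever follows a determiner is a noun or the start of a noun group.
            tag = "noun-or-noun-group"
        else:
            # No statistical prior is applied: without context we stay undecided.
            tag = "unknown"
        tags.append(tag)
    return list(zip(tokens, tags))


print(tag_stream(["Polish", "the", "xyzzy"]))
```

Note that "xyzzy" gets a usable label even though it appears in no dictionary, purely because of what was decided before it.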

So many systems treat these as edge cases and as "noise" but I think they actually point in the direction systems must take. Language is living and growing and changing. Systems have to expect that and be built with that in mind. When you look at language defining corpora like TreeBank with over 100 different parts of speech you have to wonder if this level of detail is useful because what ends up happening is the parts of speech become bound to a domain, when in fact actual usage is much more dynamic and unbounded. My conclusion is that morphology has to be macro in detail. It has to not be concerned with a priori assumptions based upon statistical use, but at the same time it needs some grounding in structure when dealing with general input. Clearly this is not an approach that will handle everyone's needs. I think for General NLP, however, this philosophy will provide avenues for a variety of robust approaches.

Tuesday, April 14, 2009

David Merrill demos Siftables | Video on

This is one of the coolest demos I've seen in a long time. Imagine using this kind of technology with machines that move themselves - you'd have not several independent robots but one large, disconnected robot. The idea is really incredible. Imagine that the range is increased a bit from touching to a few feet and now you have extended chains where each link can react to local stimuli and then as a group be able to define a group purpose based on the collected information.

I hope they do a lot more with this technology. It was wonderful seeing it in action.


Sunday, March 8, 2009

Measuring NLP is like Marksmanship

There is an interesting correlation between shooting and NLP improvement - both require mathematical analysis in order to be improved upon. In the case of marksmanship the emphasis is the removal of bias. In NLP the emphasis is the harmonic improvement of recall and precision.

Recall in entity extraction is the finding of correct items in a category. In other aspects of NLP it is the correct networking of terms. In marksmanship one could say that recall is putting the bullet into the X ring - though you still score for being close. Hence the emphasis is really not about hitting the target (it is rare to not hit the target) but more the quality of the hit.

Just for fun let's look at some numbers. These numbers came from my shooting at 100 yards without a brace using a carbine. 2 shots were "flyers" as in totally off the paper. Those are in the 2nd quadrant. I list the quadrants as 1st (upper left), 2nd (upper right), 3rd (lower right) and 4th (lower left). The numbers show I am biased up a bit. This is actually not too unexpected as I know that with the trajectory of the bullets I use and the range to the target they are still rising to their apogee. From the numbers I see I am shooting a little to the right as well.

The other number of value is my average score. I got an 8.88, which means my shots were within the same space as a cantaloupe. Ideally I'd like to be hitting something the size of a baseball. That would probably happen if I used a brace.

Marksmen use other measures as well. Grouping is important. In NLP that would be analogous to clustering.

The point I am trying to make is that the marksman uses a lot more numbers to refine his process. Why aren't we doing the same with NLP? My first forays into the world of marksmanship were without knowledge of the math behind the shooting, and when I had groups the size of a medium pizza at 100 yards I wanted to improve my skill. All of these numbers help me to diagnose a wide range of problems. I learned I was able to tell where my problems were (breathing, stance, grip, trigger pull, etc.).

We seem to have a lack of these controls in NLP at the moment. I have been an advocate of using f-measure in the development arena early on so that algorithmic changes can be evaluated. I am also an advocate of having a running f-measure as one trains an entity extraction system. However, this is just a first step. I now advocate that we start to develop better controls through stronger math to help the scientist, developer or trainer to better understand their NLP system and make corrections.
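For readers who have not computed it before, here is a minimal sketch of the f-measure (the balanced F1 form, the harmonic mean of precision and recall). The function name and the counts in the usage line are made up for illustration and are not results from any real system.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from raw extraction counts."""
    found = true_positives + false_positives      # everything the system extracted
    expected = true_positives + false_negatives   # everything it should have extracted
    precision = true_positives / found if found else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: 8 correct entities, 2 spurious ones, 8 missed.
precision, recall, f1 = f_measure(8, 2, 8)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recomputing this after every training batch gives the running f-measure mentioned above.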

This will have further benefits when the system in question is being automatically trained by another software system. Automating the process would require a strong understanding but would have the advantage of removing random bias that humans tend to add to training.

Target Scoring
March 8th, 2008

Quadrant    0    7    8    9   10    X   Total   Score   Avg. Score
1st Quad    0    0    4    5    2    0      11      97         8.82
2nd Quad    2    1    2    4    0    1      10      79         7.90
3rd Quad    0    0    2    1    4    0       7      65         9.29
4th Quad    0    0    0    1    1    0       2      19         9.50
Totals      2    1    8   11    7                  260         8.88

Bias Up     12
Bias Left   -4
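The two bias figures can be reproduced from the shots-per-quadrant counts in the "Total" column of the table above. This is a sketch of one way to get them, assuming bias is simply a difference of shot counts between halves of the target; the dictionary keys and the `bias` function are names I made up for the example.

```python
# Shots per quadrant, copied from the "Total" column of the scoring table.
shots = {
    "upper_left": 11,   # 1st quadrant
    "upper_right": 10,  # 2nd quadrant
    "lower_right": 7,   # 3rd quadrant
    "lower_left": 2,    # 4th quadrant
}


def bias(shots):
    """Return (bias_up, bias_left) as differences in shot counts between target halves."""
    up = (shots["upper_left"] + shots["upper_right"]) - (
        shots["lower_left"] + shots["lower_right"]
    )
    left = (shots["upper_left"] + shots["lower_left"]) - (
        shots["upper_right"] + shots["lower_right"]
    )
    return up, left


print(bias(shots))  # (12, -4): high, and slightly to the right
```

A negative "left" value means the group sits to the right, which matches what I see on the paper.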

Wednesday, February 4, 2009

Open Source Text Analytics

Seth Grimes has a new article up worth reading. My friend Bill Day pointed this one out to me this morning. It is a great read for those looking to delve into data mining using open source tools. He is quite honest saying up front that these lack a bit of polish and that of course is to be expected. That said - the price can't be beat!

Give them a look!


Tuesday, November 18, 2008

Can you use adjectives to infer nouns in entity extraction?

I will limit this case to location extraction since this is a common case and one where I seem to find a lot of confusion. To give away the punch line the definitive answer is no, you can't use adjectives. The full answer though is more subtle than that. Read on, read on.

The use of adjectives for entity extraction looking for locations has many flaws. Semantically, a location can only be a noun. The theory is that in a sentence such as "The ship was beached along the Boston Coastline" that the word Boston is the location. Boston in this context is an adjective. There are two cases where this is provably bad and incorrect and one case where it just simply is misleading.

Case 1: Location Name has Multiple Meanings:

There are location names that are perfectly valid that have additional common meaning. Bad, Iran or Train, Germany or Hit, Iraq or Car, Romania are just 4 examples of hundreds worldwide. If we take the adjective case then when we find "train station" or "bad karma" or "I'd hit that" (ok that's a predicate, you got me - I just wanted to use that phrase!) or "car wash" then we can infer locations. This is clearly absurd. For names like Boston that seem to be unique we find that they have additional meaning. Boston is a transliteral variant of the Uzbek word Baystan. What it means I have no clue.

Case 2: Adjective looks like a location but isn't

There are many of these. Boston cream pie, Boston Bruins, Boston Marathon, New York minute, etc. No discussion really needed here.

Case 3: Adjective is sort of a location but is misleading

Boston Marathon also falls in this example. The Boston Marathon does indeed terminate in Boston. In fact less than 5% of the race is in the city of Boston. The other 95+% runs through many places. Along the way you find another misleading place called Boston College at the top of Heartbreak Hill. The problem there is that Boston College is not in Boston. It is in Chestnut Hill.

Further, these types can be used as locations. Ex. I am going to the Boston Marathon or I am heading to Boston City Hall. They are destinations - but they are more specific than just the city.

What is the correct method of dealing with these cases? A good semantic system is not only aware of the parts of speech of individual words, it also recognizes important collocations of words. So instead of seeing:

The: determiner
Boston: adjective
Coastline: noun
is: predicate
beautiful: modifier
.: end of sentence

It would see

The: determiner
Boston Coastline: noun
is: predicate
beautiful: modifier
.: end of sentence

In an ideal solution Boston Coastline would then be in your gazetteer with correct coordinate representation. While the software I work on does not handle linear or polygonal targets and could not represent even a bounding box for this type of target today, it would not be terribly difficult to implement. It is a simple matter of programming. Entity extraction software that correctly groups terms like "Boston Coastline" is advanced. That is what our software does. That is a significant problem to solve. Once done it is simply a matter of having a properly populated gazetteer and an ability to distinguish in the database between point and area features.
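The grouping step can be illustrated with a small sketch. This is not how the software I work on does it. It is a toy longest-match lookup against a hypothetical gazetteer of multi-word names, and `GAZETTEER`, `MAX_LEN` and `group_collocations` are all names invented for the example.

```python
# Hypothetical gazetteer of multi-word place names (coordinates omitted).
GAZETTEER = {("boston", "coastline"), ("boston", "city", "hall")}
MAX_LEN = max(len(entry) for entry in GAZETTEER)


def group_collocations(tokens):
    """Merge runs of tokens that match a gazetteer entry into a single token."""
    grouped = []
    i = 0
    while i < len(tokens):
        # Try the longest possible window first, shrinking down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in GAZETTEER:
                grouped.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word match starts here, so keep the single token.
            grouped.append(tokens[i])
            i += 1
    return grouped


print(group_collocations(["The", "Boston", "Coastline", "is", "beautiful", "."]))
```

Once "Boston Coastline" is a single token, the part-of-speech step can label it as a noun and the adjective question never comes up.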

Finally, a good semantic solution looks not only at the words but at how they are used in context.

I like to travel to Train, Germany.


I like to train in Germany.

In the first case we definitely are talking about Train as a noun and location. In the second case it is a predicate and not a location. Imagine further that this is all caps data and it becomes an even more difficult problem.
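As a rough illustration of using context rather than the word alone, here is a sketch that accepts an ambiguous name as a location only when it is followed by a comma and a known country. The word lists and the `is_location` function are hypothetical, and the single rule is far too narrow for real use. It ignores capitalization entirely, so it behaves the same on all-caps data.

```python
# Hypothetical lookup tables; a real system would draw these from a gazetteer.
AMBIGUOUS_PLACES = {"train", "bad", "hit", "car"}
COUNTRIES = {"germany", "iran", "iraq", "romania"}


def is_location(tokens, i):
    """Guess whether tokens[i] is a place name from the two tokens that follow it."""
    if tokens[i].lower() not in AMBIGUOUS_PLACES:
        return False
    following = [t.lower() for t in tokens[i + 1:i + 3]]
    # Pattern assumed here: "<place> , <country>" as in "Train, Germany".
    return len(following) == 2 and following[0] == "," and following[1] in COUNTRIES


first = ["I", "like", "to", "travel", "to", "Train", ",", "Germany", "."]
second = ["I", "like", "to", "train", "in", "Germany", "."]
print(is_location(first, 5), is_location(second, 3))
```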

So avoid the trap of thinking that placename labels are useful without having an understanding of their groupings, parts of speech and context.

Monday, August 18, 2008

Getting rid of Google from news posts

I just noticed that Google was very prominent in the news bar at the left. However, the articles weren't very relevant to data mining, text analytics, unstructured data analytics or robotics. So I tried to add "-google" to the list. Guess what - no joy. Then I changed the query to the company I work for, "digital reasoning systems", and guess what, Google comes up! Then I tried "Kiva Systems" and they come up again. Put in robotics and they go away. Put in an excluder for any term in the headline of one of the returned results and it goes away. Google is cheating on their own system to make it so you can't exclude news about their company even though it is irrelevant! They also have decided to associate themselves with two unrelated companies. I don't know, but it smells dishonest to me. Maybe someone can explain to me why these searches come up with Google so prominently.

Startup Kiva's New (Robotic) Approach to Order Fulfillment - Brightcove

The Kiva CEO talks about warehouse automation. It's a very interesting interview. I didn't get to meet him when I visited Kiva but perhaps on my next visit. The video nicely shows the drive units at work. While they look small and weak, trust me, they will take your leg in a second if you wander onto their pathways! There are safety devices of course but I'm not willing to trust life and limb to them! I love these systems.

One other thing you get to see is the software they use to help the robots get organized. The whole system isn't shown and I don't think you could casually show it in such an interview. I really enjoyed using it and it clearly has evolved nicely.