Machine Uplift: 2009

This morning I was thinking about the process of morphological analysis. In natural language processing the goal of this analysis is to classify each token into a type of morpheme, which is metadata for the token that aids in determining its actual part of speech. Advanced parsers offer noun grouping and occasionally a priori entity identification.

The question came to my mind, how am I doing this process in my head? I believe that the process is ongoing, token by token. I don't read the entire sentence then perform a careful analysis and then conclude the meaning. I am getting the meaning as I read, token by token (sort of). Try this experiment:

First, slowly read a sentence one word at a time. If you can, cover the sentence up so that you can't see what is coming and reveal each token one at a time.

Then read another sentence you have not seen yet from the middle and guess the meaning. Again try to cover portions of it.

What I've found is that with the first method, things make sense. I don't have any trouble understanding the sentence even if I read very slowly. With the second method I have a lot of trouble guessing the meaning of many tokens and end up reading back to the beginning of the current clause. My hunch is that a random-access approach to morphology is therefore suspect. In fact, understanding a sentence requires that it be treated as a stream, with a decision process that starts with the first token and ends with the last. The logical order of the text is the key to understanding it. If you come across the word POLISH, its meaning (noun (surface cleaner), verb (applying surface cleaner), proper noun (Polish sausage)) can be discerned from the words preceding it. A statistical analysis of how often POLISH is used as a verb or a noun may be helpful but is not a great starting place, as that bias is meaningless if the context suggests otherwise. So why introduce bias at all in the first place?

I consider the problem similar to Texas Hold'em. You get a hand with two cards. Is it a winning hand? You can't know for sure at the start of the round, even with a pair of Aces. Cards are revealed over time and these help you determine the absolute strength of your hand. As we read a sentence there is an expected flow. This expected flow goes beyond diction. It is beyond spelling. It allows us to understand neologisms.

Take for example the LOL Cats. The captions are perfect examples of sentences that break the rules of language so profoundly that they make parents and English teachers cringe. Shakespeare himself spins in his grave. In spite of these critics there are many fans who view the pictures and laugh at the words; a joke having been communicated to them quite clearly. The tokens have chaotic properties given that there are no rules for their formation. They are impressionistic and often derived from slang to start with. A statistical analysis of tokens like "haz" and "mezzzzzzd" or "caturday" likely will produce just about nothing useful. (Though if you, dear reader, have a process that can make sense of tokens like "caturday" using statistical methods I would LOVE to know about it and blog on that achievement.)

Let's take a look at an example where there are many words not familiar to English speakers.

Ne proviso al vi trezorojn sur la tero

In the above clause we have an extreme example of neologism. It is, in fact, written in Esperanto. It is well formed and logical and yet I cannot read it. I can sound out the words and guess at what they may relate to in English but I can't be sure. Is "ne" equivalent to "no"? Let's take an example that has no equivalent in the Romance languages.

ichi go ichi e

The quote is in Japanese and means "one meeting, one chance." The reason I can't figure these out is that there is nothing grounding my reading. I have to know something structurally about the sentence that will clue me in to what it means and what each token represents. So, even though there are words that are unfamiliar to me in the LOL Cats captions, there are familiar anchors from which I can begin to divine their meaning. Further, the words, while stylized, do have a connection back to the language in that they often sound like a word that would fit into the sentence at that point. So what we are seeing is not a random collection of tokens but a hidden order or pattern. This is because language is about transmitting concepts and relationships. Objects and actions and their relationships are what we are communicating, no matter the language or how abused the syntax is. There are people who have a skill at naturally understanding non-standard use of language. Often language rules are broken severely in music.

I often bring up Rap music because it is where the rules of language are most severely tested and where the most interesting analysis is for me (even though I don't particularly like the music or the message). Frankly, if computer software gets to the point of decrypting Rap (aka "hip hop") that will be a major milestone in my humble opinion. The structure is often dictated by a need for vocal rhythm, as this kind of music is sung not to a pitch but to a beat provided by the background music. Taken out of the context of the rhythm, rap lyrics are difficult to analyze. Often what rap lyrics deliver is a rapid projection of sentence fragments in sequence, and that sequence itself gives the relationship between the concepts and thus the ultimate meaning. Here is a humorous "translation" of a rap song that has been floating around the Internet for a long time. The lyrics are from the artist known as The Notorious B.I.G., from his song "One More Chance."

First things first, I poppa, freaks all the honeys
Dummies - playboy bunnies, those wantin’ money

And the suggested translation...

As a general rule, I perform lewd acts with women of all kinds, including but not limited to those with limited intellect, nude magazine models, and prostitutes.

While I cannot condone the treatment of women by The Notorious B.I.G. in his music, it does give us an interesting problem for analysis. The point here is that the mechanics are more important than the actual meaning. The analysis transcends the creative use of language and the invention of new grammatical structures and neologisms such as "poppa" and the colloquialism "wantin'" along with the fragmentary clauses. The analysis depends upon an expected structure with enough grounding back into the original language to provide familiarity that aids in the decryption. Upon first hearing the song one might not know what "freaks" means in the context of this sentence but structurally we are hoping for a verb! True understanding of the sentence requires cultural knowledge. Mechanical understanding, however, does not, and that is what we are expecting to use natural language processing for.

Going back to my original concept we can look at a word like "freaks" and see statistically it is most likely a noun-plural. Most morphological analysis stops right there. WordNet will tell you it means "addict, monster or to lose one's nerve" which is not what the Big Poppa was trying to say. WordNet is not down with the rap!

Without knowing what the cultural usage means I do see that Poppa --> Freaks --> (honeys, dummies, playboy bunnies, those wantin' money). The relationship between subject(s) and object through the verb is more important than the actual meaning of the words and goes beyond what I would call Standard English. From a computer science perspective I see five entities related through one action. Four of the entities are grouped and related by being set members.
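As a rough illustration of the "five entities related through one action" view, here is a minimal Python sketch. The `Relation` class and its field names are hypothetical, invented for this example; the words are stored as opaque labels, so no dictionary meaning is needed to capture the structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One action linking a subject to a set of objects (hypothetical type)."""
    subject: str
    action: str
    objects: frozenset

    def entities(self):
        """All entities taking part in the relation, subject first."""
        return [self.subject] + sorted(self.objects)


# The lyric reduced to structure: one subject, one action, four set members.
lyric = Relation(
    subject="Poppa",
    action="freaks",
    objects=frozenset(
        {"honeys", "dummies", "playboy bunnies", "those wantin' money"}
    ),
)

print(len(lyric.entities()))  # 5 entities in total
print(len(lyric.objects))     # 4 of them grouped as set members
```

Using a set for the objects mirrors the observation that the four are related only by membership, not by order.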

My conclusion from this train of thought is that analysis, if done on a token-by-token basis, must not be some random analysis of each token but one that starts at the beginning of the sentence and heavily determines the nature of the current token by its neighbors, which were also determined this way.

I believe the first word in a sentence is a special case. Going back to the Texas Hold'em analogy, this word is your initial hand. You really don't have a lot of information. The better constructed the sentence and the greater your understanding of the language, the more likely you are to understand the use of the word. If the first word is "Polish", which is naturally capitalized at the start of a sentence, we have no idea which sense it may be (and even if statistically it is most likely to be a verb, how does that help us?). If the next word is "the", however, we have nailed it as a verb. This is because "the" is one of those grounding terms. If the third word is "xyzzy" we have nothing a priori to tell us what it is. However, because of the previous word we are certain it is either a noun or the start of a noun group. This analysis hinges on prior discovery in the context. If it were in the middle of a noun group, looking at the left and right window might also produce no further useful evidence if the other terms are equally unique and not previously categorized. Starting from the beginning of the sentence, however, does give us a clue. We already have a verb so we are looking for a subject.
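The "Polish the xyzzy" walk-through can be sketched as a tiny left-to-right tagger. This is only a toy under stated assumptions: the `DETERMINERS` set, the tag names, and the two rules are made up for illustration, and a real system would need far more grounding terms and rules. The point is that each decision uses only what has already been read, and that an unknown word like "xyzzy" still gets a structural role.

```python
# Assumed "grounding" words; a real list would be much longer.
DETERMINERS = {"the", "a", "an"}


def tag_stream(tokens):
    """Tag tokens left to right, using only decisions already made."""
    tagged = []  # list of [token, tag]; the tag "?" means still undecided
    for token in tokens:
        word = token.lower()
        prev = tagged[-1] if tagged else None
        if word in DETERMINERS:
            tag = "DET"
            # A determiner right after an undecided sentence-initial word
            # settles that word as a verb ("Polish the ...").
            if prev is not None and prev[1] == "?" and len(tagged) == 1:
                prev[1] = "VERB"
        elif prev is not None and prev[1] == "DET":
            # Whatever follows a determiner is a noun or opens a noun group,
            # even if the word itself has never been seen before.
            tag = "NOUN-GROUP"
        else:
            tag = "?"
        tagged.append([token, tag])
    return [tuple(pair) for pair in tagged]


print(tag_stream(["Polish", "the", "xyzzy"]))
# [('Polish', 'VERB'), ('the', 'DET'), ('xyzzy', 'NOUN-GROUP')]
```

Note that no frequency statistics are consulted: "Polish" stays undecided until the stream supplies a grounding word.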

So many systems treat these as edge cases and as "noise" but I think they actually point in the direction systems must take. Language is living and growing and changing. Systems have to expect that and be built with that in mind. When you look at language defining corpora like TreeBank with over 100 different parts of speech you have to wonder if this level of detail is useful because what ends up happening is the parts of speech become bound to a domain, when in fact actual usage is much more dynamic and unbounded. My conclusion is that morphology has to be macro in detail. It has to not be concerned with a priori assumptions based upon statistical use, but at the same time it needs some grounding in structure when dealing with general input. Clearly this is not an approach that will handle everyone's needs. I think for General NLP, however, this philosophy will provide avenues for a variety of robust approaches.

There is an interesting correlation between shooting and NLP improvement - both require mathematical analysis in order to be improved upon. In the case of marksmanship the emphasis is the removal of bias. In NLP the emphasis is the harmonic improvement of recall and precision.

Recall in entity extraction is the finding of correct items in a category. In other aspects of NLP it is the correct networking of terms. In marksmanship one could say that recall is putting the bullet into the X ring - though you still score for being close. Hence the emphasis is really not about hitting the target (it is rare to not hit the target) but more the quality of the hit.

Just for fun let's look at some numbers. These numbers came from my shooting at 100 yards without a brace using a carbine. 2 shots were "flyers", as in totally off the paper. Those are in the 2nd quadrant. I list the quadrants as 1st (upper left), 2nd (upper right), 3rd (lower right) and 4th (lower left). The numbers show I am biased up a bit. This is actually not too unexpected, as I know that with the trajectory of the bullets I use and the range to the target they are still rising to their apogee. From the numbers I see I am shooting a little to the right as well.
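The bias figures can be reproduced with a few lines of Python, assuming they are plain differences in shot counts between halves of the target. That assumption is mine, but it does match the 12 and -4 reported in the scoring table. The counts below are the per-quadrant shot totals from that table; the dictionary keys and the `bias` function are names chosen for this sketch.

```python
# Shots per quadrant, from the per-quadrant totals in the scoring table.
shots = {
    "upper_left": 11,   # 1st quadrant
    "upper_right": 10,  # 2nd quadrant (includes the two flyers)
    "lower_right": 7,   # 3rd quadrant
    "lower_left": 2,    # 4th quadrant
}


def bias(shots):
    """Return (bias_up, bias_left) as differences in shot counts."""
    up = shots["upper_left"] + shots["upper_right"]
    down = shots["lower_left"] + shots["lower_right"]
    left = shots["upper_left"] + shots["lower_left"]
    right = shots["upper_right"] + shots["lower_right"]
    return up - down, left - right


bias_up, bias_left = bias(shots)
print(bias_up, bias_left)  # 12 -4
```

A positive first number means more shots landed high; a negative second number means more shots landed to the right.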

The other number of value is my average score. I got an 8.88, which means my shots were within the same space as a cantaloupe. Ideally I'd like to be hitting something the size of a baseball. That would probably happen if I used a brace.

Marksmen use other measures as well. Grouping is important. In NLP that would be analogous to clustering.

The point I am trying to make is that the marksman uses a lot more numbers to refine his process. Why aren't we doing the same with NLP? My first forays into the world of marksmanship were without knowledge of the math behind the shooting, and when I had groups the size of a medium pizza at 100 yards I wanted to improve my skill. All of these numbers help me to diagnose a wide range of problems. I learned I was able to tell where my problems were (breathing, stance, grip, trigger pull, etc.).

We seem to have a lack of these controls in NLP at the moment. I have been an advocate of using f-measure in the development arena early on so that algorithmic changes can be evaluated. I am also an advocate of having a running f-measure as one trains an entity extraction system. However, this is just a first step. I now advocate that we start to develop better controls through stronger math to help the scientist, developer or trainer to better understand their NLP system and make corrections.
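A running f-measure of the kind described above might look like the following sketch. The `RunningFMeasure` class, its method names, and the sample entity sets are all hypothetical; it simply accumulates true positives, false positives and false negatives batch by batch and reports the standard F-beta score, which for beta = 1 is the harmonic mean of precision and recall.

```python
class RunningFMeasure:
    """Accumulate counts across batches and report precision, recall, F."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.tp = self.fp = self.fn = 0

    def update(self, predicted, gold):
        """Add one batch of extracted entities and the expected entities."""
        predicted, gold = set(predicted), set(gold)
        self.tp += len(predicted & gold)   # found and correct
        self.fp += len(predicted - gold)   # found but wrong
        self.fn += len(gold - predicted)   # missed

    @property
    def precision(self):
        found = self.tp + self.fp
        return self.tp / found if found else 0.0

    @property
    def recall(self):
        expected = self.tp + self.fn
        return self.tp / expected if expected else 0.0

    @property
    def f_measure(self):
        p, r = self.precision, self.recall
        b2 = self.beta ** 2
        denom = b2 * p + r
        return (1 + b2) * p * r / denom if denom else 0.0


meter = RunningFMeasure()
meter.update(predicted={"Polish", "Texas"}, gold={"Texas", "Esperanto"})
print(round(meter.f_measure, 2))  # 0.5
meter.update(predicted={"WordNet"}, gold={"WordNet"})
print(round(meter.f_measure, 2))  # 0.67
```

Printing the score after every batch gives the trainer a number to watch, much as a marksman watches group size from one string of shots to the next.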

This will have further benefits when the system in question is being automatically trained by another software system. Automating the process would require a strong understanding but would have the advantage of removing the random bias that humans tend to add to training.

Target Scoring, March 8th, 2008

Quadrant      0    7    8    9   10    X   Shots  Score  Avg. Score
1st Quad      0    0    4    5    2    0      11     97        8.82
2nd Quad      2    1    2    4    0    1      10     79        7.90
3rd Quad      0    0    2    1    4    0       7     65        9.29
4th Quad      0    0    0    1    1    0       2     19        9.50
Totals        2    1    8   11    7                 260        8.88

Bias Up: 12
Bias Left: -4

Thursday, July 9, 2009

Thoughts about Morphological Analysis

Tuesday, April 14, 2009

David Merrill demos Siftables | Video on TED.com

Sunday, March 8, 2009

Measuring NLP is like Marksmanship

Wednesday, February 4, 2009

Open Source Text Analytics