Thursday, July 9, 2009

Thoughts about Morphological Analysis

This morning I was thinking about the process of morphological analysis. In natural language processing, the goal of this analysis is to classify each token into a type of morpheme, which is metadata for the token that aids in determining its actual part of speech. Advanced parsers offer noun grouping and occasionally a priori entity identification.

The question came to my mind, how am I doing this process in my head? I believe that the process is ongoing, token by token. I don't read the entire sentence then perform a careful analysis and then conclude the meaning. I am getting the meaning as I read, token by token (sort of). Try this experiment:

  • First, slowly read a sentence one word at a time. If you can, cover the sentence up so that you can't see what is coming and reveal each token one at a time.

  • Then read another sentence you have not seen yet from the middle and guess the meaning. Again try to cover portions of it.

What I've found is that with the first method, things make sense. I don't have any trouble understanding the sentence even if I read very slowly. With the second method I have a lot of trouble guessing the meaning of many tokens and end up reading back to the beginning of the current clause. My hunch is that a random access approach to morphology is therefore suspect, and that understanding a sentence in fact requires treating it as a stream, with a decision process that starts with the first token and ends with the last. The logical order of the text is the key to understanding it. If you come across the word POLISH, its meaning (noun (surface cleaner), verb (applying surface cleaner), proper noun (Polish Sausage)) can be discerned from the words preceding it. A statistical analysis of how often POLISH is used as a verb or a noun may be helpful, but it is not a great starting place, as that bias is meaningless if the context suggests otherwise. So why introduce bias at all in the first place?

I consider the problem similar to Texas Hold'em. You get a hand with two cards. Is it a winning hand? You can't know for sure at the start of the round, even with a pair of Aces. Cards are revealed over time and these help you determine the absolute strength of your hand. As we read a sentence there is an expected flow. This expected flow goes beyond diction. It is beyond spelling. It allows us to understand neologisms.

Take for example the LOL Cats. The captions are perfect examples of sentences that break the rules of language so profoundly that they make parents and English teachers cringe. Shakespeare himself spins in his grave. In spite of these critics there are many fans who view the pictures and laugh at the words; a joke having been communicated to them quite clearly. The tokens have chaotic properties given that there are no rules for their formation. They are impressionistic and often derived from slang to start with. A statistical analysis of tokens like "haz" and "mezzzzzzd" or "caturday" likely will produce just about nothing useful. (Though if you, dear reader, have a process that can make sense of tokens like "caturday" using statistical methods I would LOVE to know about it and blog on that achievement.)

Let's take a look at an example where there are many words not familiar to English speakers.

Ne proviso al vi trezorojn sur la tero

In the above clause we have an extreme example of neologism. It is, in fact, written in Esperanto. It is well formed and logical and yet I cannot read it. I can sound out the words and guess at what they may relate to in English but I can't be sure. Is "ne" equivalent to "no"? Let's take an example that has no comparable equivalent in a Romance language.

ichi go ichi e

The quote is in Japanese and means "one meeting, one chance." The reason I can't figure these out is that there is nothing grounding my reading. I have to know something structurally about the sentence that will clue me in to what it means and what each token represents. So, even though there are words that are unfamiliar to me in the LOL Cats captions, there are familiar anchors from which I can begin to divine their meaning. Further, the words, while stylized, do have a connection back to the language, in that they often sound like a word that would fit into the sentence at that point. So what we are seeing is not a random collection of tokens but a hidden order or pattern. The reason is that language is about transmitting concepts and relationships. Objects and actions and their relationships are what we are communicating, no matter the language or how abused the syntax is. There are people who have a natural skill for understanding non-standard use of language. Often language rules are broken severely in music.

I often bring up Rap music because it is where the rules of language are most severely tested and where the most interesting analysis is for me (even though I don't particularly like the music or the message). Frankly, if computer software gets to the point of decrypting Rap (aka "hip hop"), that will be a major milestone in my humble opinion. The structure is often dictated by a need for vocal rhythm, as this kind of music is sung not to a pitch but to a beat provided by the background music. Taken out of the context of the rhythm, rap lyrics are difficult to analyze. Often what rap lyrics deliver is a rapid projection of sentence fragments in sequence, and that sequence itself gives the relationship between the concepts and thus the ultimate meaning. Here is a humorous "translation" of a rap song that has been floating around the Internet for a long time. The lyrics are from the artist known as The Notorious B.I.G., from his song "One More Chance."

First things first, I poppa, freaks all the honeys
Dummies - playboy bunnies, those wantin’ money

And the suggested translation...

As a general rule, I perform lewd acts with women of all kinds, including but not limited to those with limited intellect, nude magazine models, and prostitutes.

While I cannot condone the treatment of women by The Notorious B.I.G. in his music, it does give us an interesting problem for analysis. The point here is that the mechanics are more important than the actual meaning. The analysis transcends the creative use of language and the invention of new grammatical structures and neologisms such as "poppa" and the colloquialism "wantin'" along with the fragmentary clauses. The analysis depends upon an expected structure with enough grounding back into the original language to provide familiarity that aids in the decryption. Upon first hearing the song one might not know what "freaks" means in the context of this sentence but, structurally, we are hoping for a verb! True understanding of the sentence requires cultural knowledge. Mechanical understanding, however, does not, and that is what we are expecting to use natural language processing for.

Going back to my original concept we can look at a word like "freaks" and see statistically it is most likely a noun-plural. Most morphological analysis stops right there. WordNet will tell you it means "addict, monster or to lose one's nerve" which is not what the Big Poppa was trying to say. WordNet is not down with the rap!

Without knowing what the cultural usage means I do see that Poppa --> Freaks --> (honeys, dummies, playboy bunnies, those wantin' money). The relationship between subject(s) and object through the verb is more important than the actual meaning of the words and goes beyond what I would call Standard English. From a computer science perspective I see five entities related through one action. Four of the entities are grouped and related by being set members.
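To make the "five entities, one action" view concrete, here is a minimal sketch in Python. The Relation class and its field names are my own invention for illustration, not part of any NLP library; the point is only that the structure can be captured without knowing what "freaks" means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One action linking a subject to a group of objects (hypothetical type)."""
    subject: str
    action: str
    objects: frozenset  # the grouped entities, related by set membership

    def entities(self):
        """Return every entity taking part in the relation."""
        return {self.subject} | set(self.objects)


# The lyric reduced to its mechanics: Poppa --> freaks --> (four grouped entities).
lyric = Relation(
    subject="Poppa",
    action="freaks",
    objects=frozenset(
        {"honeys", "dummies", "playboy bunnies", "those wantin' money"}
    ),
)

print(len(lyric.entities()))  # 5 entities related through one action
```

Nothing in this structure depends on a dictionary sense for any of the words, which is exactly the kind of mechanical understanding I am after.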

My conclusion from this train of thought is that analysis, if on a token by token basis, must not be some random analysis of each token but one that starts at the beginning of the sentence and heavily determines the nature of the current token by its neighbors, which were also determined this way.

I believe the first word in a sentence is a special case. Going back to the Texas Hold'em analogy, this word is your initial hand. You really don't have a lot of information. The better constructed the sentence and the greater your understanding of the language, the more likely you are to understand the use of the word. If the first word is "Polish", which is naturally capitalized at the start of a sentence, we have no idea which sense it may be (and even if statistically it is most likely to be a verb, how does that help us?). If the next word is "the", however, we have nailed it as a verb. This is because "the" is one of those grounding terms. If the third word is "xyzzy" we have nothing a priori to tell us what it is. However, because of the previous word we are certain it is either a noun or the start of a noun group. This analysis hinges on prior discovery in the context. If it were in the middle of a noun group, looking at the left and right window might also produce no further useful evidence if the other terms are equally unique and not previously categorized. Starting from the beginning of the sentence, however, does give us a clue. We already have a verb so we are looking for a subject.
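The "Polish the xyzzy" walk-through can be sketched as a tiny left-to-right tagger. This is only an illustration under my own assumptions: the ANCHORS and AMBIGUOUS tables, the tag names, and the tag_stream function are all made up for this post, and the two rules are just the ones described above, not a general algorithm.

```python
# Hypothetical "grounding" words whose category we trust regardless of context.
ANCHORS = {"the": "DET", "a": "DET", "an": "DET"}

# Hypothetical lexicon of ambiguous words and the senses they could take.
AMBIGUOUS = {"polish": {"NOUN", "VERB", "PROPER_NOUN"}}


def tag_stream(tokens):
    """Tag tokens in reading order, using only what has been seen so far."""
    tags = []
    for token in tokens:
        word = token.lower()
        prev = tags[-1] if tags else None
        if word in ANCHORS:
            tag = ANCHORS[word]
            # A determiner right after an undecided first word that could be
            # a verb settles that first word as a verb ("Polish the ...").
            if len(tags) == 1 and isinstance(tags[0], set) and "VERB" in tags[0]:
                tags[0] = "VERB"
        elif prev == "DET":
            # Whatever follows a determiner is a noun or starts a noun group,
            # even if the word itself has never been seen before.
            tag = "NOUN_GROUP"
        elif word in AMBIGUOUS:
            tag = set(AMBIGUOUS[word])  # still undecided: keep every sense open
        else:
            tag = "UNKNOWN"
        tags.append(tag)
    return list(zip(tokens, tags))


print(tag_stream(["Polish", "the", "xyzzy"]))
# [('Polish', 'VERB'), ('the', 'DET'), ('xyzzy', 'NOUN_GROUP')]
```

Note that no frequency counts appear anywhere. The first word stays open, like the initial hand, until a grounding term is revealed, and the unknown token gets its category purely from what came before it.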

So many systems treat these as edge cases and as "noise" but I think they actually point in the direction systems must take. Language is living and growing and changing. Systems have to expect that and be built with that in mind. When you look at language-defining corpora like TreeBank, with over 100 different parts of speech, you have to wonder if this level of detail is useful, because what ends up happening is that the parts of speech become bound to a domain, when in fact actual usage is much more dynamic and unbounded. My conclusion is that morphology has to be macro in detail. It must not be concerned with a priori assumptions based upon statistical use, but at the same time it needs some grounding in structure when dealing with general input. Clearly this is not an approach that will handle everyone's needs. I think for General NLP, however, this philosophy will provide avenues for a variety of robust approaches.