Thursday, July 9, 2009

Thoughts about Morphological Analysis

This morning I was thinking about the process of morphological analysis. In natural language processing, the goal of this analysis is to classify each token into a type of morpheme; that classification is metadata for the token and aids in determining its actual part of speech. Advanced parsers also offer noun grouping and occasionally a priori entity identification.

The question came to mind: how am I doing this in my head? I believe that the process is ongoing, token by token. I don't read the entire sentence, then perform a careful analysis, and then conclude the meaning. I am getting the meaning as I read, token by token (sort of). Try this experiment:

  • First, slowly read a sentence one word at a time. If you can, cover the sentence up so that you can't see what is coming and reveal each token one at a time.

  • Then read another sentence you have not seen yet from the middle and guess the meaning. Again try to cover portions of it.


What I've found is that with the first method, things make sense. I don't have any trouble understanding the sentence even if I read very slowly. With the second method I have a lot of trouble guessing the meaning of many tokens and end up reading back to the beginning of the current clause. My hunch is that a random-access approach to morphology is therefore suspect, and that understanding a sentence in fact requires treating it as a stream, with a decision process that starts at the first token and ends at the last. The logical order of the text is the key to understanding it. If you come across the word POLISH, its meaning (noun (surface cleaner), verb (applying surface cleaner), proper noun (Polish Sausage)) can be discerned from the words preceding it. A statistical analysis of how often POLISH is used as a verb or a noun may be helpful, but it is not a great starting place because that bias is meaningless if the context suggests otherwise. So why introduce bias in the first place?
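As a rough illustration of the POLISH example, here is a minimal Python sketch in which only the tokens already read decide the reading. The cue lists and the function name `read_polish` are hypothetical choices made for this illustration, not a real lexicon or library, and no frequency prior is applied.

```python
# Hypothetical cue words; a real system would need far richer context.
NOUN_CUES = {"the", "a", "some", "shoe", "furniture"}
VERB_CUES = {"to", "will", "must", "i", "you", "we", "they"}


def read_polish(preceding, token):
    """Guess the reading of 'polish' from the tokens that came before it."""
    if token.lower() != "polish":
        raise ValueError("this sketch only handles the word 'polish'")
    # A capital letter in mid-sentence suggests the proper noun.
    if preceding and token[0].isupper():
        return "proper noun"
    previous = preceding[-1].lower() if preceding else None
    if previous in NOUN_CUES:
        return "noun"
    if previous in VERB_CUES:
        return "verb"
    return "undecided"  # no prior bias is applied; wait for more context


print(read_polish(["I", "bought", "some"], "polish"))  # noun
print(read_polish(["I", "will"], "polish"))            # verb
print(read_polish(["We", "ate"], "Polish"))            # proper noun
print(read_polish([], "Polish"))                       # undecided
```

Note that the sentence-initial case stays undecided rather than falling back to a statistical guess, which is the point of the argument above.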

I consider the problem similar to Texas Hold'em. You get a hand with two cards. Is it a winning hand? You can't know for sure at the start of the round, even with a pair of Aces. Cards are revealed over time and these help you determine the absolute strength of your hand. As we read a sentence there is an expected flow. This expected flow goes beyond diction. It is beyond spelling. It allows us to understand neologisms.

Take for example the LOL Cats. The captions are perfect examples of sentences that break the rules of language so profoundly that they make parents and English teachers cringe. Shakespeare himself spins in his grave. In spite of these critics, there are many fans who view the pictures and laugh at the words, a joke having been communicated to them quite clearly. The tokens have chaotic properties given that there are no rules for their formation. They are impressionistic and often derived from slang to start with. A statistical analysis of tokens like "haz" and "mezzzzzzd" or "caturday" will likely produce just about nothing useful. (Though if you, dear reader, have a process that can make sense of tokens like "caturday" using statistical methods, I would LOVE to know about it and blog on that achievement.)

Let's take a look at an example where there are many words not familiar to English speakers.

Ne proviso al vi trezorojn sur la tero


In the above clause we have an extreme example of neologism. It is, in fact, written in Esperanto. It is well formed and logical, and yet I cannot read it. I can sound out the words and guess at what they may relate to in English, but I can't be sure. Is "ne" equivalent to "no"? Let's take an example that has no equivalent in any Romance language.

ichi go ichi e


The quote is in Japanese and means "one meeting, one chance." I can't figure these out because nothing grounds my reading. I have to know something structural about the sentence that will clue me in to what it means and what each token represents. So, even though there are words that are unfamiliar to me in the LOL Cats captions, there are familiar anchors from which I can begin to divine their meaning. Further, the words, while stylized, do have a connection back to the language in that they often sound like a word that would fit into the sentence at that point. So what we are seeing is not a random collection of tokens but a hidden order or pattern. The reason is that language is about transmitting concepts and relationships. Objects, actions, and their relationships are what we are communicating, no matter the language or how abused the syntax is. Some people have a natural skill for understanding non-standard use of language. Often language rules are broken severely in music.

I often bring up Rap music because it is where the rules of language are most severely tested and where the most interesting analysis is for me (even though I don't particularly like the music or the message). Frankly, if computer software gets to the point of decrypting Rap (aka "hip hop"), that will be a major milestone in my humble opinion. The structure is often dictated by a need for vocal rhythm, as this kind of music is sung not to a pitch but to a beat provided by the background music. Taken out of the context of the rhythm, rap lyrics are difficult to analyze. Often what rap lyrics deliver is a rapid projection of sentence fragments in sequence, and that sequence itself gives the relationship between the concepts and thus the ultimate meaning. Here is a humorous "translation" of a rap song that has been floating around the Internet for a long time. The lyrics are from the artist known as The Notorious B.I.G., from his song "One More Chance."

First things first, I poppa, freaks all the honeys
Dummies - playboy bunnies, those wantin’ money


And the suggested translation...

As a general rule, I perform lewd acts with women of all kinds, including but not limited to those with limited intellect, nude magazine models, and prostitutes.


While I cannot condone the treatment of women in The Notorious B.I.G.'s music, it does give us an interesting problem for analysis. The point here is that the mechanics are more important than the actual meaning. The analysis transcends the creative use of language and the invention of new grammatical structures and neologisms such as "poppa" and the colloquialism "wantin'", along with the fragmentary clauses. The analysis depends upon an expected structure with enough grounding back into the original language to provide familiarity that aids in the decryption. Upon first hearing the song, one might not know what "freaks" means in the context of this sentence, but structurally we are hoping for a verb! True understanding of the sentence requires cultural knowledge. Mechanical understanding, however, does not, and that is what we are expecting to use natural language processing for.

Going back to my original concept, we can look at a word like "freaks" and see that statistically it is most likely a plural noun. Most morphological analysis stops right there. WordNet will tell you it means "addict, monster or to lose one's nerve", which is not what the Big Poppa was trying to say. WordNet is not down with the rap!

Without knowing what the cultural usage means, I do see that Poppa --> Freaks --> (honeys, dummies, playboy bunnies, those wantin' money). The relationships between subject and objects through the verb are more important than the actual meaning of the words and go beyond what I would call Standard English. From a computer science perspective I see five entities related through one action. Four of the entities are grouped and related by being set members.
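To make the "five entities, one action" view concrete, here is a small Python sketch of how the relationship might be held as data. The `Relation` class and its field names are hypothetical and chosen only for this illustration; nothing here extracts the structure from text automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One action linking a subject to a group of objects."""
    subject: str
    action: str
    objects: frozenset


lyric = Relation(
    subject="poppa",
    action="freaks",
    objects=frozenset(
        {"honeys", "dummies", "playboy bunnies", "those wantin' money"}
    ),
)

# Five entities in total: one subject plus four set members.
entities = {lyric.subject} | lyric.objects
print(len(entities))  # 5

# The structure holds even though the meaning of the action is unknown.
for obj in sorted(lyric.objects):
    print(f"{lyric.subject} --{lyric.action}--> {obj}")
```

The action is stored as an opaque string, so the structure is usable without any dictionary sense for "freaks".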

My conclusion from this train of thought is that analysis, if done on a token-by-token basis, must not be some random analysis of each token but one that starts at the beginning of the sentence and heavily determines the nature of the current token by its neighbors, which were also determined this way.

I believe the first word in a sentence is a special case. Going back to the Texas Hold'em analogy, this word is your initial hand. You really don't have a lot of information. The better constructed the sentence and the greater your understanding of the language, the more likely you are to understand the use of the word. If the first word is "Polish", which is naturally capitalized at the start of a sentence, we have no idea which sense it may be (and even if statistically it is most likely to be a verb, how does that help us?). If the next word is "the", however, we have nailed it as a verb. This is because "the" is one of those grounding terms. If the third word is "xyzzy", we have nothing a priori to tell us what it is. However, because of the previous word, we are certain it is either a noun or the start of a noun group. This analysis hinges on prior discovery in the context. If "xyzzy" were in the middle of a noun group, looking at the left and right window might produce no further useful evidence if the other terms are equally unique and not previously categorized. Starting from the beginning of the sentence, however, does give us a clue. We already have a verb, so we are looking for its object.
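The "Polish the xyzzy" walk-through can be sketched as a left-to-right pass in Python. This is only a toy under stated assumptions: the `DETERMINERS` set, the tag names, and the function `tag_stream` are hypothetical, and the only grounding terms it knows are three determiners.

```python
DETERMINERS = {"the", "a", "an"}  # hypothetical, tiny set of grounding terms


def tag_stream(tokens):
    """Tag tokens left to right, using only decisions already made."""
    tags = []
    for position, token in enumerate(tokens):
        word = token.lower()
        if word in DETERMINERS:
            # A determiner right after an undecided first word suggests
            # that the first word was a verb ("Polish the ...").
            if position == 1 and tags[0] == "UNDECIDED":
                tags[0] = "VERB"
            tags.append("DET")
        elif tags and tags[-1] == "DET":
            # Whatever follows a determiner is a noun or starts a noun
            # group, even if the token has never been seen before.
            tags.append("NOUN-GROUP")
        else:
            tags.append("UNDECIDED")
    return list(zip(tokens, tags))


print(tag_stream(["Polish", "the", "xyzzy"]))
# [('Polish', 'VERB'), ('the', 'DET'), ('xyzzy', 'NOUN-GROUP')]
```

The first token starts out undecided, like the opening hand in the poker analogy, and is settled only when the next token is revealed. The unknown token "xyzzy" is classified purely from the decision made just before it.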

So many systems treat these as edge cases and as "noise", but I think they actually point in the direction systems must take. Language is living and growing and changing. Systems have to expect that and be built with that in mind. When you look at language-defining corpora like TreeBank, with over 100 different parts of speech, you have to wonder if this level of detail is useful, because what ends up happening is that the parts of speech become bound to a domain when actual usage is much more dynamic and unbounded. My conclusion is that morphology has to be macro in detail. It must not be concerned with a priori assumptions based upon statistical use, but at the same time it needs some grounding in structure when dealing with general input. Clearly this is not an approach that will handle everyone's needs. I think for General NLP, however, this philosophy will provide avenues for a variety of robust approaches.
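One way to picture "macro in detail" is to collapse fine-grained tags into a handful of coarse categories. The Python sketch below assumes Penn Treebank style tag names as input; the macro labels, the `MACRO_BY_PREFIX` table, and the function `to_macro` are my own hypothetical choices, not part of any corpus or toolkit.

```python
# Coarse ("macro") categories keyed by the leading letters of a fine tag.
# Fine tags follow Penn Treebank naming; the macro labels are illustrative.
MACRO_BY_PREFIX = [
    ("NN", "noun"),      # NN, NNS, NNP, NNPS
    ("VB", "verb"),      # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "modifier"),  # JJ, JJR, JJS
    ("RB", "modifier"),  # RB, RBR, RBS
    ("DT", "anchor"),    # determiners such as "the"
    ("IN", "anchor"),    # prepositions and subordinating conjunctions
]


def to_macro(fine_tag):
    """Map a fine-grained tag to a coarse category by its prefix."""
    for prefix, macro in MACRO_BY_PREFIX:
        if fine_tag.startswith(prefix):
            return macro
    return "other"


print([to_macro(tag) for tag in ["NNS", "VBZ", "JJR", "DT", "UH"]])
# ['noun', 'verb', 'modifier', 'anchor', 'other']
```

Where exactly to draw the coarse boundaries is a design choice; the sketch only shows that a small, structure-oriented inventory can sit on top of a detailed one.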

37 comments:

shetech said...

The gauntlet has been thrown. Now I must come up with a statistical breakdown of "caturday". Nicely done.
