Monday, August 18, 2008

Getting rid of Google from news posts

I just noticed that Google was very prominent in the news bar at the left. However, the articles weren't very relevant to data mining, text analytics, unstructured data analytics or robotics. So I tried to add "-google" to the list. Guess what - no joy. Then I changed the query to the company I work for, "digital reasoning systems", and guess what, Google comes up! Then I tried "Kiva Systems" and they come up again. Put in robotics and they go away. Put in an excluder for any term in the headline of one of the returned results and it goes away. Google is cheating on their own system so that you can't exclude news about their company even though it is irrelevant! They have also decided to associate themselves with two unrelated companies. I don't know, but it smells dishonest to me. Maybe someone can explain to me why these searches come up with Google so prominently.

Startup Kiva's New (Robotic) Approach to Order Fulfillment - Brightcove

The Kiva CEO talks about warehouse automation. It's a very interesting interview. I didn't get to meet him when I visited Kiva, but perhaps I will on my next visit. The video nicely shows the drive units at work. While they look small and weak, trust me, they will take your leg off in a second if you wander onto their pathways! There are safety devices, of course, but I'm not willing to trust life and limb to them! I love these systems.

One other thing you get to see is the software they use to help the robots get organized. The whole system isn't shown, and I don't think you could casually show it in such an interview. I really enjoyed using it, and it has clearly evolved nicely.

Thursday, August 14, 2008

Punctual Punctuation

I've been looking at the output of a text processor/POS tagger and noticed a whole class of error I haven't been looking for but should: punctuation. It makes a big difference to POS tagging when predicting the start and end of a sentence, especially with words that may be nouns or verbs depending on their context.

The biggest problem I see is in the handling of the period, which for Americans at least appears as the decimal separator in numbers, in abbreviations, and at the end of sentences. So it becomes important to distinguish these three cases accurately. I noticed in some cases that a decimal was being concatenated with the prior word (clearly a bug in differentiating abbreviations from words!).
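
To make the three-way decision concrete, here is a minimal sketch. The abbreviation list and the function name are my own inventions for illustration, not taken from any particular tagger:

```python
import re

# Tiny hand-maintained abbreviation list, invented for this example.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc.", "u.s."}

DECIMAL = re.compile(r"^\d+\.\d+$")  # e.g. "3.14"

def classify_period(token, next_token=None):
    """Guess which of the three roles the period in `token` is playing."""
    if DECIMAL.match(token):
        return "decimal separator"
    if token.lower() in ABBREVIATIONS:
        # An abbreviation can still end a sentence; peeking at the next
        # token's capitalization is a crude tiebreaker at best.
        if next_token and next_token[0].isupper():
            return "abbreviation (possibly also sentence end)"
        return "abbreviation"
    if token.endswith("."):
        return "sentence end"
    return "no period"

print(classify_period("3.14"))         # decimal separator
print(classify_period("etc.", "The"))  # abbreviation (possibly also sentence end)
print(classify_period("dog."))         # sentence end
```

Even a crude classifier like this would have caught the decimal-glued-to-the-prior-word bug I mentioned above.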

I believe I will treat this kind of error in the same class as lemma errors, in that the POS prediction is discounted and the error is attributed to form. I've been working with this kind of measurement for a long time, and I'm surprised I hadn't noticed this class of error before. However, it is good to see it now and incorporate it into my methodology.

Monday, August 4, 2008

Perfection in NLP

Is there perfection in NLP? Let's take, for example, part-of-speech analysis. For many texts, such as the one you are reading, the POS is easily discernible. Occasionally there will be some odd usage that is hard to classify, but for the most part it is clear what the part of speech is. There are basically 60 or so parts of speech in English that are worth tracking. The University of Pennsylvania's Treebank has around 110, with some being odd combinations of parts of speech.

However, you run into trouble with text messages and forum postings:

"C U L8r"
"That show was def"
"Another Halo Game? Interesting."

You run into further trouble when you look at these song lyrics (from Weird Al):

"What y'all wanna do?
Wanna be hackers? Code crackers? Slackers
Wastin' time with all the chatroom yakkers?
9 to 5, chillin' at Hewlett Packard?
Workin' at a desk with a dumb little placard?
Yeah, payin' the bills with my mad programming skills
Defraggin' my hard drive for thrills
I got me a hundred gigabytes of RAM
I never feed trolls and I don't read spam"

Those aren't so bad, but now try to deconstruct "Welcome to the Terrordome":

"I got so much trouble on my mind
I refuse to lose
Here's your ticket
Hear the drummer get wicked
The crew to you to push the back to black
Attack so I sat and japped
Then slapped the mac
Now Im ready to mike it
(you know I like it) huh
Hear my favoritism roll oh
Never be a brother like to go solo
Lazer, anastasia, maze ya
Ways to blaze your brain and train ya
The way I'm livin, forgiven
What Im givin up"

OK, so is it cheap racial demagoguery or social conscience writ large? The fact is that it makes sense to some people who can decode it. However, from a machine learning point of view it's a mess! And yet here we have written communication. It represents the problem of attaining perfection. I will say that I had as much trouble understanding what Chuck D was saying in the above lyrics as I did reading a macroeconomics textbook. However, the more I knew, the more both made sense. The start, of course, is knowing where to begin, and with just a simple part-of-speech analysis I think it is asking a lot of machines, heck, even people, to get anywhere with text they don't understand. I'm reminded of the Chinese Room Argument.

We all know that these are the challenges. There is no arguing that language captured in text runs a huge spectrum of quality and consistency. The fact is that most people would probably score poorly on POS identification (guilty as charged here), yet we still manage to understand what we are reading, associate information, spot entities of particular types, and generally get by. Sometimes we miss things that are communicated via sarcasm - like that girl I was staring at who said "why don't you just take a picture." I didn't realize she wasn't serious... Sheesh.

Great, so let's assume that we accept these problems and decide to identify POS anyway. We would like a measure of how "good" the system is, and typically this means taking some text that is "regular", without any weird word plays, lyrics, noise or other nonsense. I have three text files I use often for this purpose. One is a very dry description of plant maintenance. Another is a post-game report on a Red Sox win (oh, shut up, they were losers all my life, let me enjoy the recent string of wins). Finally, I have a review of the Halo 3 game. I keep this one because it is full of weird made-up grammar. Gamers and programmers seem to prefer to invent POS for things instead of using "standard English." You know what I mean: "Spreadsheet these numbers for me." If you don't see the made-up usage in the previous sentence, you need to check out this blog.

I definitely suggest using multiple documents in different voices. Include a lot of things you are interested in - especially if you do entity extraction. If you are interested in people, then a sports report is great because multiple people get mentioned in each post-game report. Find things that interest you and your work and begin there. Keep it short. Creating a reference file is a tedious process, and long documents are hard to do because the tedium is a killer. At least for me. Maybe there is a way to make it fun. Maybe I should use Penthouse Forum articles instead.

The next problem is that tokens inside collocations generally don't keep their normal part of speech. "Welcome to the Learn to Fly Website!" "Fly" is part of a proper noun, so is it a verb? The hell it is. That collocation is a noun. On a token-by-token basis MAYBE that word is a verb, but together with its neighboring tokens it is part of a noun.
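
Here is a toy illustration of what I mean. The retagging function and the collocation list are invented for the example, not the output of any real system:

```python
# Token-level tags before collocation detection; tags are illustrative.
token_tags = [("Welcome", "VB"), ("to", "TO"), ("the", "DT"),
              ("Learn", "VB"), ("to", "TO"), ("Fly", "VB"),
              ("Website", "NN"), ("!", ".")]

# One known collocation and the tag the whole span should carry.
collocations = {("Learn", "to", "Fly", "Website"): "NNP"}

def retag(tagged, colls):
    """Replace token-level tags with a single tag for each known span."""
    tokens = [t for t, _ in tagged]
    out, i = [], 0
    while i < len(tokens):
        for span, tag in colls.items():
            n = len(span)
            if tuple(tokens[i:i + n]) == span:
                out.append((" ".join(span), tag))  # whole span is one unit
                i += n
                break
        else:
            out.append(tagged[i])  # no span starts here; keep token tag
            i += 1
    return out

print(retag(token_tags, collocations))
# "Fly" stops being a verb once "Learn to Fly Website" is one noun unit.
```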

So when do you test for part of speech: before determining collocations or after? You probably want to do BOTH. However, anything prior to determining collocations is for debugging purposes only. You need to know what is going into your process for determining collocations. Corrections here are important. For analytical purposes, however, you will score your POS after the collocations are figured out.

Here is the rub: your collocations might not be correct. Some will be good, some will not. So how do you separate this problem from POS analysis? In my opinion, you take the output and put one token or collocation per line along with the part of speech, and as you score, you check the word count. If the counts match, you compare the output against your idealized reference set. If they don't, you keep reading until the counts match. Each mismatch is a bad collocation. Keep score of good versus bad collocations separately, and if one is bad, don't bother checking the part of speech. Just don't count it. In my opinion, this produces the fairest and most even analysis.
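
In code, the scoring loop might look something like this sketch. The tab-separated line format and the names are my assumptions; the point is the count-matching logic:

```python
def words_in(entry):
    """Number of whitespace-separated words in a token or collocation."""
    return len(entry.split())

def score(system_lines, reference_lines):
    """Each line is 'token_or_collocation<TAB>POS'."""
    good_coll = bad_coll = pos_right = pos_wrong = 0
    si = ri = 0
    while si < len(system_lines) and ri < len(reference_lines):
        s_tok, s_pos = system_lines[si].split("\t")
        r_tok, r_pos = reference_lines[ri].split("\t")
        if words_in(s_tok) == words_in(r_tok):
            # Spans agree (single tokens count as trivially matching
            # spans), so it is safe to score the part of speech.
            good_coll += 1
            if s_pos == r_pos:
                pos_right += 1
            else:
                pos_wrong += 1
            si += 1
            ri += 1
        else:
            # Mismatched span: a bad collocation. Keep reading the side
            # with fewer words until the counts line up again, and don't
            # score the POS for any of it.
            bad_coll += 1
            s_count, r_count = words_in(s_tok), words_in(r_tok)
            si += 1
            ri += 1
            while (s_count != r_count and si < len(system_lines)
                   and ri < len(reference_lines)):
                if s_count < r_count:
                    s_count += words_in(system_lines[si].split("\t")[0])
                    si += 1
                else:
                    r_count += words_in(reference_lines[ri].split("\t")[0])
                    ri += 1
    return good_coll, bad_coll, pos_right, pos_wrong
```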

Keep track of the types of POS you are using and the number right and wrong for each one. It isn't hard to calculate the F-measure for each part of speech, but you need a statistically significant number of examples in order to get a reasonable number. There are a number of problems with F-measure, so it's not really clear that it is helpful as a measure here. A pure percent-right calculation is also of questionable utility. However, they will do for a start. I have friends working on a new measure that should help in this analysis.
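
The per-tag bookkeeping is straightforward. Here is a sketch of the standard per-tag precision/recall/F-measure calculation from (predicted, reference) tag pairs; nothing in it is specific to my setup:

```python
from collections import defaultdict

def per_tag_f1(pairs):
    """Per-tag F-measure from (predicted, reference) tag pairs."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for predicted, reference in pairs:
        if predicted == reference:
            tp[reference] += 1
        else:
            fp[predicted] += 1   # this tag was predicted wrongly
            fn[reference] += 1   # this tag was missed
    scores = {}
    for tag in set(tp) | set(fp) | set(fn):
        p = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        r = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        scores[tag] = 2 * p * r / (p + r) if p + r else 0.0
    return scores

print(per_tag_f1([("NN", "NN"), ("VB", "NN"), ("NN", "NN"), ("JJ", "JJ")]))
# {'NN': 0.8, 'VB': 0.0, 'JJ': 1.0} - note how few examples back each number
```

With only a handful of examples per tag, these numbers bounce around wildly, which is exactly the statistical-significance caveat above.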

As for the problematic cases such as lyrics and text captured from forums, I suggest trying to determine the collocations and parts of speech by hand first. Then try your process for POS prediction and see how it does. You will find a number of issues. How do you score where someone has inadvertently or deliberately used bad diction? In this article the author talks about the "Obama Affect" when I am pretty certain he meant the "Obama Effect." If we discount bad spelling, diction, and grammar in text, then how can we measure how accurate we are in modes of communication totally defined by them, such as SMS messages? I think we can't do this kind of discounting. We have to pick, if not a POS, at least a role that the token is performing. Even a smiley face has a role, be it decorative or meaningful at a meta level.