Tuesday, November 18, 2008

Can you use adjectives to infer nouns in entity extraction?

I will limit this case to location extraction, since it is a common one and one where I seem to find a lot of confusion. To give away the punch line: the definitive answer is no, you can't use adjectives. The full answer, though, is more subtle than that. Read on, read on.

Using adjectives to find locations in entity extraction has many flaws. Semantically, a location can only be a noun. The theory is that in a sentence such as "The ship was beached along the Boston Coastline," the word Boston is the location. Boston in this context is an adjective. There are two cases where this is provably incorrect and one case where it is simply misleading.

Case 1: Location Name has Multiple Meanings:

There are perfectly valid location names that carry additional common meanings. Bad, Iran; Train, Germany; Hit, Iraq; and Car, Romania are just four examples of hundreds worldwide. If we take the adjective approach, then when we find "train station" or "bad karma" or "I'd hit that" (OK, that's a predicate, you got me - I just wanted to use that phrase!) or "car wash," we infer locations. This is clearly absurd. Even names like Boston that seem unique turn out to have additional meanings: Boston is a transliterated variant of the Uzbek word Baystan. What it means I have no clue.

Case 2: Adjective looks like a location but isn't

There are many of these. Boston cream pie, Boston Bruins, Boston Marathon, New York minute, etc. No discussion really needed here.

Case 3: Adjective is sort of a location but is misleading

The Boston Marathon also falls into this category. The race does indeed terminate in Boston, but less than 5% of it is run within the city of Boston; the other 95+% runs through many other places. Along the way you find another misleading place called Boston College at the top of Heartbreak Hill. The problem there is that Boston College is not in Boston. It is in Chestnut Hill.

Further, these terms can be used as locations, e.g., "I am going to the Boston Marathon" or "I am heading to Boston City Hall." They are destinations - but they are more specific than just the city.

What is the correct method of dealing with these cases? A good semantic system is not only aware of the parts of speech of individual words, it also recognizes important collocations of words. So instead of seeing:

The: Determiner
Boston: Adjective
Coastline: noun
is: predicate
beautiful: modifier
.: end of sentence

It would see

The: determiner
Boston Coastline: noun
is: predicate
beautiful: modifier
.: end of sentence
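The grouping step can be sketched in a few lines of Python. The gazetteer entries and the greedy longest-match strategy here are illustrative assumptions on my part, not a description of any particular product:

```python
# Minimal sketch: multi-word terms from a (hypothetical) gazetteer are
# merged into single noun tokens before part-of-speech assignment.

GAZETTEER = {("boston", "coastline"), ("boston", "city", "hall")}  # assumed entries

def group_collocations(tokens):
    """Greedily merge the longest gazetteer match at each position."""
    out, i = [], 0
    while i < len(tokens):
        match = None
        # try longer spans first (up to 3 tokens in this sketch)
        for n in range(min(3, len(tokens) - i), 1, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in GAZETTEER:
                match = n
                break
        if match:
            out.append((" ".join(tokens[i:i + match]), "noun"))
            i += match
        else:
            out.append((tokens[i], None))  # POS left to a separate tagger
            i += 1
    return out

print(group_collocations(["The", "Boston", "Coastline", "is", "beautiful", "."]))
# [('The', None), ('Boston Coastline', 'noun'), ('is', None),
#  ('beautiful', None), ('.', None)]
```

Real systems discover collocations statistically rather than from a fixed list, but the effect on the token stream is the same.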

In an ideal solution, "Boston Coastline" would then be in your gazetteer with a correct coordinate representation. The software I work on does not yet handle linear or polygonal targets and could not represent even a bounding box for this type of target today, but that would not be terribly difficult to implement - it is a simple matter of programming. Entity extraction software that correctly groups terms like "Boston Coastline" is advanced; that is what our software does, and it is a significant problem to solve. Once that is done, it is simply a matter of having a properly populated gazetteer and the ability to distinguish in the database between point and area features.

Finally, a good semantic solution looks not only at the words but at how they are used in context.

I like to travel to Train, Germany.


I like to train in Germany.

In the first case we are definitely talking about Train as a noun and a location. In the second case "train" is a predicate and not a location. Imagine further that this is all-caps data, and it becomes an even more difficult problem.
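A crude version of that contextual check can be written without relying on capitalization at all, which helps with the all-caps case. The country list and the ", country" pattern below are my own illustrative assumptions:

```python
# Sketch: accept a gazetteer hit as a location only when its context
# supports it, e.g. when it is followed by ", <country>".

COUNTRIES = {"germany", "iran", "iraq", "romania"}  # tiny assumed list

def is_location_mention(tokens, i):
    """Treat tokens[i] as a location only if followed by ', <country>'."""
    return (i + 2 < len(tokens)
            and tokens[i + 1] == ","
            and tokens[i + 2].lower() in COUNTRIES)

sent1 = ["I", "like", "to", "travel", "to", "Train", ",", "Germany", "."]
sent2 = ["I", "like", "to", "train", "in", "Germany", "."]
print(is_location_mention(sent1, 5))  # True
print(is_location_mention(sent2, 3))  # False
```

A real disambiguator would combine many such signals (POS of the neighbors, governing preposition, gazetteer priors), but even this one rule separates the two sentences above.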

So avoid the trap of thinking that placename labels are useful without an understanding of their groupings, parts of speech and context.

Monday, August 18, 2008

Getting rid of Google from news posts

I just noticed that Google was very prominent in the news bar at the left. However, the articles weren't very relevant to data mining, text analytics, unstructured data analytics or robotics. So I tried to add "-google" to the list. Guess what - no joy. Then I changed the query to the company I work for, "digital reasoning systems," and guess what, Google comes up! Then I tried "Kiva Systems" and they come up again. Put in robotics and they go away. Put in an excluder for any term in the headline of one of the returned results and it goes away. Google is gaming their own system so that you can't exclude news about their company even though it is irrelevant! They have also decided to associate themselves with two unrelated companies. I don't know, but it smells dishonest to me. Maybe someone can explain to me why these searches come up with Google so prominently.

Startup Kiva's New (Robotic) Approach to Order Fulfillment - Brightcove


The Kiva CEO talks about warehouse automation. It's a very interesting interview. I didn't get to meet him when I visited Kiva, but perhaps on my next visit. The video nicely shows the drive units at work. While they look small and weak, trust me, they will take your leg off in a second if you wander onto their pathways! There are safety devices, of course, but I'm not willing to trust life and limb to them! I love these systems.

One other thing you get to see is the software they use to help the robots get organized. The whole system isn't shown, and I don't think you could casually show it in such an interview. I really enjoyed using it, and it has clearly evolved nicely.

Thursday, August 14, 2008

Punctual Punctuation

I've been looking at the output of a text processor/POS tagger and noticed a whole class of error I haven't been looking for but should: punctuation. Predicting the start and end of a sentence makes a big difference in POS tagging, especially with words that may be nouns or verbs depending on their context.

The biggest problem I see is in the handling of the period, which, for Americans at least, appears as the decimal separator in numbers, in abbreviations and at the end of sentences. So it becomes important to distinguish these three cases accurately. I noticed in some cases that a decimal number was being concatenated with the prior word (clearly a bug in differentiating abbreviations from words!).
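The three period cases can be sketched roughly like this. The abbreviation list and the heuristics are simplified assumptions; a production tokenizer is far more careful:

```python
import re

# Sketch of the three period cases: decimal separator, abbreviation,
# and sentence terminator. Real tokenizers are much messier.

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "u.s."}  # illustrative list

def classify_periods(text):
    labels = []
    for m in re.finditer(r"\.", text):
        i = m.start()
        before = text[max(0, i - 1):i]   # character before the period
        after = text[i + 1:i + 2]        # character after the period
        word = re.search(r"\S*\.$", text[:i + 1])  # token ending here
        if before.isdigit() and after.isdigit():
            labels.append((i, "decimal"))
        elif word and word.group().lower() in ABBREVIATIONS:
            labels.append((i, "abbreviation"))
        else:
            labels.append((i, "sentence-end"))
    return labels

print(classify_periods("Dr. Smith paid 3.50 today."))
```

Even this toy version catches the concatenation bug described above: a decimal period is never treated as a token boundary.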

I believe I will treat this kind of error in the same class as lemma errors, in that the POS prediction is discounted and the error is attributed to form. I've been working with this kind of measurement for a long time, and I'm surprised I hadn't noticed this class of error before. However, it is good to see it now and incorporate it into my methodology.

Monday, August 4, 2008

Perfection in NLP

Is there perfection in NLP? Let's take, for example, part-of-speech analysis. For many texts, such as the one you are reading, the POS is easily discernible. Occasionally there will be some odd usage that is hard to classify, but for the most part it is clear what the part of speech is. There are basically 60 or so parts of speech in English that are worth tracking; the University of Pennsylvania's Penn Treebank tagset has around 45, some being odd combinations of parts of speech.

However you run into trouble with text messages and forum postings:

"C U L8r"
"That show was def"
"Another Halo Game? Interesting."

You have further trouble when you look at these musical lyrics (from Weird Al):

"What y'all wanna do?
Wanna be hackers? Code crackers? Slackers
Wastin' time with all the chatroom yakkers?
9 to 5, chillin' at Hewlett Packard?
Workin' at a desk with a dumb little placard?
Yeah, payin' the bills with my mad programming skills
Defraggin' my hard drive for thrills
I got me a hundred gigabytes of RAM
I never feed trolls and I don't read spam"

Those aren't so bad but now try to deconstruct "Welcome to the Terrordome":

"I got so much trouble on my mind
I refuse to lose
Here's your ticket
Hear the drummer get wicked
The crew to you to push the back to black
Attack so I sat and japped
Then slapped the mac
Now Im ready to mike it
(you know I like it) huh
Hear my favoritism roll oh
Never be a brother like to go solo
Lazer, anastasia, maze ya
Ways to blaze your brain and train ya
The way I'm livin, forgiven
What Im givin up"

OK, so is it cheap racial demagoguery or social conscience writ large? The fact is that it makes sense to people who can decode it. From a machine learning point of view, however, it's a mess! And yet here we have written communication. It represents the problem of attaining perfection. I will say that I had as much trouble understanding what Chuck D was saying in the above lyrics as I did reading a macroeconomics textbook; however, the more I knew, the more both made sense. The start, of course, is knowing where to begin, and with just a simple part-of-speech analysis I think it is asking a lot of machines - heck, people even - to get anywhere with text they don't understand. I'm reminded of the Chinese Room Argument.

We all know these are the challenges. There is no arguing that language captured in text spans a huge spectrum of quality and consistency. The fact is that even people who would probably score poorly on POS identification (guilty as charged here) still manage to understand what they are reading, associate information, spot entities of particular types and generally get by. Sometimes we miss things that are communicated via sarcasm - like that girl I was staring at who said "why don't you just take a picture." I didn't realize she wasn't serious... Sheesh.

Great, so let's assume that we accept these problems and decide to identify POS anyway. We would like a measure of how "good" the system is, and typically this means taking some text that is "regular," without any weird word plays, lyrics, noise or other nonsense. I have three text files I use often for this purpose. One is a very dry description of plant maintenance. Another is a post-game report on a Red Sox win (oh, shut up - they were losers all my life, let me enjoy the recent string of wins). Finally, I have a review of the game Halo 3. I have this one because it is full of weird made-up grammar. Gamers and programmers seem to prefer to invent parts of speech for things instead of using "standard English." You know what I mean: "Spreadsheet these numbers for me." If you don't see the made-up usage in the previous sentence, you need to check out this blog.

I definitely suggest using multiple documents in different voices. Include a lot of things you are interested in - especially if you do entity extraction. If you are interested in people, then a sports report is great because multiple people get mentioned in each post-game report. Find things that interest you and your work and begin there. Keep it short. Creating a reference file is a tedious process, and long documents are hard to do because the tedium is a killer. At least for me. Maybe there is a way to make it fun. Maybe I should use Penthouse Forum articles instead.

The next problem is that words inside a collocation generally do not keep their normal part of speech. "Welcome to the Learn to Fly Website!" "Fly" is part of a proper noun, so is it a verb? The hell it is. That collocation is a noun. On a token-by-token basis MAYBE that word is a verb, but together with its neighboring tokens it is part of a noun.

So when do you test for part of speech - before determining collocations or after? You probably want to do BOTH. However, anything prior to determining collocations is for debugging purposes only: you need to know what is going into your collocation process, and corrections there are important. For analytical purposes, however, you will score your POS after the collocations are figured out.

Here is the rub: your collocations might not be correct. Some will be good, some will not. So how do you separate this problem from POS analysis? In my opinion, you take the output and put one token or collocation per line along with its part of speech, and as you score, you check the word count. If the counts match, you compare the output against your idealized reference set. If they don't, you keep reading until the counts match; each mismatch is a bad collocation. Keep score of good versus bad collocations separately, and if a collocation is bad, don't bother checking its part of speech - just don't count it. In my opinion this produces the most even and fairest analysis.
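That alignment procedure can be sketched as follows. The input format (lists of term/POS pairs) and the tag names are assumptions for illustration:

```python
# Sketch of alignment-based scoring: a term whose word span does not
# line up with the reference is counted as a bad collocation and its
# POS is not scored. Assumes both sides cover the same word sequence.

def score(output, reference):
    good = bad = pos_right = 0
    i = j = 0
    while i < len(output) and j < len(reference):
        out_words = output[i][0].split()
        ref_words = reference[j][0].split()
        if out_words == ref_words:
            good += 1
            if output[i][1] == reference[j][1]:
                pos_right += 1
            i += 1; j += 1
        else:
            # mismatched grouping: advance both sides until word counts realign
            bad += 1
            oi, rj = len(out_words), len(ref_words)
            while oi != rj:
                if oi < rj:
                    i += 1; oi += len(output[i][0].split())
                else:
                    j += 1; rj += len(reference[j][0].split())
            i += 1; j += 1
    return {"good": good, "bad": bad, "pos_right": pos_right}

out = [("The", "DT"), ("Boston", "JJ"), ("Coastline", "NN"), ("is", "VBZ")]
ref = [("The", "DT"), ("Boston Coastline", "NN"), ("is", "VBZ")]
print(score(out, ref))  # {'good': 2, 'bad': 1, 'pos_right': 2}
```

Here the tagger's failure to group "Boston Coastline" is charged to the collocation score, while "The" and "is" are still scored for POS.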

Keep track of the POS tags you are using and the number right and wrong for each one. It isn't hard to calculate the F-measure for each part of speech, but you need a statistically significant number of examples to get a reasonable figure. There are a number of problems with the F-measure, so it's not really clear that it is helpful here, and a pure percent-right calculation is also of questionable utility. However, they will do for a start. I have friends working on a new measure that should help in this analysis.

As for the problematic cases such as lyrics and text captured from forums, I suggest first trying to determine the collocations and parts of speech by hand. Then try your process for POS prediction and see how it does. You will find a number of issues. How do you score where someone has inadvertently or deliberately used bad diction? In this article the author talks about the "Obama Affect" when I am pretty certain he meant the "Obama Effect." If we discount bad spelling, diction and grammar in text, then how can we measure how accurate we are in modes of communication totally defined by them, such as SMS messages? I think we can't do this kind of discounting. We have to pick, if not a POS, at least a role that the token is performing. Even a smiley face has a role, be it decorative or meaningful at a meta level.

Friday, July 18, 2008

The basics of Text Mining

The basic needs of any text mining effort are:
  • Text processing (formatting, cleaning, decoding, encoding, etc.)
  • Determination of collocations (words that have more meaning together than apart - 'United States of America' is a single concept and should be grouped together.)
  • Determination of parts of speech/roles.
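The three steps above can be chained as a toy pipeline. Every stage here is a deliberately naive stand-in for much more involved processing:

```python
# Minimal pipeline sketch: clean -> collocate -> tag.

def clean(text):
    return " ".join(text.split())  # whitespace normalization as a stand-in

def collocate(tokens, phrases={("united", "states", "of", "america")}):
    out, i = [], 0
    while i < len(tokens):
        for n in (4, 3, 2):  # longest phrase first
            if tuple(t.lower() for t in tokens[i:i + n]) in phrases:
                out.append(" ".join(tokens[i:i + n])); i += n; break
        else:
            out.append(tokens[i]); i += 1
    return out

def tag(tokens):
    # placeholder tagger: multi-word groups are treated as proper nouns
    return [(t, "NNP" if " " in t else "?") for t in tokens]

text = "I  visited the United States of America last year"
print(tag(collocate(clean(text).split())))
```

The point is the ordering: collocation determination happens after cleaning and before final POS/role assignment, exactly as in the list above.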

You can tell how advanced a system is by what it does with the text before analysis. Digital Reasoning, Attensity and others do all three before analysis. It is a key factor in "exhaustive extraction" and in the creation of advanced structures like associative networks. Without understanding the semantic structure, how can one determine the actual meaning of the elements?

I've been getting more and more frustrated with search engines. You have to jump through so many hoops to do the types of searches I've gotten used to with semantics-aware engines. "Remains" is both a predicate and an entity, depending on how it is used. To find exactly what I am looking for, I have to put in the term and look at what comes back. When I see it is bringing back a lot of incorrect cases, I help it disambiguate by adding negative examples. So if I am searching on "cold" and get a lot of responses about low temperature when I was really interested in the respiratory ailment, I would add "-temperature" - which is fine unless some article I want actually uses temperature in the sense of fever ("running a temperature"), in which case I would have just filtered it out.

When you are evaluating text mining solutions, it is important to make sure they can provide this kind of functionality. Google is a '90s technology. Simple as it is, it does give access to a lot of information, but its simplicity also makes for a lot of work when we are examining data at large scale with subtle search requirements. In fact, the more important it is to find "The Document" rather than documents generally related to the search topic, the more vital a semantic layer becomes.

Consider this: if a vendor is trying to sell you a solution for "customer voice" - or, as I've heard it called elsewhere, sentiment analysis - ask them how it knows whether "sucks" carries a negative connotation in vacuum cleaner reviews. They may laugh and say that is a minor example, but the fact is, if they aren't dealing with it, they also aren't dealing with a lot of other factors. Keywords are no longer a useful technology; semantic understanding is required for subtle detection. Anyone who tells you differently hasn't a leg to stand on.

Thursday, July 17, 2008

Arrrg. Working with other people's data

The biggest pain in text analytics and data mining is working with other people's data. Invariably it is all garbage: this file is ASCII, that file is UTF-8, this other file uses some weird code page, that file is 7-bit. It can make you pull your hair out. This happens with data from every corporation and even the Federal Government. I was just working with the TraxIntel DB and was having trouble with the analysis missing what I thought was a lot of important information. Of course, it was something as simple as files not being formatted the way I was expecting. This should have been expected, since TI pulls in data from a lot of different sources. The lesson is: pre-process all of your input and make sure it is formatted exactly the way you want.

I should have known better, because when I was working with the public collection of Enron emails, they had all sorts of encodings in them that are unique to email but that aren't text you'd want to analyze. The simple solution was to find plain-text versions of the messages. The purist answer, of course, is to create a filter for the various potential file formats and fix the input before it hits the semantic analyzers. What a concept.
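Such a filter can start as small as this. The list of candidate encodings is an assumption you would tune to your own data:

```python
# Sketch: normalize mixed-encoding input to one internal representation
# by trying a short list of likely encodings in order.

def to_utf8(raw: bytes) -> str:
    for enc in ("utf-8", "cp1252", "latin-1"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")  # last resort

print(to_utf8(b"na\xefve caf\xe9"))  # -> 'naïve café'
```

Note that latin-1 never fails, so it acts as a catch-all; smarter detection (byte-frequency statistics, declared charsets in email headers) is needed when the candidate list grows.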

Anyway, the point of this post is that sometimes when you are debugging this stuff, it isn't the high-level models and complex software that are broken. "Garbage in means garbage out." A lesson as old as the hills.

Tuesday, July 15, 2008

Artificial = Synthetic

From my perspective, the terms "artificial" and "synthetic" convey the same engineering meaning relative to machine learning. The intelligence of a computer system is derived from a human interpretation - a model - of how one believes the learning process evolves. Since this process is bound by that constraint, any derived knowledge can be no better than the underlying algorithm that supports the process. To this end, the application has synthesized the function of the human brain's cognitive capabilities as modeled by the algorithm. The term "artificial" or "synthetic" simply implies that the learning was done by machine.

Furthermore, I argue that Eliza, or any system that implements a similar fundamental concept, is an example of how computers can be programmed to mimic the behavior of a human. I think it is very unfair to claim that Eliza is an AI system. Eliza was simply a program to dupe naive people into believing that the computer was performing an "initial psychiatric interview" with them. It did not offer any form of learning, but rather searched for key words that could be rephrased to answer or create new questions that the user would deem plausible. I view Eliza as a teaser for the capabilities that machine learning could offer in the future. Interestingly, I always considered Eliza to be a condemnation of psychotherapy. Based upon comments raised by my blog colleague in our offline discussions, I now recognize the adverse impact it has had on the perception of AI.

As an engineer, I believe that a fair example of an AI system is a linear/non-linear control mechanism. Such systems are used in various applications and, using real-time data, adapt system performance to ensure that functionality and stability are maintained. A simple example of such a system is a one-step-ahead controller. Examples of these systems can be found aboard ships, in machine-based automation, etc.
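As a toy illustration of the idea (not any fielded system), here is a one-step-ahead controller for a first-order plant with online parameter estimation. All constants are made up for the demo:

```python
# Toy one-step-ahead control of a plant y[k+1] = a*y[k] + b*u[k].
# The controller does not know a and b; it estimates them online and
# at each step picks the input that its *estimated* model says will
# land exactly on the target.

def simulate(a=0.9, b=0.5, target=1.0, steps=30):
    a_hat, b_hat = 0.5, 1.0          # deliberately wrong initial estimates
    y = 0.0
    for _ in range(steps):
        u = (target - a_hat * y) / b_hat  # one-step-ahead control law
        y_next = a * y + b * u            # true (unknown) plant response
        # gradient-style update of the estimates from the prediction error
        err = y_next - (a_hat * y + b_hat * u)
        a_hat += 0.1 * err * y
        b_hat += 0.1 * err * u
        y = y_next
    return y

print(simulate())  # settles close to the target of 1.0
```

Notably, the estimates need not converge to the true (a, b); the controller only needs its predictions to be right along the trajectory it actually visits, which is part of what makes such adaptive systems interesting.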

From many discussions that I have had with my academic peers in sociology and psychology regarding machine learning, I recognize the ongoing philosophical debate that surrounds it (i.e. artificial vs. synthetic). The bottom line is that until the cognitive function of the human brain is mapped, machine intelligence will remain an interpretative representation of this function. Philosophically, I believe that Artificial Intelligence is Synthetic Intelligence.

Monday, July 14, 2008

Artificial vs. Synthetic

Recently around the office there was some debate on the use of "artificial intelligence" vs. "synthetic intelligence." Artificial intelligence implies that the intelligence is not genuine. Synthetic intelligence suggests the intelligence is not just an imitation but is genuinely a form of intelligence, specifically created.

There is an all-too-short article on the subject at Wikipedia. I wish it were quite a bit longer. It's a fresh and healthy debate. While it is philosophical, the whole nature of what we are building is based upon philosophy.

I think for now I come down on the synthetic intelligence end of things. The systems I'm seeing are able to perform intellectual tasks that I would do. They don't do them the same way I would, but they produce similar if not better results, and they certainly do it faster than I do.

Friday, July 11, 2008

Business TN hack job on Digital Reasoning

So I got forwarded this article from Business TN about the company I work for. At first, the terrible journalism standards in it made me angry. I used to run a blog that followed the terrible and sensational journalism used in the Natalee Holloway case. Then I had to sit back and laugh.

My favorite line is an uncited quote about the CEO's mother (who is a board member) "calling all the shots." OK, let's deconstruct this. First, the poor woman has not been well for many months and has been in treatment. While she is recovered now, she pretty much hasn't called any shots for the last year. Second, let's say for a second she was quite influential. What exactly is the problem? Is it that she is a woman? Is it that she is a mother? Really? Come on - how old-school is that? It goes a long way toward proving Tim's original assertion that venture capital in Middle Tennessee is not really capable of evaluating a good thing when they see it and is driving good business ideas out of this state.

Frankly, if Tim moves the shop I would follow. DRS has great technology that has been proven on the field of battle; it needs to be in the hands of civilians now. While we will eventually develop something out of what we are doing with the government, that is a slow process. I have to shrug my shoulders. If that uncredited person actually did say the problem was Tim's mother, then there really is no helping the local VC crowd, as they are hopelessly short-sighted and frankly misogynistic. It is well they weren't named, because any female entrepreneurs in the state who read their comments would instantly cross them off the list of people they would trust.

The fact that the quotes go uncited, however, makes me question the entire article. I wonder how much of it is really just the invention and belief of the author. Now, you would be right to ask where I get the right to criticize him when I blog. Well, for one, I know bad journalism when I see it. Second, I have left the comments open on this blog, so if you want to call BS on something I wrote, I would gladly engage. There are a lot of things in this blog that are opinion and not fact, and sometimes I am not as careful to distinguish between them in my writing as I should be.

I'd be glad to hear from anyone in the NLP field who also finds that the VCs in their area don't get what they do and don't invest in innovative startups. I am sure it is not just Middle Tennessee. When you map the locations of the top 30% of data mining/text analytics companies, they are, not surprisingly, centered on a short list of hubs.

Thursday, July 10, 2008

Measurement in NLP Development

From the blog post at Digital Reasoning Systems.

I blogged about a cooperative effort between development and QoA that I participated in at my day job at Digital Reasoning. I bring it up here because it is related to NLP: an NLP system was developed with a standard measure used to gauge its success.

In talking to Dr. Kaufman (my co-author on this blog) about it, she said that there is a name for that kind of development: the Spiral Design Process. If you have not already visited the DRS blog, you should check it out. It is gaining steam, with several authors contributing. I get to work with some really interesting people, so it is nice to see my articles up with theirs.

I think one of the most important lessons is that you should measure early and often. Back in New England, when I was a kid, my grandfather was a general contractor. Occasionally he would hire me during high school summers to help do finishing work on homes he was building or renovating. He had two pieces of advice for me. One of them was "measure twice, cut once."

Oh, and the other was: hold the hammer by the lower part of the handle for $5 an hour, or up near the head for a $0.50-an-hour wage.

Monday, June 30, 2008

Zappos gets Kiva Systems drive units

Zappos picked up Kiva Systems drive units. Zappos already had an incredibly short order-to-fulfillment cycle; with these robots, productivity will increase even further. The real benefit I see is not only productivity but also giving disabled workers a chance to be employed, as the robots do the hard physical work. The article I've linked to also makes an interesting observation: robots may make it possible for people here in the USA to be hired for these jobs instead of the whole operation being shipped overseas. If so, that would be remarkable.

Wednesday, June 25, 2008

Do Illiterate People get the Full Effect of Alphabet Soup?

The title comes from a George Carlin joke, and in reverence I've borrowed it because it fits today's entry very well.

From the people I spoke with at Text Analytics Summit 2008, it seems that everyone gets recall and precision, some get F-measure, and few if any get any other measurement for analyzing the quality of analytics from products. This seems odd to me. The F-measure is pretty easy to get; what I find more difficult is defining recall and precision. In fact, it is questions of how to measure those that generally trip people up the most.

Recall: To simplify, consider that a document is full of entities, and you have a conceptual set of relevant entities. It is important to make sure that when you go through the document you only count the ones that are actually relevant. For example, if you were looking for populated place names (PPLs), you would want to throw out anything that is a personification or adjective. "I'm going to Washington" would count, but "Washington was urged to sign the Kyoto Agreement" would not; in the second case, the entity Washington is the administration of the United States government. Assuming you have identified all of these, the next task is to sum them up. The sum of every hit you get that matches (with perfect registration) is compared against the total relevant entities, and that is your recall. So if there are 10 PPLs and you get 6 of them, your recall is 0.6.

Precision: This is really simple. Take the number of relevant hits you have and divide by all the hits you have. So if you have 6 relevant hits but your total hits are 12, then your precision is 0.5.

Registration: This is where people cheat and fudge numbers. You have to show the instance of the term that was hit to know if you got it right. In the Washington example above, if both of those sentences were in the target document, you'd want to know WHICH Washington was picked up. What cheaters will do is note how many Washingtons are relevant and then count the number of hits without checking registration, so a false positive looks like a true positive. Another cheat I've seen is to take any hits on Washington and flatten them, ignoring the counts and counting the whole set as one true positive. These are real-life examples and show you can't just trust the vendor.

So don't let someone scam you with their recall and precision numbers. Ask how they were derived; don't just accept them as given. Once you have recall and precision, there are two ways you can calculate the F-measure:

1) Unweighted:

F = 2 · (precision · recall) / (precision + recall)

2) Weighted:

F(b) = (1 + b²) · (precision · recall) / (b² · precision + recall)

With the weighted version you put in a value for b, typically between 0.5 and 1.5. Values of b below 1 shift the measure toward precision, and values above 1 shift it toward recall. Which you prefer depends on the individual needs of the analysis you are doing.
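Recall, precision and the F-measure are only a few lines of code. Representing hits as (entity, position) pairs, as in this sketch, also forces the registration check discussed above:

```python
# Sketch: hits and relevant are sets of (entity, position) pairs so that
# WHICH instance was found is checked, not just how many.

def prf(hits, relevant, beta=1.0):
    true_pos = len(hits & relevant)
    precision = true_pos / len(hits) if hits else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

hits = {("Washington", 3), ("Boston", 9), ("Paris", 12)}
relevant = {("Washington", 3), ("Boston", 9), ("Hit", 20), ("Train", 25)}
p, r, f = prf(hits, relevant)
print(p, r, f)  # precision 2/3, recall 0.5, F1 = 4/7
```

Because positions are part of the set members, the "flattening" cheat (counting any Washington as the right Washington) simply cannot be expressed.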

The point of this post is that you need to know what goes into making an accurate calculation of the F-measure. If you have someone doing it for you, they have to really understand recall and precision. If you take shortcuts, you reduce the benefit of the analysis to the point where you start promoting systems that just don't work. If you rely upon the vendor, they are likely to sell you a pack of lies. The best approach is to be knowledgeable about how to do the measurement and do it yourself, or find someone who is skilled at doing it. In the end, the security and comfort you get from validating the F-measure will keep you from losing sleep.

Tuesday, June 24, 2008

Would you buy a used car from this man?

Today, the emphasis in textual data mining is on the breadth of unstructured text that can be reviewed. For example, many tools emphasize the volume of data that can be ingested. By emphasizing this, however, the quality of the mined data is often ignored. Just because a tool can ingest hundreds of thousands of documents within a tractable time period does not mean that the results it produces are meaningful, accurate or pertinent. Currently, there are no widely accepted measurement tools that can provide insight into the quality of the mined data, including the integrity of the derived associations or their usefulness to the end user. Rather, the suppliers of such tools approach these concerns much like the sales pitch of a used car salesman: "Trust me. I personally know that this car was only driven to church on Sundays by the sweetest little old lady you could ever meet."

The few textual data mining tools that I know to be "correct" are so because they have identified a finite lexicon from which they extract a known set of associations. These applications are for targeted areas and have limited, if any, broad applicability. The "process" implemented in these applications consisted of a brute-force analysis of the corpus and observation of the environment from which the corpus was derived. It is not a repeatable process, and as a result there is no chance of developing an algorithm or quantitative method to provide such analyses. In terms of "correctness," I can state with 100% confidence that for the referenced applications the defined associations across documents are correct. Please note that I have said nothing about completeness - that is, it is unknown whether every potential contextual association across documents is identified. One can assume that all such associations cannot be identified a priori.

As classes of textual data mining tools evolve that do not require a fixed lexicon or an a priori set of contextual associations, the need for a repeatable process to demonstrate both correctness and completeness of the derived information becomes paramount. Without such measures, the end user has no way of knowing the validity of the derived information. Similarly, the tool developer has no way to verify the correctness of the extracted data. Until there exists an analytical means to verify and validate a textual data mining process, I assert that the confidence in the results provided is, at best, questionable.

Monday, June 23, 2008

Learning Robots

Came across the site Learning Robots and was really impressed. From the site:

Some hardwired, pre-programmed robots such as TU Munich's humanoid walking biped and BU Munich's fast robot car perform impressive tasks. But they do not learn like humans do.
So how can we make them learn from experience? Unfortunately, traditional reinforcement learning algorithms are limited to simple reactive behavior and do not work well for realistic robots.

Robot learning in realistic environments requires novel algorithms for learning to identify important events in the stream of sensory inputs, and to temporarily memorize them in adaptive, dynamic, internal states until the memories can help to compute proper control actions.

Correctness and Utility

A theme I've been working on the past few months is the interplay of correctness and utility. At times there is a tradeoff between the two concepts, and I think they deserve discussion. Generally speaking, in computer science terms, correctness describes how well an algorithm or piece of implemented software conforms to a specification. Given a specification for addition, an algorithm that takes 2 and 2 and produces a value of 4 is deemed "correct." What a lot of people have tried in the past with machine learning is to impose a correct model of language on a system and then shoehorn the data into that model. While the results work reasonably well for white papers, they don't for the 99.9% of all other inputs.

The reason for this is that language itself is not correct. In almost all documents, this one included, you will find spelling mistakes, bad diction, bad grammar, neologisms, double negatives, sarcasm, run-on sentences and so many other ills. T33n SMS Sp3@k... You name it, we manage to communicate in spite of the rules of standard language. In fact, at times we invent grammar and words and turn things on their ear to communicate more specifically and with more impact than if we had just made statements in standard, correct English. Take a look at advertising, literature or even the script they handed Frank Oz when he took on the part of Yoda.

So even if I spell something wrong or perhaps use awkward phrasing, can you still make utility out of what I write? Can you still find the essential meaning of my text? We all know this is essential for data mining, text analytics and machine learning. We have to overcome human weakness in the way that humans do. We have to be flexible. We have to value utility over correctness because what we have to work with is, itself, not correct.
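One small, standard-library way to be flexible in the face of misspellings is fuzzy matching against a known vocabulary, so that a token like "essencial" still resolves to "essential." The vocabulary and the 0.8 cutoff below are illustrative assumptions, not a recommendation:

```python
# Sketch of valuing utility over correctness: map misspelled
# tokens onto a known vocabulary using difflib's similarity
# ratio, instead of rejecting anything that isn't spelled right.
import difflib

vocabulary = ["essential", "grammar", "communicate", "language"]

def normalize(token, vocab, cutoff=0.8):
    """Return the closest vocabulary word, or the token unchanged."""
    matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print(normalize("essencial", vocabulary))  # -> essential
print(normalize("grammer", vocabulary))    # -> grammar
print(normalize("xyzzy", vocabulary))      # -> xyzzy (no close match)
```

Real systems use richer context than a bare edit-similarity score, but the principle is the same: extract the meaning the writer intended rather than demanding correct input.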

This leads to another thought which I won't expand upon much here because it requires its own series of articles. When you score a system for its quality of analytics, it would be a huge mistake to excuse it for a mistake made because the text itself was incorrect. The reason is that we need to accept the fact that text will always have mistakes in it. While it is understandable why your system did not get 100%, it would be important to rate a system that did get the right relationship more highly.
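One illustrative way to realize that scoring idea (my own sketch, not an established metric): every miss counts against the system, but a correct answer on defective text earns extra credit, so a system that succeeds despite bad input scores higher than one that only succeeds on clean input.

```python
# Hypothetical scoring sketch: each test item records whether the
# system was right and whether the source text was defective
# (misspelled, ungrammatical, etc.). Misses always count, but
# correct answers on defective text are weighted up by `bonus`.

def weighted_score(results, bonus=0.5):
    """results: list of (correct: bool, text_defective: bool) pairs."""
    earned = sum(1 + (bonus if defective else 0)
                 for correct, defective in results if correct)
    possible = sum(1 + (bonus if defective else 0)
                   for _, defective in results)
    return earned / possible

results = [(True, False),   # right on clean text
           (True, True),    # right despite bad text (extra credit)
           (False, True),   # miss on bad text (still penalized)
           (False, False)]  # genuine miss

print(weighted_score(results))  # 0.5
```

The bonus weight is arbitrary here; the point is only that the scoring scheme should never simply forgive errors caused by imperfect text.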

I'll be writing more of my concepts on quality of analytics as time goes on.

Sunday, June 22, 2008

Educate thy self

Since I've been working in this field for only three years, I've had the need to educate myself about my own language. While knowing what a past participle is won't improve your ability to speak or communicate in real life, it is important to know what it is when you are trying to teach a computer to find important relationships in unstructured documents.

So here are some useful websites that I enjoy and have used to educate myself and as reference:

  • Dr. Grammar - a very useful site for improving your understanding of grammar.
  • Online Writing Support - Towson University's great resource
  • ESL in Canada - generally any ESL site is fantastic for adults looking to re-educate themselves in English, and this one is one of my favorites
  • Part of Speech Tagging - yeah, gotta have at least one Wikipedia entry. This one is worth going over though!
  • UPenn Treebank - you are not involved in part of speech tagging and text analytics if you are not familiar with this project. Seriously.
  • Text Analytics Wiki - this is a new one to my collection. It has promise. Give it a look!
  • Visuwords - Using the Princeton Wordnet database this is a very visual dictionary/thesaurus. Very useful
  • LIWC - an interesting piece of software that I am looking at now. Linguistic Inquiry and Word Count (LIWC) is text analysis software that can calculate the degree to which people use different categories of words across a wide array of texts.
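Since part-of-speech tagging and the UPenn Treebank come up in the list above, here is a toy unigram tagger to show the basic lookup idea. The training pairs are invented for illustration; only the tag names (DT, NN, VBZ) follow the Penn Treebank convention:

```python
# Toy unigram part-of-speech tagger: tag each word with its most
# frequent tag from (invented) training data, falling back to "NN"
# for unseen words. Real taggers trained on the UPenn Treebank use
# context as well, but this lookup is the classic baseline.
from collections import Counter, defaultdict

training = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
            ("the", "DT"), ("cat", "NN"), ("runs", "VBZ")]

counts = defaultdict(Counter)
for word, pos in training:
    counts[word][pos] += 1

def tag(sentence):
    return [(w, counts[w].most_common(1)[0][0] if w in counts else "NN")
            for w in sentence.lower().split()]

print(tag("The dog runs"))  # [('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')]
```

Even this crude baseline gets a surprising share of tokens right on real text, which is exactly why the Treebank's annotated corpus is so valuable: it supplies the word/tag counts.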

Well, that should get you started. Later on I'll start a blog roll of blogs I think have merit.

Saturday, June 21, 2008

Trust me, I know what I'm doing...

As I come up with ideas for blog entries, I find myself looking back at the lessons learned during my college years. One memory in particular that lends itself well to this forum was a physics experiment gone bad. The premise of this experiment was to measure the speed of light using lasers and fundamental measuring equipment. Once my lab partner and I performed the rudimentary testing of the individual pieces of equipment to ensure that everything was in working order, we proceeded with the experiment before us. At the end of the lab session, we had unequivocally proven that light crawled at a dismal 10 cm/sec. Without a doubt, our discovery could set the fields of physics and science back centuries. As the teaching assistant peered over our results, he looked at us with total disdain. He informed my lab partner and me that we were to remain in the lab to demonstrate to him how we achieved this remarkable finding.

As the lab room emptied, the teaching assistant had us walk through the experimental process that had produced our amazing discovery. As we set up the various lenses used to refract the laser beam for measurement, he started shaking his head. My lab partner and I had reversed two of the lenses, and as a result, we were not measuring the refracted light at the appropriate angle! Even though we had painstakingly tested each component of our equipment prior to set-up to ensure its viability, we never considered incrementally testing the set-up during assembly. Rather, we naively believed that since each individual part worked correctly, so would the assembled creation. To this day, I carry the lesson learned that day: test early, test often and then test again! If everyone followed this hard-learned philosophy, the world of software development would be a much better place.

Friday, June 20, 2008

My dog taught me everything I know...

My first serious exposure to machine learning occurred in the fall of 1986 when I took an introductory graduate course in robotics. To this day, I still remember how awe-inspiring it was to develop the control language that allowed a robotic arm to pour alcohol from a sequence of bottles to create a mixed drink. It must be noted that the bottles required arrangement in a particular order, but that did not dampen our enthusiasm. We harnessed the cutting-edge technology of the day to perform a task that would entertain any college student: we “trained” a machine to create a cocktail.

Now while this feat may not seem very awe-inspiring today, it demonstrates a very fundamental principle of machine learning/artificial intelligence: it can never surpass the available technology or separate itself from its dependence on humans. In the twenty-odd years since this event, processors have evolved from 8-bit machines to 64-bit machines and beyond. Memory has had an equally impressive evolution. Similarly, robotic arms are now used in many facets of manufacturing in lieu of humans. However, the tasks performed by these machines are still devised and programmed by humans. We still have not harnessed the capability to allow machines to teach other machines how to perform a task, and in turn, demonstrate true artificial intelligence.

Even as technology evolves and allows machines to perform more complex tasks, there is still an intrinsic need for a human to identify the task, to develop a process by which a machine can learn the task, and then to determine whether the machine can properly perform the task. However, this process can never be undertaken unless the available human-developed technology supports the creation of the needed machine. Similarly, the quality of the task performance by the machine is intrinsically related to the human capability to devise a sufficient training schema. Therefore, I assert that machine learning/artificial intelligence as it stands today is simply a model or collection of models reflecting the beliefs of its creator. This statement should not be taken as ridicule, but rather as a stern rationalization of fact. Furthermore, I assert that this belief should be infused across any application that uses a computer-based system. If we forget the fact that humans are fallible and humans create the machines and processes that support machine learning/artificial intelligence, then we as a society will suffer the consequences. If we recognize this fallibility of human design, then the machine learning/artificial intelligence community at large must begin to address en masse how to demonstrate that their creations are validated and verified.

Thursday, June 19, 2008

-1 Days since Machine Uprising

The title of this entry comes from a safety sign on the floor of Kiva Systems, where they make very impressive commercial robots (but don't call them robots, they tell me!).

This blog's purpose is to discuss the concepts of machine learning. I've been working in the field of machine learning for 3 years at Digital Reasoning Systems, Inc. I've been in IT professionally since 1989. My tour of Kiva last night came through a friend whom I have known for about 4 years but until last night had never met. We play a very popular computer game on Xbox over the Internet. What surprised me greatly was that some of the systems used in the video game for designing floor layouts were actually quite similar to, though not as powerful as, the systems Kiva has for its drive units (*whisper* they are robots!).

The work I do is related to text analytics. DRS does entity extraction, but so do 3 dozen other companies I could name. Add in European companies and that number goes up; add in Asian companies and it expands again. The thing I've been thinking about all day is that eventually machines will respond to our comments, rearranging our space without much intervention on our part. Decisions are already being made based upon text analytics to improve the customer's experience. Assuming we survive the coming energy singularity and fend off the collapse of society, we have a pretty bright future ahead of us. If not, we will be fighting drive units for every inch of ground. Ok, maybe not, but that won't stop Hollywood from speculating.