Monday, June 23, 2008

Correctness and Utility


A theme I've been working on for the past few months is the interplay of correctness and utility. At times there is a tradeoff between the two, and I think that tradeoff deserves discussion. In computer science terms, correctness is the degree to which an algorithm or implemented software conforms to a specification. Given a specification for addition, an algorithm that takes 2 and 2 and produces a value of 4 is deemed "correct." What a lot of people have tried in the past with machine learning is to impose a correct model of language on a system and then shoehorn the data into that model. While the results work reasonably well for white papers, they don't for the other 99.9% of inputs.

The reason is that language itself is not correct. In almost all documents, this one included, you will find spelling mistakes, bad diction, bad grammar, neologisms, double negatives, sarcasm, run-on sentences and so many other ills. T33n SMS Sp3@k... You name it, and we still manage to communicate in spite of the rules of standard language. In fact, at times we invent grammar and words and turn things on their ear to communicate more specifically and with more impact than if we had just made statements in standard, correct English. Take a look at advertising, literature or even the script they handed Frank Oz when he took on the part of Yoda.

So even if I spell something wrong or perhaps use awkward phrasing, can you still get utility out of what I write? Can you still find the essential meaning of my text? We all know this is essential for data mining, text analytics and machine learning. We have to overcome human weakness in the way that humans do. We have to be flexible. We have to value utility over correctness because what we have to work with is, itself, not correct.
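As a rough illustration of valuing utility over correctness, here is a minimal sketch that maps misspelled tokens onto a known vocabulary instead of rejecting them. The vocabulary, the function names and the 0.8 similarity cutoff are all assumptions made up for this example; it uses Python's standard `difflib` module and is not a description of any particular product's approach.

```python
import difflib

# Hypothetical vocabulary, chosen only for this illustration.
VOCABULARY = ["essential", "grammar", "correctness", "meaning", "text"]

def normalize_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the lowercased token if none is close."""
    lowered = token.lower()
    if lowered in vocabulary:
        return lowered
    # get_close_matches ranks candidates by similarity and drops those below cutoff.
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else lowered

def normalize_text(text):
    """Normalize each whitespace-separated token; unknown words pass through unchanged."""
    return " ".join(normalize_token(token) for token in text.split())

print(normalize_text("the essencial meaning of my text"))
```

The point of the design is that an unrecognized word is kept rather than discarded, so input that is "incorrect" still yields something usable downstream.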

This leads to another thought, which I won't expand upon much here but which requires its own series of articles. When you score a system for its quality of analytics, it would be a huge mistake to excuse its errors on the grounds that the text itself was incorrect. The reason is that we need to accept the fact that text will always have mistakes in it. While it is understandable why your system did not get 100%, it is important to rate a system that did get the right relationship more highly.
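To make the scoring argument concrete, here is a small sketch comparing two ways of scoring. Everything in it is hypothetical: the relationship labels, the two systems and the "noisy" flags are toy values invented for illustration, not measurements of any real system.

```python
def strict_accuracy(predictions, gold):
    """Score every document, including those whose text contains mistakes."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def lenient_accuracy(predictions, gold, is_noisy):
    """Skip documents flagged as noisy; this is the scoring argued against here."""
    kept = [(p, g) for p, g, noisy in zip(predictions, gold, is_noisy) if not noisy]
    return sum(p == g for p, g in kept) / len(kept)

# Toy data: the last two documents are assumed to contain spelling or grammar errors.
gold = ["works_for", "located_in", "works_for", "owns"]
is_noisy = [False, False, True, True]
system_a = ["works_for", "located_in", "none", "none"]       # fails on all noisy text
system_b = ["works_for", "located_in", "works_for", "none"]  # recovers one noisy case

for name, preds in [("A", system_a), ("B", system_b)]:
    print(name, lenient_accuracy(preds, gold, is_noisy), strict_accuracy(preds, gold))
```

Under the lenient score both toy systems look identical, while the strict score rates the system that recovered a relationship from messy text more highly.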

I'll be writing more of my concepts on quality of analytics as time goes on.
