Thursday, August 14, 2008

Punctual Punctuation

I've been looking at the output of a text processor/POS tagger and noticed a whole class of error I haven't been looking for but should: punctuation. Predicting where a sentence starts and ends makes a big difference to POS tagging, especially for words that may be nouns or verbs depending on their context.

The biggest problem I see is the handling of the period, which, for Americans at least, appears as the decimal separator in numbers, in abbreviations, and at the end of sentences. So it becomes important to distinguish these three cases accurately. In some cases I noticed that a decimal was being concatenated with the prior word (clearly a bug in differentiating abbreviations from words!).
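To make the three cases concrete, here is a rough Python sketch of how a period might be classified. It is not what the tagger I'm looking at does; the function name `classify_periods` and the tiny `ABBREVIATIONS` list are made up for illustration, and it ignores multi-period abbreviations, ellipses, and abbreviations that also end a sentence.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "inc", "etc", "vs"}

def classify_periods(text):
    """Return a list of (index, label) pairs, one per period in text."""
    results = []
    for match in re.finditer(r"\.", text):
        i = match.start()
        prev_char = text[i - 1] if i > 0 else ""
        next_char = text[i + 1] if i + 1 < len(text) else ""
        # Letters immediately before the period (empty if none).
        word = re.search(r"[A-Za-z]*$", text[:i]).group(0).lower()
        if prev_char.isdigit() and next_char.isdigit():
            label = "decimal"        # digit on both sides, e.g. 3.50
        elif word in ABBREVIATIONS:
            label = "abbreviation"   # known abbreviation, e.g. Dr.
        else:
            label = "sentence_end"   # fallback: treat as end of sentence
        results.append((i, label))
    return results

labels = [label for _, label in classify_periods("Dr. Smith paid 3.50 dollars.")]
print(labels)
```

The decimal test runs first so that a number is never mistaken for the end of a sentence; anything that is neither a decimal nor a listed abbreviation falls through to the sentence-end case.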

I believe I will treat this kind of error in the same class as lemma error, in that the POS prediction is discounted and the error is attributed to form. I've been working with this kind of measurement for a long time, and I'm surprised I hadn't noticed this class of error before. However, it is good to see it now and incorporate it into my methodology.
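As a rough illustration of that bookkeeping, the following Python sketch counts a token as a form error whenever its form or lemma is wrong, and only scores the POS tag when the form is right. The function name `score_tokens`, the `(form, lemma, pos)` tuple layout, and the assumption that gold and predicted tokens are already aligned one-to-one are all mine, made for the sake of the example.

```python
from collections import Counter

def score_tokens(gold, predicted):
    """Compare aligned lists of (form, lemma, pos) tuples and count outcomes."""
    counts = Counter()
    for (g_form, g_lemma, g_pos), (p_form, p_lemma, p_pos) in zip(gold, predicted):
        if g_form != p_form or g_lemma != p_lemma:
            # Punctuation/tokenization and lemma errors: blame the form,
            # and do not score the POS prediction at all.
            counts["form_error"] += 1
        elif g_pos != p_pos:
            counts["pos_error"] += 1
        else:
            counts["correct"] += 1
    return counts

gold = [("3.50", "3.50", "NUM"), ("dollars", "dollar", "NOUN"), ("run", "run", "VERB")]
pred = [("paid3.50", "paid3.50", "NOUN"), ("dollars", "dollar", "NOUN"), ("run", "run", "NOUN")]
print(score_tokens(gold, pred))
```

Here the concatenated "paid3.50" counts as a form error even though its tag is also wrong, so the tagger's POS accuracy is not penalized for a mistake made upstream.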
