Thursday, July 17, 2008

Arrrg. Working with other people's data

The biggest pain with text analytics and data mining is working with other people's data. Invariably it is all garbage. This file is ASCII, that file is UTF-8, this other file uses some weird code page, that file is 7-bit. It can make you pull your hair out. This happens with data from every corporation and even the Federal Government. I was just working with the TraxIntel DB and was having trouble with the analysis missing what I thought was a lot of important information. Of course it turned out to be something as simple as files not being formatted the way I was expecting. I should have seen that coming, since TI pulls in data from a lot of different sources. The lesson is: pre-process all of your input and make sure it is formatted exactly the way you want.
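For what it's worth, here is a minimal sketch of what that pre-processing step could look like in Python. It is not what TraxIntel does; the function name and the list of candidate encodings are my own assumptions, and guessing an encoding by trial decoding is a heuristic that can still pick the wrong code page. The idea is simply to force every input file into one known form (Unicode text, NFC-normalized, Unix line endings) before any analysis sees it.

```python
import unicodedata

# Hypothetical candidate list, strictest first. Plain ASCII is a subset of
# UTF-8, so it needs no entry of its own. "utf-8-sig" also strips a BOM.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252")


def to_clean_text(raw: bytes) -> str:
    """Decode bytes of unknown encoding into normalized Unicode text."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        # Last resort: latin-1 maps every byte to a code point, so it
        # never fails (though the result may not be what the author meant).
        text = raw.decode("latin-1")

    # One canonical form for accented characters, one kind of line ending.
    text = unicodedata.normalize("NFC", text)
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

You would call something like to_clean_text(open(path, "rb").read()) on every file on the way in, and log which encoding won so the odd sources are easy to spot later.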

I should have known better, because when I was working with the public collection of Enron emails the messages had all sorts of encoding in them that is unique to email but isn't text you'd want to analyze. The simple solution was to find plain text versions of the messages. The purist answer, of course, is to create a filter for each of the file formats you might run into and fix the input before it hits the semantic analyzers. What a concept.
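As a rough illustration of such a filter for email, here is a sketch using Python's standard email package. It assumes each message is available as raw RFC 822/MIME bytes, which may not match how any particular copy of the Enron collection is stored, and the function name is made up for this example. The parser undoes the transfer encoding (quoted-printable, base64) and the charset, and hands back only the plain text body.

```python
from email import message_from_bytes, policy


def plain_text_body(raw_message: bytes) -> str:
    """Return the decoded text/plain body of a raw email, or "" if none."""
    msg = message_from_bytes(raw_message, policy=policy.default)

    # Look for a text/plain part only; HTML parts and attachments are skipped.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""

    # get_content() reverses quoted-printable/base64 and decodes the charset.
    return part.get_content()
```

Headers, HTML parts, and attachments never reach the analyzer this way. Messages with no plain text part come back empty, so you can count them and decide whether they are worth a separate filter.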

Anyway, the point of this post is that sometimes when you are debugging this stuff it isn't the high-level models and complex software that are broken. "Garbage in means garbage out." A lesson as old as the hills.
