Tuesday, June 24, 2008

Would you buy a used car from this man?

Today, the emphasis in textual data mining is on the breadth of unstructured text that can be reviewed. For example, many tools emphasize the volume of data that can be ingested. However, when volume is the selling point, the quality of the mined data is often ignored. Just because a tool can ingest hundreds of thousands of documents within a tractable time period does not mean that the results it produces are meaningful, accurate or pertinent. Currently, there are no widely accepted measurement tools that can provide insight into the quality of the mined data, including the integrity of the derived associations and their usefulness to the end user. Rather, the suppliers of such tools approach these concerns much like a used car salesman making his pitch: “Trust me. I personally know that this car was only driven to church on Sundays by the sweetest little old lady you could ever meet.”

The few textual data mining tools that I know to be "correct" are so because they start from a finite lexicon, from which a known set of associations has been extracted. These applications serve targeted areas and have limited, if any, broad applicability. The "process" implemented in these applications consisted of a brute-force analysis of the corpus and observation of the environment from which the corpus was derived. It is not a repeatable process, and as a result there is no chance of developing an algorithm or quantitative method to provide such analyses. In terms of "correctness," I can state with 100% certainty that, for the referenced applications, the defined associations across documents are correct. Please note that I have said nothing about completeness. That is, it is unknown whether every potential contextual association across documents has been identified. One can assume that such associations cannot all be identified a priori.
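The difference between correctness and completeness can be made concrete with a minimal sketch in Python. It assumes something that is usually not available: a hand-verified reference set of associations to compare against. The function name `score_associations` and the sample term pairs are hypothetical and serve only as illustration. They are not taken from any of the applications mentioned here.

```python
def score_associations(extracted, reference):
    """Return (correctness, completeness) for a set of extracted associations.

    correctness  = share of extracted associations confirmed by the reference
    completeness = share of reference associations that were extracted

    Assumption: `reference` is a hand-verified set of associations. Both
    scores are only as trustworthy as that set.
    """
    extracted = set(extracted)
    reference = set(reference)
    confirmed = extracted & reference

    correctness = len(confirmed) / len(extracted) if extracted else 0.0
    completeness = len(confirmed) / len(reference) if reference else 0.0
    return correctness, completeness


# Hypothetical associations, each a pair of terms linked across documents.
reference = {
    ("brake", "recall"),
    ("engine", "warranty"),
    ("mileage", "odometer"),
    ("dealer", "financing"),
}
extracted = {
    ("brake", "recall"),
    ("engine", "warranty"),
}

correctness, completeness = score_associations(extracted, reference)
print(f"correctness={correctness:.2f} completeness={completeness:.2f}")
```

In this made-up case every extracted association is confirmed, yet only half of the reference associations were found. Note that completeness here is measured only against the reference set, which may itself be missing associations.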

As classes of textual data mining tools evolve that do not require a fixed lexicon or an a priori set of contextual associations, the need for a repeatable process to demonstrate both the correctness and the completeness of the derived information becomes paramount. Without such measures, the end user has no way of knowing the validity of the derived information. Similarly, the tool developer has no way to verify the correctness of the extracted data. Until there exists an analytical means to verify and validate a textual data mining process, I assert that confidence in the results provided is, at best, questionable.
