The title comes from a George Carlin joke, and I've borrowed it in reverence because it fits today's entry so well.
From the people I spoke with at Text Analytics Summit 2008, it seems that everyone gets recall and precision, some get f-measure, and few if any get any other measurement for analyzing the quality of a product's analytics. This seems odd to me. The f-measure is easy to compute once you have the other two; what I find more difficult is defining recall and precision properly. In fact, it is questions of how to measure those that trip people up the most.
Recall: To simplify, consider that a document is full of entities, and you have a conceptual set of relevant entities. When you go through the document, it is important to count only the ones that are actually relevant. For example, if you were looking for Populated Place names (PPLs), you would want to throw out anything that is a personification or an adjective. "I'm going to Washington" would count, but "Washington was urged to sign the Kyoto Agreement" would not; in the second case, the entity Washington refers to the administration of the United States government. Assuming you have identified all of the relevant entities, the next task is to sum them up. The number of hits you return that match (with perfect registration) is divided by the total number of relevant entities, and that is your recall. So if there are 10 PPLs and you get 6 of them, your recall is 0.6.
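The recall arithmetic above can be sketched in a few lines of Python. The city names here are made up purely for illustration; the point is that recall is the overlap with the gold set divided by the size of the gold set.

```python
# Toy recall computation: compare the system's hits against a gold set of
# relevant PPL entities. Names are illustrative, not from a real corpus.
gold = {"Washington", "Boston", "Denver", "Austin", "Reno",
        "Salem", "Tampa", "Fresno", "Omaha", "Provo"}          # 10 relevant PPLs
hits = {"Washington", "Boston", "Denver", "Austin", "Reno", "Salem"}  # 6 found

recall = len(hits & gold) / len(gold)
print(recall)  # 0.6
```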
Precision: This is really simple. Take the number of relevant hits you have and divide by all the hits you have. So if you have 6 relevant hits but 12 total hits, your precision is 0.5.
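The same example in code, using the counts from the text (6 relevant hits out of 12 returned):

```python
# Toy precision computation: relevant hits divided by all hits returned,
# including the false positives.
relevant_hits = 6   # hits that match the gold set
total_hits = 12     # everything the system returned

precision = relevant_hits / total_hits
print(precision)  # 0.5
```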
Registration: This is where people cheat and fudge numbers. You have to show the instance of the term that was hit to know whether you got it right. In the Washington example above, if both of those sentences were in the target document, you would want to know WHICH Washington was picked up. What cheaters will do is note how many Washingtons are relevant and then count the number of hits without checking registration, so a false positive looks like a true positive. Another cheat I've seen is to take any hits on Washington and flatten them, ignoring the counts and scoring the whole lot as a single true positive. These are real-life examples, and they show you can't just trust the vendor.
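One way to make registration concrete is to score hits by character offsets rather than by surface string, so each occurrence of "Washington" is checked individually. The offsets below are invented for illustration; the contrast between the flattened count and the registration-aware count is the point.

```python
# Registration-aware scoring sketch: a hit only counts if it matches the
# gold annotation's exact character span, not just its surface string.
# Offsets are made up for this example.
gold = {("Washington", 13, 23)}   # the relevant PPL use: "I'm going to Washington"
hits = {("Washington", 40, 50)}   # system hit on the government use instead

# The "flattened" cheat compares surface strings only and sees a match.
true_pos_flat = len({h[0] for h in hits} & {g[0] for g in gold})
# Honest scoring compares full spans and sees a false positive.
true_pos_reg = len(hits & gold)

print(true_pos_flat, true_pos_reg)  # 1 0
```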
So don't let someone scam you with their recall and precision numbers. Ask how they were derived. Don't just accept them as given. Once you have recall (R) and precision (P), there are two ways you can calculate the f-measure: the balanced version, F1 = 2PR / (P + R), and the weighted version, Fb = (1 + b^2)PR / (b^2 * P + R).
With the weighted version you put in a value for b, typically between 0.5 and 1.5, and it shifts the score from preferring precision (b below 1) to preferring recall (b above 1). Which way to lean depends on the individual needs of the analysis you are doing.
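Both versions can be computed with one small function, since the balanced f-measure is just the weighted formula with b = 1. Using the precision of 0.5 and recall of 0.6 from the examples above:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted f-measure: (1 + b^2) * P * R / (b^2 * P + R).

    beta = 1 gives the balanced harmonic mean of P and R;
    beta > 1 favors recall, beta < 1 favors precision.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.6))        # balanced F1, about 0.545
print(f_measure(0.5, 0.6, 1.5))   # leans toward recall, so higher here
print(f_measure(0.5, 0.6, 0.5))   # leans toward precision, so lower here
```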
The point of this post is that you need to know what goes into making an accurate calculation of f-measure. If you have someone doing it for you, they have to really understand recall and precision. If you take shortcuts, you reduce the benefit of the analysis to the point where you start promoting systems that just don't work. If you rely on the vendor, they are likely to sell you a pack of lies. The best approach is to be knowledgeable about how to do the measurement and do it yourself, or find someone who is skilled at doing it. In the end, the security and comfort you get from validating the f-measure will keep you from losing sleep.