Tuesday, November 18, 2008

Can you use adjectives to infer nouns in entity extraction?

I will limit this case to location extraction, since it is a common case and one where I seem to find a lot of confusion. To give away the punch line: the definitive answer is no, you can't use adjectives. The full answer, though, is more subtle than that. Read on, read on.

Using adjectives to find locations in entity extraction has many flaws. Semantically, a location can only be a noun. The theory is that in a sentence such as "The ship was beached along the Boston Coastline" the word Boston is the location. But Boston in this context is an adjective. There are two cases where this approach is provably incorrect and one case where it is simply misleading.

Case 1: Location name has multiple meanings

There are perfectly valid location names that also have a common meaning. Bad, Iran; Train, Germany; Hit, Iraq; and Car, Romania are just four examples of hundreds worldwide. If we accept the adjective approach, then whenever we find "train station" or "bad karma" or "I'd hit that" (ok that's a predicate, you got me - I just wanted to use that phrase!) or "car wash" we can infer locations. This is clearly absurd. Even names like Boston that seem to be unique turn out to have additional meanings. Boston is a transliterated variant of the Uzbek word Baystan. What it means I have no clue.
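To make the absurdity concrete, here is a minimal sketch of the naive approach: look up every word in a gazetteer and call each match a location. The tiny GAZETTEER dictionary and the naive_locations function are hypothetical names invented for this illustration, not part of any real product.

```python
# Hypothetical toy gazetteer: lowercase place name -> country.
GAZETTEER = {"bad": "Iran", "train": "Germany", "hit": "Iraq", "car": "Romania"}


def naive_locations(text):
    """Return every word that matches a gazetteer name.

    Part of speech and context are deliberately ignored, which is
    exactly the flaw being illustrated.
    """
    hits = []
    for token in text.lower().split():
        word = token.strip(".,!?\"'")  # drop surrounding punctuation
        if word in GAZETTEER:
            hits.append((word, GAZETTEER[word]))
    return hits


print(naive_locations("Meet me at the train station near the car wash."))
```

Run on a sentence about a train station and a car wash, this reports Train, Germany and Car, Romania, neither of which the sentence is about.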

Case 2: Adjective looks like a location but isn't

There are many of these. Boston cream pie, Boston Bruins, Boston Marathon, New York minute, etc. No discussion really needed here.

Case 3: Adjective is sort of a location but is misleading

The Boston Marathon also falls into this category. The Boston Marathon does indeed terminate in Boston, but less than 5% of the race is in the city of Boston. The other 95+% runs through many other places. Along the way you find another misleading place called Boston College, at the top of Heartbreak Hill. The problem there is that Boston College is not in Boston. It is in Chestnut Hill.

Further, these types can be used as locations, e.g. "I am going to the Boston Marathon" or "I am heading to Boston City Hall". They are destinations, but they are more specific than just the city.

What is the correct method of dealing with these cases? A good semantic system is not only aware of the parts of speech of individual words, it also recognizes important collocations of words. So instead of seeing:

The: determiner
Boston: adjective
Coastline: noun
is: predicate
beautiful: modifier
.: end of sentence

It would see:

The: determiner
Boston Coastline: noun
is: predicate
beautiful: modifier
.: end of sentence

In an ideal solution, Boston Coastline would then be in your gazetteer with a correct coordinate representation. The software I work on does not handle linear or polygonal targets, and today it could not represent even a bounding box for this type of target, but that would not be terribly difficult to implement. It is a simple matter of programming. Correctly grouping terms like "Boston Coastline" is the significant problem to solve, and entity extraction software that does it is advanced. That is what our software does. Once that is done, it is simply a matter of having a properly populated gazetteer and an ability to distinguish in the database between point and area features.
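The grouping idea can be sketched as a greedy longest-match lookup against a phrase table. This is only an illustration under stated assumptions, not how any particular product works: the PHRASES table, its type labels, and the group_phrases function are all hypothetical, and a real system would lean on part-of-speech tagging rather than a hand-built table.

```python
import re

# Hypothetical phrase table: lowercase token tuple -> entity type.
# The "point" vs. "area" labels stand in for the point/area
# distinction a gazetteer would need to store.
PHRASES = {
    ("boston",): "location/point",
    ("boston", "coastline"): "location/area",
    ("boston", "marathon"): "event",
    ("boston", "cream", "pie"): "food",
    ("boston", "college"): "organization",
}
MAX_LEN = max(len(key) for key in PHRASES)


def group_phrases(text):
    """Greedily match the longest known phrase at each position."""
    tokens = re.findall(r"[A-Za-z']+", text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so that "Boston Coastline"
        # wins over the bare word "Boston".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in PHRASES:
                found.append((" ".join(tokens[i:i + n]), PHRASES[key]))
                i += n
                break
        else:
            i += 1  # no phrase starts here; move to the next token
    return found


print(group_phrases("The Boston Coastline is beautiful."))
print(group_phrases("I ate Boston cream pie in Boston."))
```

Because longer phrases are tried first, "Boston cream pie" is consumed as a food and never contributes a spurious city, while the trailing bare "Boston" is still found as a point location.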

Finally, a good semantic solution looks not only at the words but also at how they are used in context.

I like to travel to Train, Germany.

vs.

I like to train in Germany.

In the first case we are definitely talking about Train as a noun and a location. In the second case it is a predicate and not a location. Imagine further that this is all-caps data, and it becomes an even more difficult problem.
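As a rough sketch of one contextual cue, the snippet below accepts a name only when it appears in a capitalized "Name, Country" pattern that the gazetteer confirms. The GAZETTEER set, the PLACE_COUNTRY pattern, and locations_in_context are hypothetical names for this illustration, and a single regular expression is a crude stand-in for real contextual analysis.

```python
import re

# Hypothetical toy gazetteer of (place, country) pairs, lowercase.
GAZETTEER = {("train", "germany"), ("hit", "iraq"), ("bad", "iran"), ("car", "romania")}

# "Name, Country": a capitalized word, a comma, then another capitalized word.
PLACE_COUNTRY = re.compile(r"\b([A-Z][a-z]+), ([A-Z][a-z]+)\b")


def locations_in_context(text):
    """Return 'Name, Country' strings whose pair is in the gazetteer."""
    found = []
    for name, country in PLACE_COUNTRY.findall(text):
        if (name.lower(), country.lower()) in GAZETTEER:
            found.append(f"{name}, {country}")
    return found


print(locations_in_context("I like to travel to Train, Germany."))
print(locations_in_context("I like to train in Germany."))
print(locations_in_context("I LIKE TO TRAVEL TO TRAIN, GERMANY."))
```

The first sentence yields Train, Germany and the second yields nothing. The third, the all-caps version of the first, also yields nothing, because the capitalization cue this rule depends on has disappeared. That is why all-caps data is so much harder.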

So avoid the trap of thinking that placename labels are useful without an understanding of their groupings, parts of speech and context.