ManticMoo.COM | All Articles | Jeff's Articles | Scholarly |
Word Sense Disambiguation Using WordNet
Review by Jeffrey P. Bigham
A text information retrieval system deals with the task of finding as many relevant documents as possible while rejecting irrelevant ones. Disambiguating terms based on the sense in which they are used would seem to have the potential to improve search results by preventing the return of documents that contain query terms used in an unintended sense. Furthermore, matching documents based on the sense of a word, instead of the precise word itself, would allow documents that use words in the same sense as the query words to be returned even if they don't contain the exact query terms. After a brief overview of the various ways in which word sense disambiguation can be performed, this paper discusses a particular implementation constructed by the author and the results and conclusions that can be drawn from it.
The problem that word sense disambiguation has the potential to solve can be broken into two parts: homonymy and synonymy. Homonymy means that one word has two or more different senses in which it can be used, which can result in documents that are not relevant to a query being retrieved anyway. One attempt at solving this adds words to the query that help disambiguate its terms toward the concept the user intended. Unfortunately, this is not always possible, and it is usually difficult to automatically determine which words would be good additions even when it is. One example of this problem is a query consisting of the single term "bat." Without additional information it is impossible for an information retrieval system to determine whether the user wanted documents about baseball bats or flying mammals. If the query had been disambiguated with the additional terms "baseball" or "flying mammal," the request would have been much less ambiguous and relevant documents would have been more likely to be returned. Including this type of information in the retrieval process would intuitively seem to help, but it might not always be beneficial. Because homonymy allows documents that are not relevant to the query to be returned anyway, it primarily hurts the precision of an information retrieval system. (Manning)
Synonymy means that more than one word refers to the same concept or sense. This can be a problem because relevant documents may not be retrieved simply because they don't contain the specific words used in the query. The problem can be addressed with methods such as Latent Semantic Indexing, or by adding words that are used in the same sense as the query terms or are associated with them. Because it can prevent relevant documents from being retrieved, synonymy primarily hurts the recall of an information retrieval system.
Numerous approaches to word sense disambiguation have been explored. In supervised disambiguation, a corpus labeled according to word sense is available and the main task is to classify new cases. This can be done with methods such as information-theoretic disambiguation, which attempts to find contextual features that reliably indicate the sense in which a particular word is used, and Bayesian classification, which computes the probability of each sense given a context. The downside of this approach is that the corpus must be pre-labeled, a time-consuming task that usually must be performed by hand. (Manning)
At the other end of the spectrum is unsupervised disambiguation. Bayesian classification can be used here as well, much as in the supervised case. The difference is that none of the probabilities are known beforehand; instead, a maximum likelihood estimate is obtained by iteratively recalculating the probabilities until they converge. (Manning)
Another general approach is dictionary-based disambiguation. It can be used when nothing is known beforehand about the sense of a particular instance of a word, but a listing of the general senses in which the word can be used is available. Dictionary definitions, thesaurus categories, and WordNet synsets can all serve this purpose. The approach has two problems: it requires determining which of the many listed senses actually applies to the instance in question, and sources may list so many senses for a word that the result is over-fitting to the data. For example, dictionary.com lists 28 different definitions for "run"; treating each as a distinct sense might be too specific. (Manning)
The results of using WordNet for word sense disambiguation in information retrieval have been mixed. While several groups have found that doing so provides little if any benefit, at least one group achieved favorable results by indexing based on WordNet synsets, demonstrating that synset indexing could improve precision from 48% to 62% over the basic SMART information retrieval system. Their system relied on hand-disambiguated documents and queries, so the practical question lies in the accuracy of automatic disambiguation and its effect on performance. They found that disambiguation error rates below 10% did not substantially affect performance, and that their system remained better than the basic SMART run up to a 30% error rate. While far from showing that WordNet can resolve all word sense confusion in information retrieval, this did show that such a system has promise. (Gonzalo)
My project explored the potential of WordNet for this task. The advantages of using WordNet are numerous. First, it is readily (and freely) available and was designed for manipulation by computer. At its core, WordNet is a graph of words connected according to the various relations that WordNet defines between two different word senses. This structure has the potential both to discriminate word senses in documents and queries and to match semantically related words. Furthermore, Perl modules such as Lingua::Wordnet, Lingua::Wordnet::Analysis, and WordNet::QueryData provide easy access to, and straightforward manipulation of, the information contained in the WordNet database.
The goal of this project is to construct my own implementation of an information retrieval system that uses WordNet for word sense disambiguation and to analyze how its performance compares to simple existing systems.
Method
To test the effect that word sense disambiguation can have on information retrieval, I implemented an interface to the AltaVista search engine that uses features of WordNet in an attempt to improve the relevance of the documents AltaVista returns. I chose AltaVista because it uses a rather simple form of matching to decide which documents to return; I wanted to see what improvement my system would provide over simple term-by-term Boolean retrieval, and a search engine like AltaVista allowed that comparison to be made.
The basic idea behind the system I implemented is the following: A user submits a query to my interface program, which then proceeds to calculate the most likely senses of each of the words in the user’s query. It then returns a scored list for each word to the user (lower scores are better). The user then checks to see if the sense that the program chose for each word is the best of those listed. If all of them are correct or if the user just wants to go with the defaults, then he can just press the submit button to move on to the AltaVista results page. If they aren’t, he can change the options to what he feels are the best characterizations of the senses of the words in the manner in which he used them. Words that are associated with the chosen word senses are then added to a search query that is sent to AltaVista and to which the user is finally redirected.
Determining which word senses best match each word in the user's query is done through a combination of functions contained in the WordNet Perl modules, based on example usage provided in the Lingua::Wordnet module documentation. In my program, word senses are compared via the get_min_dist() function, which finds the shortest distance to a hypernym shared by two word senses. A hypernym is defined by WordNet as "a word that is more generic than a given word." All words are related at some level by a hypernym, even if that hypernym is something very generic like "entity," to which almost all words are related. The senses that are closest together under this metric are considered most likely to be the correct senses. The final weight for each sense in this step is determined by penalizing very familiar hypernyms according to the value returned by the familiar() function in the Lingua::Wordnet::Analysis module. The motivation is that quite unrelated words may be connected through something very generic that applies to many words; such a connection can nevertheless be fairly short, and penalizing based on familiarity offsets this effect to some degree.
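The shared-hypernym distance can be sketched in Python over a toy hypernym graph. Everything here is illustrative: the sense labels, the graph, and the single-parent assumption are stand-ins (real WordNet senses can have several hypernyms, and the familiar() penalty is not modeled):

```python
# Toy hypernym graph: each sense or concept points to its one hypernym.
# Labels and structure are hypothetical, not actual WordNet data.
HYPERNYMS = {
    "bat#animal": "mammal", "mammal": "animal", "animal": "entity",
    "bat#club": "implement", "implement": "artifact", "artifact": "entity",
    "baseball#ball": "ball", "ball": "artifact",
    "baseball#game": "sport", "sport": "activity", "activity": "entity",
}

def ancestors(sense):
    """Map each hypernym of a sense to its distance from that sense."""
    dist = {}
    node, d = sense, 0
    while node in HYPERNYMS:
        node = HYPERNYMS[node]
        d += 1
        dist[node] = d
    return dist

def min_dist(sense_a, sense_b):
    """Shortest combined path to a hypernym shared by the two senses
    (an analogue of get_min_dist); lower scores are better."""
    anc_a, anc_b = ancestors(sense_a), ancestors(sense_b)
    shared = set(anc_a) & set(anc_b)
    if not shared:
        return None
    return min(anc_a[h] + anc_b[h] for h in shared)
```

Under this toy graph, min_dist("bat#club", "baseball#ball") is 4 (meeting at "artifact"), while min_dist("bat#animal", "baseball#ball") is 6 (meeting only at the very generic "entity"), so the club sense of "bat" would be preferred.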
If more than two words are included in the query, the final weight of each sense is the additive combination of its likelihood of occurring with each of the other words. Nouns, verbs, and adjectives are the only parts of speech considered, because they seem to be the most important for sense disambiguation, and the program only compares the senses of nouns with nouns, verbs with verbs, and adjectives with adjectives. Because this metric always produces some score even when comparing senses of different parts of speech, I obtained confusing results when I did not separate the comparisons by part of speech.
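The additive combination can be sketched as follows. The pairwise distances here are hypothetical stand-ins for the shared-hypernym metric; the sense labels and numbers are illustrative:

```python
# Hypothetical pairwise sense distances (smaller = more related).
DIST = {
    ("bat#club", "baseball#ball"): 4,
    ("bat#club", "baseball#game"): 6,
    ("bat#animal", "baseball#ball"): 6,
    ("bat#animal", "baseball#game"): 8,
}

def pair_dist(a, b):
    """Look up a symmetric pairwise sense distance."""
    return DIST.get((a, b), DIST.get((b, a)))

def sense_score(target_sense, other_words_senses):
    """Additive combination: for each other query word, take the best
    (smallest) distance from the target sense to any of that word's
    senses, then sum those minima across the other words."""
    return sum(min(pair_dist(target_sense, s) for s in senses)
               for senses in other_words_senses)

# Score both senses of "bat" against the candidate senses of "baseball".
scores = {s: sense_score(s, [["baseball#ball", "baseball#game"]])
          for s in ("bat#club", "bat#animal")}
best_sense = min(scores, key=scores.get)  # lowest score wins
```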
After the weights are calculated, the one with the lowest score (lower scores are better) and those within a threshold of this score are returned to the user for examination. If Lingua::Wordnet found no senses of a given word, the word is looked up using the module WordNet::QueryData, which contains many more synsets although it is slow to load initially and provides fewer functions for the manipulation of the data it returns.
After this data has been returned, the user either chooses the best sense for each word from those returned or stays with the defaults, and submits his choices. The program then takes this information and combines it with the original query to formulate a much larger query based on the synsets of the chosen senses. Of the words added to the query, repeated terms and terms that appeared in the original query are removed. Next, the remaining words are ranked according to their lengths and the first ten are chosen for inclusion, which makes it unlikely that the maximum query length that AltaVista allows will be exceeded.
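The pruning step above can be sketched like this. The function name is hypothetical, and the text does not say which end of the length ranking is kept, so shortest-first (which keeps the query compact) is an assumption:

```python
def expand_terms(original_query, synset_words, limit=10):
    """Prune candidate expansion words: drop words already in the query
    and duplicates, rank the rest by length (shortest first here; the
    ranking direction is an assumption), and keep at most `limit`."""
    orig = set(original_query.lower().split())
    seen = set()
    candidates = []
    for word in synset_words:
        word = word.lower()
        if word in orig or word in seen:
            continue  # skip original-query terms and repeats
        seen.add(word)
        candidates.append(word)
    return sorted(candidates, key=len)[:limit]
```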
The format of the query is then as follows:
($orig_query) NEAR ($additional_query_words)

where $orig_query is the original query words joined by AND's, and $additional_query_words contains the words added by the program joined by OR's.
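Assembling the final query string from the two word lists is then straightforward; this small sketch (function name hypothetical) shows the format:

```python
def build_query(orig_words, added_words):
    """Assemble the final AltaVista query string: original terms joined
    by AND, expansion terms joined by OR, combined with NEAR."""
    orig = " AND ".join(orig_words)
    added = " OR ".join(added_words)
    return f"({orig}) NEAR ({added})"
```

For example, the query words ["bat", "baseball"] with added words ["club", "lumber"] produce "(bat AND baseball) NEAR (club OR lumber)".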
Data/System Analysis
I evaluated my system by running ten different queries through both my method and a basic AltaVista search and scoring the top ten pages returned for relevance. In my analysis, relevant documents received a score of 5, partially relevant documents a score of 1, and irrelevant documents a score of 0. After collecting this data I performed a discounted cumulative gain (DCG) calculation on the results of each query, which allowed a fairly straightforward comparison between my system and the AltaVista results in these ten head-to-head competitions.
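DCG can be computed as below. The text does not state the discount base used, so log base 2 (the common choice) is an assumption here:

```python
from math import log2

def dcg(relevances):
    """Discounted cumulative gain: the first result's score counts in
    full; the score at rank i >= 2 is divided by log2(i)."""
    if not relevances:
        return 0.0
    total = float(relevances[0])
    for rank, rel in enumerate(relevances[1:], start=2):
        total += rel / log2(rank)
    return total
```

For instance, a result list scored [5, 0, 1] yields 5 + 0/log2(2) + 1/log2(3) ≈ 5.63, so relevant documents near the top dominate the score.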
The queries I chose were fairly subjective; they mainly consisted of queries that I thought would be a good test of the system because their terms could be very ambiguous in isolation yet could be disambiguated easily by a human based on the other words in the query. The full results of this process are included in the appendix, but a summary appears in the table below.
Search Terms | DCG of AltaVista Results | DCG of WordNet-Enhanced AltaVista Results
[Table: DCG scores for the ten test queries (among them "run race"); the individual values appear in the appendix.]
The results I obtained from this somewhat limited analysis are promising although far from conclusive. The AltaVista search enhanced with WordNet outperformed the regular AltaVista search in 6 of the 10 trials. The words added from WordNet seemed to increase the specificity of the search, which often meant that relevant results were more likely to be returned. For example, when I searched for "spring drink" I was looking for references to drinking from a natural spring, or at the very least drinking spring water; while my system reported several instances of this, the regular AltaVista search contained results that merely had both of those words somewhere on the page, unrelated to one another. The additional words added by my system seemed to prevent AltaVista from simply returning pages with unconnected references to both terms, and made it return pages that were actually about springs and drinking from them.
The increased specificity that the enhanced version provided also sometimes hurt performance, when it made the query too specific to a concept related to, but not the same as, the target concept. This happened with the search for "artificial grass pictures." The WordNet-enhanced query contained terms such as "representation," "pasturage," and "surface," which caused many sites about real grass and farmland to be returned. The regular AltaVista search seemed to be "more wrong" in the irrelevant pages it returned, often returning sites advertising artificial limbs, for example, but it also returned a relevant document in the first position. Of course, it is hard to determine the statistical significance of that one document.
One of the goals of this project was to expand the query so that it included words associated with those included in the original query so documents that dealt with the right concept but used different words than the query would still be returned. While this often helped, it also sometimes caused the enhanced version to return results that were much more general than what the query indicated. For example, the enhanced version returned a lot of generic pages about baseball when it searched for the query “bat baseball” but didn’t return any that were specific to baseball bats. The regular search, however, returned several such pages, presumably because it was not distracted by the inclusion of a bunch of optional general terms. This also appeared to be the problem with the searches for “genetic algorithm evolution” and “artificial grass picture.” Both of these queries were already very specific and adding the additional terms just introduced related but irrelevant documents to the results.
Conclusion
A system like this one, which adds terms to basic queries by automatically disambiguating word senses using WordNet, has the potential to improve the results of basic information retrieval systems. In that sense, this project confirmed the mixed results that other projects have found. More research would have to be conducted to determine conclusively whether adding such a system to more advanced, better-performing search engines such as Google would be worthwhile. In any case, further research in this area seems warranted and could produce worthwhile results in the future.
Works Cited
Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. Modern Information Retrieval. New York, NY: ACM Press, 1999.
Brian, Dan. “Lingua::Wordnet.” 24 April 2002. <http://www.brians.org/wordnet/article/>
Gonzalo, Julio, Felisa Verdejo, Irina Chugur, and Juan Cigarran. “Indexing with WordNet Synsets Can Improve Text Retrieval.” Ciudad Universitaria.
Manning, C. “Word Sense Disambiguation.” The Purpose. 18 April 2002. <http://citeseer.nj.nec.com/41667.html>