
RDF Topicmaps: Theory
Our goal in the Topicmaps project is to bootstrap the efforts to meld natural-language-processing technologies with Semantic Web development. All of the components in topicmaps.jar can be improved upon, but we offer this collection of source code as a "bare-bones" starting point. In this section, we offer some suggestions for further development.
- The noun phrase extractor.
To extract noun phrases, the text must first be tagged by part of speech. The noun phrase extractor then selects sequences of tokens whose tags make up grammatical English noun phrases, for example adjectives, nouns, conjunctions and determiners (such as "the"). The part-of-speech tagger we used simply looks up each word's tag in a dictionary; if the word is not found, the tagger assumes it is a noun. This simple scheme works reasonably well, but it could be replaced by the standard public-domain part-of-speech tagger written by the computational linguist Eric Brill, which is available at:
http://www.cs.jhu.edu/~brill.
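The dictionary-lookup scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in topicmaps.jar: the class name NounPhraseSketch, the tag names, the toy lexicon, and the rule that a phrase must end in a noun are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NounPhraseSketch {
    // Hypothetical toy lexicon; a real tagger would load a much larger one.
    static final Map<String, String> LEXICON = Map.of(
        "the", "DET",
        "a", "DET",
        "semantic", "ADJ",
        "natural", "ADJ",
        "and", "CONJ",
        "uses", "VERB",
        "is", "VERB");

    // Tags allowed inside a noun phrase, following the list in the text.
    static final Set<String> NP_TAGS = Set.of("DET", "ADJ", "NOUN", "CONJ");

    // Dictionary lookup; words missing from the lexicon are assumed to be nouns.
    static String tag(String word) {
        return LEXICON.getOrDefault(word.toLowerCase(), "NOUN");
    }

    // Collect maximal runs of tokens whose tags may appear in a noun phrase.
    static List<String> extract(String text) {
        List<String> phrases = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (String token : text.split("[^A-Za-z'-]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token for leading punctuation
            }
            if (NP_TAGS.contains(tag(token))) {
                run.add(token);
            } else {
                flush(run, phrases);
            }
        }
        flush(run, phrases);
        return phrases;
    }

    // Close the current run: trim trailing non-nouns (assumption: a phrase
    // must end in a noun), keep what is left, and reset the run.
    static void flush(List<String> run, List<String> phrases) {
        while (!run.isEmpty() && !tag(run.get(run.size() - 1)).equals("NOUN")) {
            run.remove(run.size() - 1);
        }
        if (!run.isEmpty()) {
            phrases.add(String.join(" ", run));
        }
        run.clear();
    }

    public static void main(String[] args) {
        // "tagger" and "dictionary" are not in the lexicon, so both default to NOUN.
        System.out.println(extract("The tagger uses a dictionary"));
    }
}
```

Because unknown words default to nouns, verbs missing from the lexicon would be absorbed into phrases, which is one reason a trained tagger may be worth substituting.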
- The noun phrase filter.
The trick to getting usable results for the topicmap is in the design of the filter. This is a big topic, so we would like to give you some guidance here.
Some suggestions:
- The code in this installation implements a couple of simple ideas. An automated approach could take advantage of the frequency information that is collected. We also provide two lists that permit some limited human intervention. The first is a stopwords list, located at "wordsmith/ORG/oclc/wordsmith/ngrams/stopwords.txt". Here you can include words that cause problems for your dataset, such as names of HTML tags that weren't stripped out in the preprocessing step. The other file in that directory, filtered.terms, is a set of "gold standard" terms that may be identified with human input from the subject domain of your data. It is used in the class rdfmain.RDFRelations to create meaningful sets of syntactically related terms, because we can't reliably assign internal structure to noun phrases of more than two words without more sophisticated processing or human guidance. Godby and Reighart (1998) discuss this point further.
- If you want to create a more sophisticated noun-phrase filter, there are two approaches. We have experimented with a "knowledge-rich" approach that analyzes text for simple, easily computable cues that a given noun phrase is used as a word. Godby and Reighart (1999, 2001), Godby and Smith (2002), and Godby (2002) describe this strategy. The file "wordsmith/ORG/oclc/wordsmith/ngrams/filtered.terms" contains the output of the algorithm described in Godby (2002) when it was applied to oclc.org, portions of w3c.org and dublincore.org, the data that was used to create our version of the RDF Topicmaps demo. Another approach is encoded in the Kea project, available at
http://www.nzdl.org/Kea/.
It uses "knowledge-poor" information-retrieval measures to identify significant phrases in documents. The Kea site also has downloadable Open Source code for identifying and filtering significant noun phrases.
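As a rough illustration of combining the stopwords list with the collected frequency information, the sketch below keeps a phrase only if it occurs often enough and contains no stopword. It is not the filter in this installation: the class name PhraseFilterSketch, the minCount threshold, and the rule that a single stopword rejects the whole phrase are assumptions, and the stopwords are passed in memory rather than read from stopwords.txt, whose format is not described here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PhraseFilterSketch {
    // Count occurrences of each phrase, ignoring case and keeping first-seen order.
    static Map<String, Integer> countPhrases(List<String> phrases) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String phrase : phrases) {
            counts.merge(phrase.toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    // Assumption: one stopword anywhere in the phrase is enough to reject it.
    static boolean containsStopword(String phrase, Set<String> stopwords) {
        for (String word : phrase.split("\\s+")) {
            if (stopwords.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Keep phrases seen at least minCount times that contain no stopword.
    static List<String> filter(List<String> phrases, Set<String> stopwords, int minCount) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : countPhrases(phrases).entrySet()) {
            if (entry.getValue() >= minCount && !containsStopword(entry.getKey(), stopwords)) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "td" stands in for an HTML tag name that survived preprocessing.
        List<String> phrases = List.of(
            "semantic web", "Semantic Web", "td width", "td width", "metadata");
        System.out.println(filter(phrases, Set.of("td"), 2));
    }
}
```

A raw count threshold is the simplest possible use of frequency; the measures used by projects such as Kea are more refined.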
- The relationship generator.
Our goal in writing the relationship generator was to identify simple, thesaurus-like relations such as "broader-than" using only a list of words as input. Thus, it is limited to sets of words that have partial string matches. This approach accurately captures the relationship between "linguistics" and "computational linguistics" but does not identify more abstract relationships like the whole-part relation in "car" and "engine." Identifying such relationships is far more difficult and error-prone; Caraballo (2001) has an up-to-date discussion. The acronym identifier can also be improved. See Bowden et al. (1998) for more discussion.
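One way to read "partial string matches" is sketched below: a term is treated as broader than any longer term that ends with it at a word boundary. This is an assumed matching rule chosen to reproduce the "linguistics" example, not a description of rdfmain.RDFRelations, and the class name BroaderThanSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BroaderThanSketch {
    // True if `narrow` is a longer phrase that ends with `broad` as whole words.
    static boolean isBroaderThan(String broad, String narrow) {
        String b = broad.toLowerCase().trim();
        String n = narrow.toLowerCase().trim();
        // The leading space forces the match to start at a word boundary.
        return n.length() > b.length() && n.endsWith(" " + b);
    }

    // Compare every pair of terms and collect {broader, narrower} pairs.
    static List<String[]> relations(List<String> terms) {
        List<String[]> pairs = new ArrayList<>();
        for (String broad : terms) {
            for (String narrow : terms) {
                if (isBroaderThan(broad, narrow)) {
                    pairs.add(new String[] {broad, narrow});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> terms = List.of(
            "linguistics", "computational linguistics", "car", "engine");
        for (String[] pair : relations(terms)) {
            System.out.println(pair[0] + " broader-than " + pair[1]);
        }
        // prints: linguistics broader-than computational linguistics
    }
}
```

As the text notes, nothing links "car" and "engine" here, because the two terms share no string material.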
References
Paul Bowden, Lindsay Evett, and Peter Halstead. Automatic acronym acquisition in a knowledge extraction program. In: D. Bourigault, C. Jacquemin and M. L'Homme, eds. Computerm '98: First Workshop on Computational Terminology: Proceedings of the Workshop, pp 43-49. 1998.
Sharon Caraballo. Automatic Construction of a Hypernym-Labeled Noun Hierarchy. Ph.D. Dissertation, Department of Computer Science, Brown University, 2001.
Carol Jean Godby. A Computational Study of Lexicalized Noun Phrases in English. Ph.D. dissertation, Department of Linguistics, The Ohio State University, 2002.
Carol Jean Godby and Devon Smith. Strategies for Subject Navigation using RDF Topicmaps. Presentation at the Knowledge Technologies Conference, March 2002. Accessible at: http://staff.oclc.org/~godby/auto_class/godby_kt2002.ppt
Carol Jean Godby and Ray Reighart. The OCLC WordSmith Indexing Project. In: Annual Review of OCLC Research, 1998. Accessible at: http://www.oclc.org/research/publications/arr/1998/godby_reighart/wordsmith.htm
Carol Jean Godby and Ray Reighart. Terminology Identification in a Collection of Web Resources. In: Karen Calhoun and John Riemer, eds. CORC: New Tools and Possibilities for Cooperative Electronic Resource Description, pp 49-66. New York: The Haworth Information Press, 2001.