RDF Topicmaps

Project Overview

Many institutions are struggling to solve problems with their official Web sites, which represent a daunting maintenance task because the contents constantly change. Editors can't exert sufficient control because the sites that they manage often point to external sites that have the same problems. One unfortunate effect is that an institution's major presence on the Web is often notoriously difficult to navigate, despite the fact that cross-linked sites maintained by official Webmasters have collections of documents on related subjects. If the user is lost in a tangle of obsolete pages and links, so is an opportunity to take advantage of the original human effort that was invested in imposing order on a small corner of cyberspace.

We believe that part of this problem can be solved by adding subject-rich metadata to Web pages. Ideally, automated or semi-automated tools would perform this task whenever a Web page is created or modified, fulfilling one of the visions of the Semantic Web. Right now, we can only simulate this vision by manipulating the pages as if their authors had complied with the evolving Semantic Web metadata standards. To illustrate, the demo accessible from this page uses real-world data from OCLC's official site, www.oclc.org, which has many links to the Dublin Core site, dublincore.org, as well as to the World Wide Web Consortium's site, www.w3.org. Our processes use natural-language processing software to collect keywords and phrases from HTML pages and organize them into a topic map that resembles a thesaurus. The result is visible to the user in a browsing tool that appears to sit on top of all three sites and offers some new strategies for subject navigation that are not available in the original sites.

Our demo is built from Open Source software and can be downloaded from this site. The software that we have written harvests Web pages, extracts subject-oriented keywords and phrases, populates a MySQL database containing RDF triples, and organizes the concepts into a topic map. The results are made available to the user through an interface rendered by XSL and XML stylesheets. The processes can be integrated to create an environment for implementing and improving tools for subject-oriented browsing of Web collections.
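The steps described above (extract keywords, store them as RDF-style triples, group the concepts into a topic map) can be illustrated with a minimal Python sketch. It is not the downloadable software. Several things here are assumptions made for illustration: the pages are in-memory HTML strings rather than harvested Web pages, keyword extraction is a simple word-frequency count rather than natural-language processing, Python's built-in `sqlite3` stands in for MySQL, the `dc:subject` predicate label and the table layout are invented, and two terms are treated as related whenever they occur in the same page.

```python
import re
import sqlite3
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "and", "for", "that", "with", "are", "this", "from"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def extract_keywords(html, top_n=5):
    """Return the most frequent non-stopword terms in a page (frequency stand-in for NLP)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]{3,}", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def store_triples(conn, url, keywords):
    """Store one (page, dc:subject, keyword) triple per keyword."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany(
        "INSERT INTO triples VALUES (?, ?, ?)",
        [(url, "dc:subject", keyword) for keyword in keywords],
    )
    conn.commit()


def build_topic_map(conn):
    """Relate two terms whenever they describe the same page (co-occurrence assumption)."""
    rows = conn.execute(
        "SELECT subject, object FROM triples WHERE predicate = 'dc:subject'"
    ).fetchall()
    terms_by_page = defaultdict(set)
    for url, term in rows:
        terms_by_page[url].add(term)
    topic_map = defaultdict(set)
    for terms in terms_by_page.values():
        for term in terms:
            topic_map[term] |= terms - {term}
    return dict(topic_map)


if __name__ == "__main__":
    # Hypothetical pages; a real run would harvest them from the Web.
    pages = {
        "http://example.org/a": "<p>Metadata metadata libraries</p>",
        "http://example.org/b": "<p>Libraries and classification of libraries</p>",
    }
    conn = sqlite3.connect(":memory:")
    for url, html in pages.items():
        store_triples(conn, url, extract_keywords(html))
    for term, related in sorted(build_topic_map(conn).items()):
        print(term, "->", sorted(related))
```

Keeping the triples in a database table, rather than building the topic map directly from the pages, mirrors the separation described above: the extraction step and the topic-map step can then be improved independently.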

Some of the topics discussed in the three Web sites in our demo are shown in this figure.

You can start your exploration of the topic map by typing "libraries", "classification", "metadata" or "xml" in the Search box. Pressing the Search Topics button places you in the topic map at the search term, while pressing the Extended Search button produces a list of documents organized by the search term and the closely related terms that are represented in the topic map. On every screen, you have the choice to explore the topic map or retrieve Web documents that have been organized by topics. Clicking on a term keeps you in the topic map, while clicking on a document title retrieves a document.
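The difference between the two buttons can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the demo's code: the function names, the data layout, the related-term links, and the document titles are all hypothetical.

```python
# Hypothetical topic map: each term points to its closely related terms.
TOPIC_MAP = {
    "metadata": ["xml", "classification"],
    "classification": ["libraries", "metadata"],
    "libraries": ["classification"],
    "xml": ["metadata"],
}

# Hypothetical index of document titles organized by topic.
DOCUMENTS = {
    "metadata": ["Document A"],
    "classification": ["Document B", "Document C"],
    "libraries": ["Document D"],
    "xml": ["Document E"],
}


def normalize(term):
    """Lower-case and trim a search term typed by the user."""
    return term.strip().lower()


def search_topics(term):
    """Search Topics: land in the topic map at the term and show its neighbors."""
    key = normalize(term)
    if key not in TOPIC_MAP:
        return None
    return {"term": key, "related": sorted(TOPIC_MAP[key])}


def extended_search(term):
    """Extended Search: list documents under the term, then under each related term."""
    key = normalize(term)
    if key not in TOPIC_MAP:
        return {}
    results = {}
    for topic in [key] + sorted(TOPIC_MAP[key]):
        results[topic] = DOCUMENTS.get(topic, [])
    return results


if __name__ == "__main__":
    print(search_topics("metadata"))
    print(extended_search("metadata"))
```

In this sketch, following a term from the `related` list corresponds to clicking on a term and staying in the topic map, while following a title from `extended_search` corresponds to clicking on a document title.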