Many institutions are struggling to solve problems with
their official Web sites, which represent a daunting maintenance task because
the contents constantly change. Editors
can't exert sufficient control because the sites that they manage often point
to external sites that have the same problems.
One unfortunate effect is that an institution's major presence on the
Web is often notoriously difficult to navigate, despite the fact that
cross-linked sites maintained by official Webmasters have collections of
documents on related subjects. When a
user gets lost in a tangle of obsolete pages and links, the opportunity to
take advantage of the original human effort that was invested in imposing order
on a small corner of cyberspace is lost as well.
We believe that part of this problem can be solved by adding
subject-rich metadata to Web pages. Ideally, automated or semi-automated tools
would perform this task whenever a Web page is created or modified, fulfilling
one of the visions of the Semantic Web.
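As a rough illustration of what adding subject metadata to a page could look like, the sketch below writes subject terms into an HTML head using the common Dublin Core convention of `DC.subject` meta tags. The function names (`subject_meta_tags`, `add_subject_metadata`) and the sample terms are hypothetical and are not taken from our software; a real tool would also need to handle pages without a plain `<head>` tag.

```python
from html import escape

# Namespace of the Dublin Core element set, declared once per page.
DC_SCHEMA = "http://purl.org/dc/elements/1.1/"


def subject_meta_tags(subjects):
    """Return HTML head lines that declare each subject as a DC.subject tag."""
    lines = [f'<link rel="schema.DC" href="{DC_SCHEMA}">']
    for subject in subjects:
        content = escape(subject, quote=True)  # keep the attribute value valid
        lines.append(f'<meta name="DC.subject" content="{content}">')
    return lines


def add_subject_metadata(html, subjects):
    """Insert the tags right after the first <head> tag (hypothetical helper)."""
    head_open = "<head>"
    index = html.find(head_open)
    if index == -1:
        return html  # assumption: pages without a plain <head> are left as-is
    cut = index + len(head_open)
    tags = "\n".join(subject_meta_tags(subjects))
    return html[:cut] + "\n" + tags + html[cut:]


page = "<html><head><title>Demo</title></head><body></body></html>"
print(add_subject_metadata(page, ["metadata", "topic maps & XML"]))
```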
Right now, we can only simulate this vision by manipulating the pages as
if the authors could comply with the evolving Semantic Web metadata
standards. To illustrate, the demo
accessible from this page uses real-world data from OCLC's official site,
www.oclc.org, which has many links to the
Dublin Core site, dublincore.org,
as well as to the World Wide Web Consortium's site,
www.w3.org.
Our processes use natural-language processing software to collect keywords and
phrases from HTML pages and organize them
into a topic map that resembles a thesaurus. The result is
visible to the user in a browsing tool that appears to sit on top of all three
sites and offers some new strategies for subject navigation that are not
available in the original sites.
Our demo is built from Open Source software and can be
downloaded from this site.
The software
that we have written harvests Web pages, extracts subject-oriented keywords and
phrases, populates a MySQL database containing RDF triples, and organizes the
concepts into a topic map.
The results
are made available to the user through an interface rendered from XML by XSL
stylesheets.
The processes can be
integrated to create an environment for implementing and improving tools for
subject-oriented browsing of Web collections.
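The stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our actual software: a word-frequency count stands in for the natural-language processing step, an in-memory SQLite database stands in for MySQL, the `dc:subject` predicate and the `example.org` pages are invented for the example, and harvesting is replaced by two inline HTML strings.

```python
import re
import sqlite3
from collections import Counter, defaultdict
from html.parser import HTMLParser

STOP_WORDS = {"the", "a", "an", "and", "of", "for", "to", "in", "is", "are", "on", "with"}


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def extract_keywords(html, limit=5):
    """Naive stand-in for NLP: the most frequent non-stop words on the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]


def build_store(pages):
    """Store one (page URL, dc:subject, keyword) triple per extracted keyword."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    for url, html in pages.items():
        rows = [(url, "dc:subject", kw) for kw in extract_keywords(html)]
        conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def build_topic_map(conn):
    """Map each topic to its documents and to the topics it co-occurs with."""
    by_doc = defaultdict(set)
    query = "SELECT subject, object FROM triples WHERE predicate = 'dc:subject'"
    for url, keyword in conn.execute(query):
        by_doc[url].add(keyword)
    topic_map = defaultdict(lambda: {"documents": set(), "related": set()})
    for url, keywords in by_doc.items():
        for keyword in keywords:
            topic_map[keyword]["documents"].add(url)
            topic_map[keyword]["related"].update(keywords - {keyword})
    return dict(topic_map)


# Hypothetical pages standing in for harvested Web documents.
PAGES = {
    "http://example.org/a": "<html><body><h1>Metadata for libraries</h1>"
                            "<p>Libraries use metadata.</p></body></html>",
    "http://example.org/b": "<html><body><p>XML metadata schemas</p>"
                            "<script>var x = 1;</script></body></html>",
}

store = build_store(PAGES)
topics = build_topic_map(store)
print(sorted(topics["metadata"]["related"]))
```

Treating co-occurrence on the same page as "related" is only one simple way to link topics; a thesaurus-like map would normally distinguish broader, narrower, and associated terms.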
Some of the topics discussed in the three Web sites in our
demo are shown in this figure.
You can start your exploration of the topic map by typing "libraries",
"classification", "metadata" or "xml" in the Search box.
Pressing the Search Topics button
places you in the topic map at the search term, while pressing the Extended
Search button produces a list of documents organized by the search term
and the closely related terms that are represented in the topic map. On every
screen, you have the choice to explore the topic map or retrieve Web documents
that have been organized by topics.
Clicking on a term keeps you in the topic map, while clicking on a
document title retrieves a document.
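The difference between the two buttons can be sketched as two lookups over the same topic map. This is an illustrative sketch only: the toy `TOPIC_MAP`, its document titles, and the function names `search_topics` and `extended_search` are assumptions made for the example and do not describe the demo's internal data or code.

```python
# Toy topic map: each topic lists closely related topics and document titles.
# All entries are made up for illustration.
TOPIC_MAP = {
    "metadata": {"related": ["xml", "classification"],
                 "documents": ["Metadata overview"]},
    "xml": {"related": ["metadata"], "documents": ["XML primer"]},
    "classification": {"related": ["metadata", "libraries"],
                       "documents": ["Classification schemes"]},
    "libraries": {"related": ["classification"], "documents": []},
}


def search_topics(topic_map, term):
    """Place the user in the topic map at `term`: the topic plus its neighbors."""
    key = term.strip().lower()
    entry = topic_map.get(key)
    if entry is None:
        return None
    return {"topic": key, "related": list(entry["related"])}


def extended_search(topic_map, term):
    """List documents grouped by `term` and by its closely related terms."""
    key = term.strip().lower()
    if key not in topic_map:
        return {}
    grouped = {}
    for topic in [key] + list(topic_map[key]["related"]):
        documents = topic_map.get(topic, {}).get("documents", [])
        if documents:  # omit topics that have no documents attached
            grouped[topic] = list(documents)
    return grouped


print(search_topics(TOPIC_MAP, "metadata"))
print(extended_search(TOPIC_MAP, "metadata"))
```

In this sketch, following a related term corresponds to calling `search_topics` again on that term, while choosing a document title corresponds to leaving the topic map and retrieving the document.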