Introduction to the Semantic Web: Organizing the Web for Better Information
Retrieval
or
Can Librarians really save the Web?
Paper presented at Knowledge Technologies 2002, Seattle, March 2002,
by Suellen Stringer-Hye, Vanderbilt University
Abstract
Libraries have long been storing and classifying the record
of the world's shared knowledge for efficient retrieval. The web, in many
ways, has become an extension of the library but without the structures
that make optimized retrieval possible. Additionally it is not possible
to employ traditional methods of classifying knowledge to the large sets
of data and information now being generated electronically. The World
Wide Web, in order to be truly useful, must adapt techniques used by libraries
for centuries as well as rely on new technologies not yet fully developed.
This paper will present an overview of the range of technological solutions
currently under consideration for the building of the "Semantic Web"---
a web maximized for information and knowledge storage and retrieval.
Introduction
One of the most famous libraries in the world is the Library of Babel.
This library is imaginary, invented by Jorge Luis Borges in his story
of the same name. The Library of Babel is enormous, for it contains all
of the world's knowledge, but there are few librarians and no order to
its contents. To find anything one must wander the stacks, knowing that
what is sought exists but not knowing in which of the infinite hexagons
it is stored. The Library of Babel contains all that ever was or ever
will be, but there is no way, to use a bit of library jargon, to "access"
this information. Although the story was written in 1941, one can easily
see the resemblance of the Web to the Library of
Babel. Borges notes "When it was announced that the Library contained
all books, the first reaction was unbounded joy . . .” but adds later,
“[t]hat unbridled hopefulness was succeeded, naturally enough, by a similarly
disproportionate depression.” Librarianship as a profession exists to
connect people with information, knowledge and ideas. Uniquely situated,
as Bonnie Nardi points out in her book Information Ecologies, librarians
are a keystone species, crucial to the system of people, practices,
values and technologies that characterize a living information space.
Over the years many tools have been developed for linking people and their
inquiries with the information that satisfies those inquiries.
But the web is not exactly like a library, either. As Ramonet stated
in ‘La Tyrannie de la communication’ (Paris: Éditions Galilée,
1999), we are experiencing ‘prolifération d’une information de
plus en plus diffusée, et de moins en moins contrôlée’
(‘proliferation of information in a form which is more and more diffuse
and less and less subject to control’). While much can be learned from
the historical experience of Libraries and Library and Information Science,
traditional approaches to information management will need to evolve both
in the context of the library and the web in order to reflect this new
reality.
What is Information?
Halfway through the last century, information became a thing. In 1948
Claude Shannon published "A Mathematical Theory of Communication" in the
Bell System Technical Journal. From that moment forward, information
became a commodity, a quantity to be measured and analyzed. The meaning
of the content of the information played no role in this new technical
sense. The chaotic systems, strings of random numbers, bits and bytes
running through wires, though dense with information, were meaningless.
This definition of information fueled the modern industries of information
and communication delivery. Information may also be conceived, however,
as "something that is communicated" where communication is "a personal,
intentional, transactional, symbolic process for the purpose of exchanging
ideas, feelings, etc.". It is with this second definition that this paper
will be concerned, for libraries have been managing this sort of information
long before it traveled by wire, in circuits or found a home in a database
somewhere.
Of this kind of information there are two types: that which can be retrieved
and that which cannot. Most of what is and has been communicated by humans
over the ages can never be recovered. It consists of the infinity of unrecorded
conversations, discarded notes and jottings, telephone calls and destroyed
documents that have melted away like the snows. What remains are the few
mounds from which we may attempt to extract and make sense of the utterances
of our ancestors. Though the pace, quantity and methods of accumulation
have perhaps accelerated, the process remains the same---humans communicate
and sometimes record, store and eventually preserve that communication.
Once it has been recorded it can be recovered. For most of "recorded"
history, communications were in the form of written documents and material
objects. Only in the last 200 years have we seen the proliferation of
non-text materials such as sound recordings, films, videos and photographs.
Much of this material can be turned into bits and bytes, blurring the
distinction between and the primacy of text-based information. We now
live in a world in which most recorded information can be, if not always
is, digitized.
The Web and the Library
When Tim Berners-Lee first proposed the concepts that later became the
World Wide Web, he was concerned with information loss. Knowledge at the
CERN laboratories in Switzerland, where he was employed, was shared through
conversation and social interaction but there was no means by which this
information, even if it had been recorded, could be retrieved, especially
in a manner that mimicked the flexibility and dynamism of an open organization
such as CERN. When the dream of a "universal linked information space"
became a reality larger than perhaps even he imagined, libraries encountered
the first real competition that they had ever faced. People now had access
to "the world of information" or rather "a world of information" without
needing the institution of the library to collect it for them. In 2000,
the Urban Libraries Council conducted a consumer market analysis of library
and web services. They determined that users (or patrons as they are called
in the library community) found the web more private, more entertaining,
easier to use, with more current information available at more times than
the library. Efficient search engines employing sophisticated algorithms
and natural language processing have delivered low cost, high quality
information retrieval for many information needs.
Unfortunately, the vast amount of information now available is
beginning to overwhelm the ability of search engines. In the same Urban
Libraries Council study, users reported that they could count on the library for the
accuracy of the information and the help they received in finding it. In
spite of the enormous revolutions in unstructured information retrieval,
the real potential of the web can only be realized when, like the
information in libraries, it is organized in some meaningful way.
Principles of Information Organization
Elaine Svenonius in her recent book The Intellectual Foundation
of Information Organization states that the purpose of information
organization is "to bring essentially like information together and to
differentiate what is not exactly alike". Before databases and the
innovations in keyword searching, this was done through classification
mechanisms such as controlled vocabularies, uniform headings, taxonomies
and classification ontologies; the association of meaningful tags and
semantic interconnections to information units for the purposes of
finding, identifying, selecting or obtaining the document at some later
time. With increasing sophistication, these techniques have been employed
from the days of the Roman scroll to the current highly automated and
digitized library catalog environment. When one finds what one is looking
for in a library, it is because of the accumulated semantic associations
that these materials have "invisibly" accreted over the years.
Why can't we just catalog the web?
The process by which this semantic information is associated with a document
is called in libraries "cataloging". A unit of information is assigned
a number, often called a "call number" and information such as the book's
title, author and subject matter is associated with it. Today we call
this kind of information "metadatata" i.e. data about data. Metadata is
then used to locate information relevant to the inquiry so that the person
asking the question does not have to wander the library stacks as in Borges'
Library of Babel. Since the web is now something of a vast library, why
not just catalog the items in it?
For one thing, the proliferation of information and information formats
has become so profuse that the fine craft of "cataloging" is too labor
intensive and too time consuming and therefore too expensive to apply
to all resources that appear on the web. Secondly, when information is
digital, information units such as "book", "article" or "chapter" may
be broken down into even smaller units such as "paragraph", "sentence"
or even "letter" if there were some meaningful reason to do so, thus creating
even more data in need of "cataloging". And worst of all, as a colleague
who used to work at dmoz.org said "books do not change their content,
whereas web pages do it all the time. You can't make the web resources
stand still on the shelves...."
Additionally, libraries, which have successfully employed shared classification
schemes and taxonomies with standardized rules for applying them, are
no longer the only organizers of the shared information universe. Anyone
in possession of data now must manage its storage, preservation and content
if it is to be available for future use. Organizational schemes that work
well for managing bibliographic information may not be valuable for information
generated by business processes, policing and security agencies, or any
other organization generating meaningful information. There must be some
way of integrating the multiplicity of organizational schemes currently
in use or in development.
In Tim Berners-Lee's original proposal to CERN, he noted that computer
files were managed in hierarchical structures but this did not accurately
reflect the way that human minds process information. Through hyperlink
technology he was hoping to better reflect the way in which people think,
creating bridges over information ravines between one hierarchical classification
scheme and another.
Webbed structures rather than tree structures, he thought, more closely
modeled the real world.
New Techniques
Although Berners-Lee's idea quickly created a new shared information
space, it was missing a critical layer of his original design. To address
this problem he distinguished between the current web which is like Borges'
Library of Babel except that there are robots to do the searching and
the "Semantic Web". The Semantic Web is a web in which declarations are
made about units of information so that like the information in a library,
it can can be brought together or differentiated or manipulated in various
useful ways. Once this layer of information is available, the potential
for transformation is as explosive as was that of the original web. Although
as an exchange format XML is essential, XML alone is not enough to ensure
semantic interoperability on the web. Several technologies are being developed
that could help bring this new semantic layer to the web.
Metadata
The Dublin Core Metadata Standard is a simple set of fifteen essential
elements necessary to describe a wide range of networked resources. Because
the Dublin Core Metadata Element Set (DCMES) is derived in part from the
MARC record, which libraries have used to encode semantic information for
machine retrieval for many years, it is a very practical way for non-librarians
or other information managers to indicate "aboutness" information in documents.
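To make this concrete, the sketch below shows, in Python, how a handful of
the fifteen elements might be recorded for a single web resource. The element
names (title, creator, subject and so on) are genuine DCMES elements; the
resource and the values assigned to it are invented for illustration.

    # A hypothetical Dublin Core description of one web resource,
    # expressed as a simple mapping from element name to value.
    dublin_core_record = {
        "title": "Introduction to the Semantic Web",
        "creator": "Stringer-Hye, Suellen",
        "subject": "Semantic Web; metadata; libraries",
        "description": "Overview of technologies for organizing the web.",
        "date": "2002-03",
        "type": "Text",
        "format": "text/html",
        "identifier": "http://example.org/papers/semantic-web-intro",
        "language": "en",
    }

However the description is expressed (as HTML META tags, as RDF, or otherwise),
the same small vocabulary gives very different kinds of resources a common,
searchable description.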
The Dublin Core standard is not the only available metadata standard.
Standards exist in many fields for which the Dublin Core is not applicable.
Examples of these standards include the Geospatial Metadata Standard,
HTML META tags, MARC (the metadata standard used in libraries) and PRISM, a
metadata standard for syndicating, aggregating, post-processing and multi-purposing
content from magazines, news, catalogs, books and mainstream journals.
Resource Description Framework
The Resource Description Framework or RDF is a W3C Specification used
to assign metadata to electronic resources. A resource or document as
defined in the Resource Description Framework is anything that can have
an addressable location. This location information is like the call number
of the book in the virtual library of the web. RDF then lets you assign
a property to this resource, such as "author", "title" or "publisher",
as defined by Dublin Core or some other set of conventions. RDF then
creates a statement: the author (as defined by, say, Dublin Core) of
Huckleberry Finn (as defined by a URL indicating the location of an
electronic version of this work) is Mark Twain (as defined by the Library
of Congress Personal Name file, which associates Mark Twain with Samuel
Clemens). The original architecture of the present web, the one that
transformed it into a new kind of information space, sacrificed link
integrity for scalability. In order for the Semantic Web to be equally
successful, it must allow anybody to say anything about anything. The
Resource Description Framework allows any statement to be made as long
as there are pointers to who is saying so and in what context.
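As a rough sketch of that statement in code, the Python fragment below uses
the open-source rdflib library to record the triple; the URL identifying the
electronic text is an invented placeholder, and only the Dublin Core "creator"
element is used.

    from rdflib import Graph, URIRef, Literal
    from rdflib.namespace import DC

    g = Graph()
    g.bind("dc", DC)

    # Hypothetical address of an electronic edition of Huckleberry Finn.
    book = URIRef("http://example.org/texts/huckleberry-finn")

    # The statement: the creator (in the Dublin Core sense) of this
    # resource is Mark Twain.
    g.add((book, DC.creator, Literal("Mark Twain")))

    # Serialize the single triple as RDF/XML.
    print(g.serialize(format="xml"))

Because the triple names its subject with a pointer rather than with the
document itself, anyone else can make further statements about the same
resource, and the statements can later be merged.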
XTM
The XTM standard comes from an earlier, pre-web ISO standard for
organizing information. XTM is used in the creation of Topic Maps. Topic
Maps organize large sets of information and build structured semantic
link networks over the resources. These networks allow easy and selective
navigation to requested information. Using XTM, a knowledge manager can
specify a topic and then assign various associations to the other topics
within the map and resources outside it. Thus in XTM one might start with
the Topic "Mark Twain", associate it with another Topic "Samuel Clemens"
and another Topic "Huckleberry Finn" and have the Topic "Huckleberry Finn"
point to either a URL or, if it were in a library, a call number--anywhere
that this resource could be located. The association, too, can be defined
in various ways so that for example you can specify the exact nature of
the relationship between "Mark Twain" and "Samuel Clemens".
Although there is a degree of overlap between these two specifications,
they originate in two different information organizing practices. RDF
derives in part from book cataloging. With a unit of information "in hand"
one assigns various semantic tags so that, when it comes time to retrieve
the resource or collection of resources, the information seeker can locate
it using the various embedded or associated semantic hooks. XTM on the
other hand comes from the art of indexing. Indexing is an information
organizing technique in which the topics or "headings" or important elements
of "aboutness" are distilled from a larger "work". With indexing one moves
further into the conceptual realm creating networks of associations that
contextualize the whole. It may be possible for either of these technologies
to express either of these functions though the strengths and weaknesses
inherent in doing so have yet to be determined.
Machine to Human
The addition of semantic information to web resources would improve
information retrieval in ways yet unimagined. As Tim Bray said, search
engines "do the equivalent of going through the library, reading every
book, and allowing us to look things up based on the words in the text."
If more metadata were available, one would not have to rely, as when using
Google, one of the better search engines, on the popularity of a resource
as assurance of its relevance. Librarians, who often act as human
mediators between the esoteric eccentricities of structured information
and the often unformulated inquiries of information seekers, know that
information retrieval is often skewed and incomplete even when information
is organized well. When organized badly or not at all, the consequences
are pitiful.
Machine to Machine
Since all of this information is now digital, why have a human in the
middle at all? Why not let "agents" mediate between the information and
the inquiry, saving people the time and bother associated with navigating
the complexities of information space? Tim Berners-Lee, in his March 2001
Scientific American article "The Semantic Web", illustrated how agents
using semantic information could be used to conduct research into everyday
tasks such as investigating health care provider options, prescription
treatments, or available appointment times. Each of these tasks now is
usually conducted by a human researcher. If one is taking a trip, he/she
must investigate the best price for an airplane ticket (even though some
of this information is already collocated) and match the information
about available flights with available times from a personal calendar.
This sort of research is conducted daily and one takes for granted the
mental and representational systems needed to ask a question,
investigate an answer, pull like information together, select from it that
which is relevant to the inquiry and initiate another set of actions based
on this selection.
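As a toy illustration of the matching step, the Python fragment below
intersects invented flight departure times with invented free slots from a
personal calendar; the real difficulty, of course, lies in getting such data
into machine-readable, semantically labeled form in the first place.

    from datetime import datetime

    # Hypothetical flight options gathered by an agent.
    flights = {
        "Flight 101": datetime(2002, 3, 4, 9, 0),
        "Flight 205": datetime(2002, 3, 4, 14, 30),
    }

    # Hypothetical free slots read from a personal calendar.
    free_slots = [(datetime(2002, 3, 4, 13, 0), datetime(2002, 3, 4, 18, 0))]

    # Keep only the flights that depart during a free slot.
    bookable = [name for name, departs in flights.items()
                if any(start <= departs <= end for start, end in free_slots)]
    print(bookable)  # ['Flight 205']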
Artificial-intelligence researchers have been working on methods to automate
these kinds of processes for many years. They have developed several approaches
that may be applicable to the Semantic Web.
Ontologies
Ontology derives from the Greek for "the study of being". Reality may "be"
but our perception of it is controlled in part by the mental mappings
we construct to make sense of it. As the theory goes, if a model of these
conceptual frameworks can be represented, then they can be queried. Thus
Artificial Intelligence employs Ontologies to define for the computer
what is "real". Although ontologies often include the taxonomic hierarchies
of classes found in library classification systems such as the Library
of Congress Subject Headings or Dewey Decimal System, ontologies can also
include other terms to create abstract representations of the logics of
a system.
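A toy fragment, sketched in Python with invented class names, shows the kind
of taxonomic backbone an ontology supplies and how a program can traverse it.

    # Each entry asserts that the class on the left is a subclass of the
    # class on the right; the names are invented for illustration.
    subclass_of = {
        "Novel": "LiteraryWork",
        "LiteraryWork": "CreativeWork",
        "CreativeWork": "Thing",
    }

    def broader_classes(cls):
        """Walk up the taxonomy, yielding each broader class in turn."""
        while cls in subclass_of:
            cls = subclass_of[cls]
            yield cls

    print(list(broader_classes("Novel")))  # ['LiteraryWork', 'CreativeWork', 'Thing']

A fuller ontology would add, beyond the bare hierarchy, the properties and
relationships that express the logic of the domain being modeled.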
Inference Engines
Once ontologies have been established, programs can be written to apply
logical formulations to the represented world as defined by the ontology.
Inference Engines can be used to translate this "reasoning" in order to
"prove" or verify the validity and quality of the information it is
retrieving. Inference Engines can also automate the discovery of web
services.
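As a minimal sketch of what applying such logical formulations can look like,
the Python fragment below forward-chains over a handful of invented facts and
if-then rules, deriving new facts until nothing further follows.

    # Facts and rules are stated as (subject, predicate, object) triples;
    # "?x" in a rule is a variable bound to the subject of a matching fact.
    facts = {("HuckleberryFinn", "type", "Novel")}
    rules = [
        (("?x", "type", "Novel"), ("?x", "type", "LiteraryWork")),
        (("?x", "type", "LiteraryWork"), ("?x", "type", "CreativeWork")),
    ]

    changed = True
    while changed:                        # repeat until a full pass adds nothing new
        changed = False
        for (ps, pp, po), (cs, cp, co) in rules:
            for (fs, fp, fo) in list(facts):
                if fp == pp and fo == po:         # the fact matches the rule's pattern
                    conclusion = (fs, cp, co)     # bind ?x to the fact's subject
                    if conclusion not in facts:
                        facts.add(conclusion)
                        changed = True

    print(sorted(facts))

Given the assertions that every Novel is a LiteraryWork and every LiteraryWork
a CreativeWork, the sketch concludes that Huckleberry Finn is both, the kind
of derived, checkable statement described above.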
Neural Networks
Research in Artificial Intelligence is driven by the desire to infuse
computers with intelligence. As Science attempts to model the human mind,
one moves further from information organization towards information
processing. How exactly the human mind constructs and then makes sense of
the world it has created is a complex question with no clearly defined
answers. As the Russian/American novelist Vladimir Nabokov said "The
difference between an ape's memory and human memory is the difference
between an ampersand and the British Museum library....The stupidest
person in the world is an all-round genius compared to the cleverest
computer. How we learn to imagine and express things is a riddle with
premises impossible to express and a solution impossible to imagine."
Scientists will continue, however, to try, and we do not yet know what the
results of their efforts will be.
Electronic Data-Interchange and E-Commerce
Automated information processing, of course, holds great commercial
promise. Most people would love to be spared the frustration of having to
understand how information is organized in order to effectively find what
they are looking for. Traversing the labyrinths of information space is
often exasperating and time consuming. It is common in libraries to hear
some peeved user state "this library stinks" simply because the inquiry
resulted in an answer more complicated than he/she believed necessary.
Simple searches on the web often turn up exactly what one is looking for
but who has not spent too many hours scanning long results lists in the
retreating hope of finding what one is really looking for? Any company
capable of easing this pain stands to make a fortune.
Challenges for the semantic web
To begin at the beginning, metadata must become a part of the web environment.
Despite the efforts of the World Wide Web Consortium (W3C) and other pioneering
thinkers, it is very difficult to communicate the importance of this proposition
to a general public that has long taken for granted the information-organizing
structures already in place. On top of that, for many users, the "keyword
search" is sufficient; the retrieval of any information rather
than a complete set of information on a given topic is preferable
if it means less frustration or time spent on the problem. The implications
for this "fast food" approach to inquiry are beyond the scope of this
paper. Convincing those in charge of the millions of web pages that make
up the current web that time and resources spent on adding semantic information
to those pages is a valuable endeavor is a conundrum; the value becomes
visible only after a sufficient quantity is available. For this reason it
may be that semantic web technologies will first take hold inside "semantic
islands" (communities of practice, knowledge webs, enterprise or industry
intranet portals, and the like), since semantic web technologies have to
rest on shared ontologies, and it is unrealistic to expect ontologies shared
by all Web resources and users.
Likewise, there are very few mainstream tools with which web editors
can add metadata, RDF or Topic Maps to their websites. There are few work
processes in place. Many corporations have libraries, but libraries are not
always associated with web development. Information management has been,
for the most part, underappreciated; its value becomes apparent only when, as now,
society generates more data than it can keep track of.
Future directions
There are many diverse interests that converge on the Web and the Semantic
Web can mean many things to different people. The library perspective
is a unique one. Librarians wrestle with the tension between the right
of artists, writers and creators of all communications to control the
use of their works and the right of the public to use those works for
scholarship, edification and entertainment, especially those without the
resources to access these documents directly. Librarians care not just
whether people get information but whether they get the right information;
accurate, appropriate, unbiased. Librarians insist that information from
all viewpoints be collected, that no one single interest or perspective
has priority over another. These concerns, while not of primary importance
to many of the "stakeholders" of the web, will become increasingly
important to society as the web evolves. The traditional library is already
a very different place than it was a decade ago. As each of the communities
whose current work can or will be extended by the evolution of the web
converges, it is inevitable that the web as we know it will transform into
an information space very different from the one we currently inhabit.
Which will it more resemble: a well organized, efficiently and humanely
run library or Borges' Library of Babel?