Introduction to the Semantic Web: Organizing the Web for Better Information Retrieval

or

Can Librarians really save the Web?


Paper presented at Knowledge Technologies 2002 - Seattle, March 2002
by Suellen Stringer-Hye, Vanderbilt University

Abstract

Libraries have long been storing and classifying the record of the world's shared knowledge for efficient retrieval. The web, in many ways, has become an extension of the library, but without the structures that make optimized retrieval possible. Additionally, it is not possible to employ traditional methods of classifying knowledge on the large sets of data and information now being generated electronically. The World Wide Web, in order to be truly useful, must adapt techniques used by libraries for centuries as well as rely on new technologies not yet fully developed. This paper presents an overview of the range of technological solutions currently under consideration for building the "Semantic Web": a web maximized for information and knowledge storage and retrieval.

Introduction

One of the most famous libraries in the world is the Library of Babel. This library is imaginary, invented by Jorge Luis Borges in his story of the same name. The Library of Babel is enormous, for it contains all of the world's knowledge, but there are few librarians and no order to its contents. To find anything one must wander the stacks, knowing that what is sought exists, but not knowing in which of the infinite hexagons that store all this knowledge it might be found. The Library of Babel contains all that ever was or ever will be, but there is no way, to use a bit of library jargon, to "access" this information. Although the story was written in 1941, the resemblance of the Web to the Library of Babel is easy to see. Borges notes that "[w]hen it was announced that the Library contained all books, the first reaction was unbounded joy," but adds later, "[t]hat unbridled hopefulness was succeeded, naturally enough, by a similarly disproportionate depression." Librarianship as a profession exists to connect people with information, knowledge and ideas. Librarians are uniquely situated; as Bonnie Nardi points out in her book Information Ecologies, they are a keystone species, crucial to the system of people, practices, values and technologies that characterizes a living information space. Over the years many tools have been developed for linking people and their inquiries with the information that satisfies them. But neither is the web exactly like a library. As Ramonet observes in La Tyrannie de la communication (Paris: Éditions Galilée, 1999), we are experiencing a "prolifération d'une information de plus en plus diffusée, et de moins en moins contrôlée" ("proliferation of information in a form which is more and more diffuse and less and less subject to control"). While much can be learned from the historical experience of libraries and of library and information science, traditional approaches to information management will need to evolve, both in the context of the library and of the web, in order to reflect this new reality.

What is Information?

Halfway through the last century, information became a thing. In 1948 Claude Shannon published "A Mathematical Theory of Communication" in the Bell System Technical Journal. From that moment forward, information became a commodity, a quantity to be measured and analyzed. The meaning of the content of the information played no role in this new technical sense. The chaotic systems, strings of random numbers, bits and bytes running through wires, though dense with information, were meaningless. This definition of information fueled the modern industries of information and communication delivery. Information may also be conceived, however, as "something that is communicated" where communication is "a personal, intentional, transactional, symbolic process for the purpose of exchanging ideas, feelings, etc.". It is with this second definition that this paper will be concerned, for libraries have been managing this sort of information long before it traveled by wire, in circuits or found a home in a database somewhere.

Of this kind of information there are two types: that which can be retrieved and that which cannot. Most of what is and has been communicated by humans over the ages can never be recovered. It consists of the infinity of unrecorded conversations, discarded notes and jottings, telephone calls and destroyed documents that have melted away like the snows. What remains are the few mounds from which we may attempt to extract and make sense of the utterances of our ancestors. Though the pace, quantity and methods of accumulation have perhaps accelerated, the process remains the same: humans communicate and sometimes record, store and eventually preserve that communication. Once it has been recorded it can be recovered. For most of "recorded" history, communications were in the form of written documents and material objects. Only in the last 200 years have we seen the proliferation of non-text materials such as sound recordings, films, videos and photographs. Much of this material can be turned into bits and bytes, blurring both the distinctions between formats and the primacy of text-based information. We now live in a world in which most recorded information can be, if not always is, digitized.

The Web and the Library

When Tim Berners-Lee first proposed the concepts that later became the World Wide Web, he was concerned with information loss. Knowledge at the CERN laboratories in Switzerland, where he was employed, was shared through conversation and social interaction, but there was no means by which this information, even if it had been recorded, could be retrieved, especially in a manner that mimicked the flexibility and dynamism of an open organization such as CERN. When the dream of a "universal linked information space" became a reality larger than perhaps even he imagined, libraries encountered the first real competition they had ever faced. People now had access to "the world of information", or rather "a world of information", without needing the institution of the library to collect it for them. In 2000, the Urban Libraries Council conducted a consumer market analysis of library and web services. It determined that users (or patrons, as they are called in the library community) found the web more private, more entertaining, easier to use, and with more current information available at more times than the library. Efficient search engines employing sophisticated algorithms and natural language processing have delivered low-cost, high-quality information retrieval for many information needs.

Unfortunately, the vast amount of information now available is beginning to overwhelm the abilities of search engines. In the same Urban Libraries Council study, users reported that they could count on the library for the accuracy of the information and for the help they received in finding it. In spite of the enormous revolutions in unstructured information retrieval, the real potential of the web can only be realized when, like the information in libraries, it is organized in some meaningful way.


Principles of Information Organization


Elaine Svenonius, in her recent book The Intellectual Foundation of Information Organization, states that the purpose of information organization is "to bring essentially like information together and to differentiate what is not exactly alike". Before databases and the innovations in keyword searching, this was done through classification mechanisms such as controlled vocabularies, uniform headings, taxonomies and classification ontologies: the association of meaningful tags and semantic interconnections with information units for the purposes of finding, identifying, selecting or obtaining the document at some later time. These techniques have been employed, with increasing sophistication, from the days of the Roman scroll to the current highly automated and digitized library catalog environment. When one finds what one is looking for in a library, it is because of the accumulated semantic associations that these materials have "invisibly" accreted over the years.


Why can't we just catalog the web?

The process by which this semantic information is associated with a document is called, in libraries, "cataloging". A unit of information is assigned a number, often called a "call number", and information such as the book's title, author and subject matter is associated with it. Today we call this kind of information "metadata", i.e. data about data. Metadata is then used to locate information relevant to an inquiry, so that the person asking the question does not have to wander the stacks as in Borges' Library of Babel. Since the web is now something of a vast library, why not just catalog the items in it?
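To make "data about data" concrete, the following is a minimal sketch (in Python, with purely illustrative field names and values) of the kind of record a cataloger attaches to an information unit, and of how retrieval then works against the metadata rather than against the full text.

```python
# A hypothetical, minimal "catalog record": metadata describing a work,
# kept separate from the content of the work itself.
catalog_record = {
    "call_number": "PS1305 .A1",   # shelf location (illustrative)
    "title": "Adventures of Huckleberry Finn",
    "author": "Twain, Mark",
    "subjects": ["Mississippi River -- Fiction", "Boys -- Fiction"],
}

def find_by_subject(records, term):
    """Return records whose subject headings mention the search term."""
    return [r for r in records
            if any(term.lower() in s.lower() for s in r["subjects"])]

# Retrieval consults the metadata, not the text of the book:
print(find_by_subject([catalog_record], "Mississippi"))
```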

For one thing, the proliferation of information and information formats has become so profuse that the fine craft of "cataloging" is too labor-intensive, too time-consuming and therefore too expensive to apply to all the resources that appear on the web. Secondly, when information is digital, information units such as "book", "article" or "chapter" may be broken down into even smaller units such as "paragraph", "sentence" or even "letter" if there were some meaningful reason to do so, thus creating even more data in need of "cataloging". And worst of all, as a colleague who used to work at dmoz.org put it, "books do not change their content, whereas web pages do it all the time. You can't make the web resources stand still on the shelves...."

Additionally, libraries, which have successfully employed shared classification schemes and taxonomies with standardized rules for applying them, are no longer the only organizers of the shared information universe. Anyone in possession of data now must manage its storage, preservation and content if it is to be available for future use. Organizational schemes that work well for managing bibliographic information may not be valuable for information generated by business processes, policing and security agencies, or any other organization generating meaningful information. There must be some way of integrating the multiplicity of organizational schemes currently in use or in development.

In Tim Berners-Lee's original proposal to CERN he noted that computer files were managed in hierarchical structures, but that this did not accurately reflect the way human minds process information. Through hyperlink technology he hoped to better reflect the way in which people think, creating bridges over information ravines between one hierarchical classification scheme and another. Webbed structures rather than tree structures, he thought, more closely modeled the real world.

New Techniques

Although Berners-Lee's idea quickly created a new shared information space, it was missing a critical layer of his original design. To address this problem he distinguished between the current web, which is like Borges' Library of Babel except that there are robots to do the searching, and the "Semantic Web". The Semantic Web is a web in which declarations are made about units of information so that, like the information in a library, it can be brought together, differentiated or manipulated in various useful ways. Once this layer of information is available, the potential for transformation is as explosive as that of the original web. Although XML is essential as an exchange format, XML alone is not enough to ensure semantic interoperability on the web. Several technologies are being developed that could help bring this new semantic layer to the web.

Metadata

The Dublin Core Metadata Standard is a simple set of fifteen essential elements necessary to describe a wide range of networked resources. Because the Dublin Core Metadata Element Set (DCMES) is derived in part from the MARC record, which libraries have used to encode semantic information for machine retrieval for many years, it is a very practical way for non-librarians and other information managers to indicate "aboutness" information in documents. The Dublin Core standard is not the only available metadata standard. Standards exist in many fields for which the Dublin Core is not applicable. Examples of these standards include the Geospatial Metadata Standard, HTML META tags, MARC (the metadata standard used in libraries) and PRISM, a metadata standard for syndicating, aggregating, post-processing and multi-purposing content from magazines, news, catalogs, books and mainstream journals.
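As a concrete illustration, the sketch below (Python; the element values are hypothetical) emits a few Dublin Core elements in one of their common carriers, HTML META tags. It is meant only to show how a handful of DCMES elements attach "aboutness" to a page, not to prescribe any particular tool or workflow.

```python
# A minimal sketch: embedding Dublin Core elements as HTML <meta> tags.
# The values below are illustrative, not drawn from any actual record.
dublin_core = {
    "DC.title": "Adventures of Huckleberry Finn",
    "DC.creator": "Twain, Mark",
    "DC.subject": "Mississippi River -- Fiction",
    "DC.date": "1884",
    "DC.format": "text/html",
}

# Turn each element into a <meta> tag suitable for a page's <head> section.
meta_tags = "\n".join(
    f'<meta name="{name}" content="{value}">' for name, value in dublin_core.items()
)
print(meta_tags)
```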

Resource Description Framework

The Resource Description Framework, or RDF, is a W3C specification used to assign metadata to electronic resources. A resource, or document, as defined in the Resource Description Framework is anything that can have an addressable location. This location information is like the call number of a book in the virtual library of the web. RDF then lets you assign a property to this resource, such as "author", "title" or "publisher", as defined by Dublin Core or some other set of conventions. RDF then creates a statement: the author (as defined by, say, Dublin Core) of Huckleberry Finn (as defined by a URL indicating the location of an electronic version of this work) is Mark Twain (as defined by the Library of Congress personal name file, which associates Mark Twain with Samuel Clemens). The original architecture of the present web sacrificed link integrity for scalability, transforming it into a new kind of information space. In order for the Semantic Web to be equally successful, it must allow anybody to say anything about anything. The Resource Description Framework allows any statement to be made, as long as there are pointers to who is saying so and in what context.
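The Huckleberry Finn statement can be sketched as an RDF triple. The example below assumes the rdflib Python library and uses hypothetical example.org URIs as stand-ins for the electronic text and for a name-authority entry; it shows only the subject-predicate-object shape of an RDF statement.

```python
# A sketch of the paper's example statement as an RDF triple, using rdflib.
# Both URIs are hypothetical stand-ins: one for an electronic copy of the work,
# one for a name-authority entry identifying Mark Twain / Samuel Clemens.
from rdflib import Graph, URIRef
from rdflib.namespace import DC

work = URIRef("http://example.org/texts/huckleberry-finn")    # the resource
author = URIRef("http://example.org/authorities/mark-twain")  # the person

g = Graph()
g.add((work, DC.creator, author))  # "The author of Huckleberry Finn is Mark Twain"

# Every RDF statement is a (subject, predicate, object) triple:
for subject, predicate, obj in g:
    print(subject, predicate, obj)
```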

XTM

The XTM standard derives from an earlier, pre-web ISO standard for organizing information. XTM is used in the creation of Topic Maps. Topic Maps organize large sets of information and build structured semantic link networks over the resources. These networks allow easy and selective navigation to requested information. Using XTM, a knowledge manager can specify a topic and then assign various associations to other topics within the map and to resources outside it. Thus in XTM one might start with the topic "Mark Twain", associate it with another topic "Samuel Clemens" and another topic "Huckleberry Finn", and have the topic "Huckleberry Finn" point to either a URL or, if it were in a library, a call number: anywhere that this resource could be located. The associations too can be defined in various ways, so that, for example, one can specify the exact nature of the relationship between "Mark Twain" and "Samuel Clemens".
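The topic/association/occurrence structure can be suggested, very roughly, with plain Python data structures. The sketch below is illustrative only; it is not XTM syntax, which is an XML format defined by the standard itself, and all names and addresses in it are hypothetical.

```python
# A rough sketch of the topic-map idea: topics, associations between them,
# and occurrences pointing out to where resources actually live.
topics = {
    "mark-twain": {"name": "Mark Twain"},
    "samuel-clemens": {"name": "Samuel Clemens"},
    "huckleberry-finn": {
        "name": "Adventures of Huckleberry Finn",
        "occurrences": ["http://example.org/texts/huckleberry-finn"],  # hypothetical
    },
}

associations = [
    # The association type names the exact nature of each relationship.
    {"type": "pseudonym-of", "members": ["mark-twain", "samuel-clemens"]},
    {"type": "author-of", "members": ["mark-twain", "huckleberry-finn"]},
]

def related_topics(topic_id):
    """Navigate the map: return (relationship, topic) pairs for a given topic."""
    return [
        (a["type"], m)
        for a in associations
        if topic_id in a["members"]
        for m in a["members"]
        if m != topic_id
    ]

print(related_topics("mark-twain"))
```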

Although there is a degree of overlap between these two specifications, they originate in two different information organizing practices. RDF derives in part from book cataloging: with a unit of information "in hand", one assigns various semantic tags so that, when it comes time to retrieve the resource or collection of resources, the information seeker can locate it using the various embedded or associated semantic hooks. XTM, on the other hand, comes from the art of indexing. Indexing is an information organizing technique in which the topics, "headings" or important elements of "aboutness" are distilled from a larger "work". With indexing one moves further into the conceptual realm, creating networks of associations that contextualize the whole. Either technology may be able to express either of these functions, though the strengths and weaknesses inherent in doing so have yet to be determined.

Machine to Human

The addition of semantic information to web resources would improve information retrieval in ways yet unimagined. As Tim Bray said, search engines "do the equivalent of going through the library, reading every book, and allowing us to look things up based on the words in the text." If more metadata were available, one would not, as with Google, one of the better search engines, have to rely on the popularity of a resource as assurance of its relevance. Librarians, who often act as human mediators between the esoteric eccentricities of structured information and the often unformulated inquiry of the information seeker, know that information retrieval is often skewed and incomplete even when information is well organized. When it is organized badly or not at all, the consequences are pitiful.


Machine to Machine

Since all of this information is now digital, why have a human in the middle at all? Why not let "agents" mediate between the information and the inquiry, saving people the time and bother associated with navigating the complexities of information space? Tim Berners-Lee, in his 2001 Scientific American article "The Semantic Web", illustrated how agents using semantic information could conduct research into everyday tasks such as investigating health care provider options, prescription treatments, or available appointment times. Each of these tasks is now usually conducted by a human researcher. If one is taking a trip, one must investigate the best price for an airplane ticket (even though some of this information is already collocated) and match the information about available flights with available times from a personal calendar. This sort of research is conducted daily, and one takes for granted the mental and representational systems needed to ask a question, investigate an answer, pull like information together, select from it that which is relevant to the inquiry, and initiate another set of actions based on this selection.

Artificial-intelligence researchers have been working on methods to automate these kinds of processes for many years. They have developed several approaches that may be applicable to the Semantic Web.

Ontologies

Ontology is a term derived from Greek meaning "the study of being". Reality may "be", but our perception of it is shaped in part by the mental mappings we construct to make sense of it. As the theory goes, if a model of these conceptual frameworks can be represented, then it can be queried. Thus artificial intelligence employs ontologies to define for the computer what is "real". Although ontologies often include the taxonomic hierarchies of classes found in library classification systems such as the Library of Congress Subject Headings or the Dewey Decimal System, ontologies can also include other terms and relations to create abstract representations of the logic of a system.
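A toy sketch may help here: the Python fragment below (all class and relation names are hypothetical) represents a tiny taxonomy plus one non-hierarchical relation, the sort of structure that, once represented, can be queried by a machine.

```python
# A toy ontology sketch: a small class taxonomy, some instances, and one
# relation ("wrote") that goes beyond the hierarchy itself. All names are
# illustrative, not drawn from any published ontology.
subclass_of = {
    "Novel": "LiteraryWork",
    "LiteraryWork": "Document",
    "Author": "Person",
}

instance_of = {"Adventures of Huckleberry Finn": "Novel", "Mark Twain": "Author"}
wrote = {"Mark Twain": ["Adventures of Huckleberry Finn"]}

def is_a(cls, ancestor):
    """Walk the taxonomy: is `cls` a (transitive) subclass of `ancestor`?"""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = subclass_of.get(cls)
    return False

# The machine can now answer: is Huckleberry Finn a Document?
print(is_a(instance_of["Adventures of Huckleberry Finn"], "Document"))  # True
```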

Inference Engines

Once ontologies have been established, programs can be written to apply logical formulations to the represented world as defined by the ontology. Inference engines can apply this "reasoning" in order to "prove" or verify the validity and quality of the information being retrieved. Inference engines can also automate the discovery of web services.
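As a rough illustration of the idea, the sketch below (Python; the facts and the single rule are hypothetical) forward-chains over asserted triples to derive a statement that was never explicitly asserted, which is the essence of what an inference engine does at much larger scale.

```python
# A minimal forward-chaining sketch. Starting facts are hypothetical triples;
# the engine repeatedly applies one rule until no new statements appear.
facts = {
    ("Mark Twain", "wrote", "Adventures of Huckleberry Finn"),
    ("Adventures of Huckleberry Finn", "is-a", "Novel"),
}

def infer(asserted):
    """Rule: if X wrote Y and Y is-a Novel, then X is-a Novelist."""
    derived = set(asserted)
    changed = True
    while changed:
        changed = False
        for (x, predicate, y) in list(derived):
            if predicate == "wrote" and (y, "is-a", "Novel") in derived:
                new_fact = (x, "is-a", "Novelist")
                if new_fact not in derived:
                    derived.add(new_fact)
                    changed = True
    return derived

# Print only the statements the engine derived, not those asserted by hand.
for triple in sorted(infer(facts) - facts):
    print(triple)  # ('Mark Twain', 'is-a', 'Novelist')
```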

Neural Networks

Research in Artificial Intelligence is driven by the desire to infuse computers with intelligence. As Science attempts to model the human mind, one moves further from information organization towards information processing. How exactly the human mind constructs and then makes sense of the world it has created is a complex question with no clearly defined answers. As the Russian/American novelist Vladimir Nabokov said "The difference between an ape's memory and human memory is the difference between an ampersand and the British Museum library....The stupidest person in the world is an all-round genius compared to the cleverest computer. How we learn to imagine and express things is a riddle with premises impossible to express and a solution impossible to imagine." Scientists will continue, however, to try, and we do not yet know what the results of their efforts will be.

Electronic Data-Interchange and E-Commerce

Automated information processing of course, holds great commercial promise. Most people would love to be spared the frustration of having to understand how information is organized in order to effectively find what they are looking for. Traversing the labyrinths of information space is often exasperating and time consuming. It is common in libraries to hear some peeved user state "this library stinks" simply because the inquiry resulted in an answer more complicated than he/she believed necessary. Simple searches on the web often turn up exactly what one is looking for but who has not spent too many hours scanning long results lists in the retreating hope of finding what one is really looking for? Any company capable of easing this pain stands to make a fortune.

Challenges for the semantic web

To begin at the beginning, metadata must become a part of the web environment. Despite the efforts of the W3 Consortium and other pioneering thinkers, it is very difficult to communicate the importance of this proposition to a general public that has long taken for granted the information organizing structures already in place. On top of that, for many users the "keyword search" is sufficient; the retrieval of any information, rather than a complete set of information on a given topic, is preferable if it means less frustration or time spent on the problem. The implications of this "fast food" approach to inquiry are beyond the scope of this paper. Convincing those in charge of the millions of pages that make up the current web that time and resources spent adding semantic information to those pages is a valuable endeavor is a conundrum: the value only becomes visible once a sufficient quantity is already available. For this reason it may be that semantic web technologies will first apply inside "semantic islands" (community-of-practice or knowledge webs, enterprise or industry intranet portals, and so on), since semantic web technologies have to sit on shared ontologies, and it does not make sense to expect ontologies shared by all web resources and users.

Likewise, there are very few mainstream tools with which web editors can add metadata, RDF or Topic Maps to their websites, and few work processes in place. Many corporations have libraries, but libraries are not always associated with web development. Information management has, for the most part, been underappreciated; its value becomes apparent only when, as now, society generates more data than it can keep track of.

Future directions

Many diverse interests converge on the Web, and the Semantic Web can mean many things to different people. The library perspective is a unique one. Librarians wrestle with the tension between the right of artists, writers and creators of all communications to control the use of their works and the right of the public, especially those without the resources to access these documents directly, to use those works for scholarship, edification and entertainment. Librarians care not just whether people get information but whether they get the right information: accurate, appropriate, unbiased. Librarians insist that information from all viewpoints be collected, and that no single interest or perspective be given priority over another. These concerns, while not of primary importance to many of the "stakeholders" of the web, will become increasingly important to society as the web evolves. The traditional library is already a very different place than it was a decade ago. As each of the communities whose current work can or will be extended by the evolution of the web converges, it is inevitable that the web as we know it will transform into an information space very different from the one we currently inhabit. Which will it more resemble: a well organized, efficiently and humanely run library, or Borges' Library of Babel?