Introduction to the Semantic Web: Organizing the Web for Better Information Retrieval
or
Can Librarians really save the Web?

Paper presented at Knowledge Technologies 2002, Seattle, March 2002, by Suellen Stringer-Hye, Vanderbilt University
      Abstract 
      
      Libraries have long been storing and classifying the record 
        of the world's shared knowledge for efficient retrieval. The web, in many 
        ways, has become an extension of the library but without the structures 
        that make optimized retrieval possible. Additionally it is not possible 
        to employ traditional methods of classifying knowledge to the large sets 
        of data and information now being generated electronically. The World 
        Wide Web, in order to be truly useful, must adapt techniques used by libraries 
        for centuries as well as rely on new technologies not yet fully developed. 
        This paper will present an overview of the range of technological solutions 
        currently under consideration for the building of the "Semantic Web"--- 
        a web maximized for information and knowledge storage and retrieval.  
      Introduction
      One of the most famous libraries in the world is the Library of Babel. 
        This library is imaginary, invented by Jorge Luis Borges in his story 
        of the same name. The library of Babel is enormous as it contains all 
        of the world's knowledge but there are few librarians and no order to 
its contents. To find anything, one must wander the stacks, knowing that
what is sought exists but not knowing in which of the infinite hexagons
it is stored. The Library of Babel contains all that ever was or ever
will be, but there is no way, to use a bit of library jargon, to "access"
this information. Although the story was written in 1941, one can easily
see the resemblance of the Web to the Library of Babel. Borges notes
"When it was announced that the Library contained
        all books, the first reaction was unbounded joy . . .” but adds later, 
        “[t]hat unbridled hopefulness was succeeded, naturally enough, by a similarly 
        disproportionate depression.” Librarianship as a profession exists to 
        connect people with information, knowledge and ideas. Uniquely situated, 
        as Bonnie Nardi points out in her book Information Ecologies, librarians 
        are a keystone species, crucial to the system of people, practices, 
        values and technologies that characterize a living information space. 
Over the years, many tools have been developed for linking people and their
inquiries with the information that satisfies them.
But the web is not exactly like a library either. As Ramonet stated in
‘La Tyrannie de la communication’ (Paris: Éditions Galilée,
1999), we are experiencing ‘prolifération d’une information de
        plus en plus diffusée, et de moins en moins contrôlée’ 
        (‘proliferation of information in a form which is more and more diffuse 
        and less and less subject to control’). While much can be learned from 
        the historical experience of Libraries and Library and Information Science, 
        traditional approaches to information management will need to evolve both 
        in the context of the library and the web in order to reflect this new 
        reality.  
      What is Information? 
      Halfway through the last century, information became a thing. In 1948 
        Claude Shannon published "A Mathematical Theory of Communication" in the 
        Bell System Technical Journal. From that moment forward, information 
        became a commodity, a quantity to be measured and analyzed. The meaning 
        of the content of the information played no role in this new technical 
sense. Chaotic systems, strings of random numbers, bits and bytes
running through wires, though dense with information, were meaningless.
        This definition of information fueled the modern industries of information 
        and communication delivery. Information may also be conceived, however, 
        as "something that is communicated" where communication is "a personal, 
        intentional, transactional, symbolic process for the purpose of exchanging 
        ideas, feelings, etc.". It is with this second definition that this paper 
        will be concerned, for libraries have been managing this sort of information 
        long before it traveled by wire, in circuits or found a home in a database 
        somewhere. 
         
        Of this kind of information there are two types. That which can be retrieved 
        and that which cannot. Most of what is and has been communicated by humans 
over the ages can never be recovered. It consists of the infinity of unrecorded
        conversations, discarded notes and jottings, telephone calls and destroyed 
        documents that have melted away like the snows. What remains are the few 
        mounds from which we may attempt to extract and make sense of the utterances 
        of our ancestors. Though the pace, quantity and methodology for accumulation 
        has perhaps accelerated, the process remains the same---humans communicate 
        and sometimes record, store and eventually preserve that communication. 
        Once it has been recorded it can be recovered. For most of "recorded" 
        history, communications were in the form of written documents and material 
        objects. Only in the last 200 years have we seen the proliferation of 
        non-text materials such as sound recordings, films, videos and photographs. 
Much of this material can be turned into bits and bytes, blurring the
distinction between text and non-text materials and diminishing the primacy of text-based information. We now
        live in a world in which most recorded information can be, if not always 
        is, digitized.  
       
      The Web and the Library 
When Tim Berners-Lee first proposed the concepts that later became the
        World Wide Web, he was concerned with information loss. Knowledge at the 
        CERN laboratories in Switzerland, where he was employed, was shared through 
        conversation and social interaction but there was no means by which this 
        information, even if it had been recorded, could be retrieved, especially 
        in a manner that mimicked the flexibility and dynamism of an open organization 
        such as CERN. When the dream of a "universal linked information space" 
        became a reality larger than perhaps even he imagined, libraries encountered 
        the first real competition that they had ever faced. People now had access 
        to "the world of information" or rather "a world of information" without 
        needing the institution of the library to collect it for them. In 2000, 
the Urban Library Council conducted a consumer market analysis of library
        and web services. They determined that users (or patrons as they are called 
        in the library community) found the web more private, more entertaining, 
        easier to use, with more current information available at more times than 
        the library. Efficient search engines employing sophisticated algorithms 
        and natural language processing have delivered low cost, high quality 
        information retrieval for many information needs.  
       
      Unfortunately, the vast amount of information now available is 
beginning to overwhelm the capabilities of search engines. In the same Urban Library
      Council Study, users reported that they could count on the library for the 
      accuracy of the information and the help they received in finding it. In 
spite of the enormous revolutions in unstructured information retrieval,
the real potential of the web can only be realized when, like the
      information in libraries, it is organized in some meaningful way.  
       Principles of Information Organization
       Elaine Svenonius in her recent book The Intellectual Foundations 
      of Information Organization states that the purpose of information 
      organization is "to bring essentially like information together and to 
      differentiate what is not exactly alike". Before databases and the 
      innovations in keyword searching, this was done through classification 
      mechanisms such as controlled vocabularies, uniform headings, taxonomies 
      and classification ontologies; the association of meaningful tags and 
      semantic interconnections to information units for the purposes of 
      finding, identifying, selecting or obtaining the document at some later 
      time. With increasing sophistication these techniques have been employed 
      since the days of the Roman scroll to the current highly automated and 
      digitized library catalog environment. When one finds what one is looking 
      for in a library, it is because of the accumulated semantic associations 
      that these materials have "invisibly" accreted over the years.  
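To make the principle concrete, consider the uniform heading, one of the oldest of these devices: variant forms of a name are mapped to a single authorized form so that a search under any of them brings the same materials together. The short Python sketch below is purely illustrative; the authority entries, call number and function name are invented for the example and are not drawn from any actual library system.

    # Illustrative only: a tiny "authority file" mapping variant name forms
    # to one uniform heading, so that like information is brought together.
    authority_file = {
        "Mark Twain": "Twain, Mark, 1835-1910",
        "Samuel Clemens": "Twain, Mark, 1835-1910",
        "Samuel Langhorne Clemens": "Twain, Mark, 1835-1910",
    }

    # A catalog record keyed by the uniform heading (invented call number).
    catalog = {
        "Twain, Mark, 1835-1910": ["PS1305 .A1: Adventures of Huckleberry Finn"],
    }

    def lookup(name):
        """Resolve whatever form the inquirer uses to the uniform heading."""
        heading = authority_file.get(name, name)
        return catalog.get(heading, [])

    print(lookup("Samuel Clemens"))  # retrieves the same records as "Mark Twain"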
       Why can't we just catalog the web?
      The process by which this semantic information is associated with a document 
        is called in libraries "cataloging". A unit of information is assigned 
        a number, often called a "call number" and information such as the book's 
        title, author and subject matter is associated with it. Today we call 
this kind of information "metadata", i.e. data about data. Metadata is
        then used to locate information relevant to the inquiry so that the person 
asking the question does not have to wander the library stacks as in Borges'
Library of Babel. Since the web is now something of a vast library, why
        not just catalog the items in it?  
      For one thing, the proliferation of information and information formats 
        has become so profuse that the fine craft of "cataloging" is too labor 
        intensive and too time consuming and therefore too expensive to apply 
        to all resources that appear on the web. Secondly, when information is 
        digital, information units such as "book", "article" or "chapter" may 
        be broken down into even smaller units such as "paragraph", "sentence" 
        or even "letter" if there were some meaningful reason to do so, thus creating 
        even more data in need of "cataloging". And worst of all, as a colleague 
        who used to work at dmoz.org said "books do not change their content, 
        whereas web pages do it all the time. You can't make the web resources 
        stand still on the shelves...."  
       
      Additionally, libraries, which have successfully employed shared classification 
        schemes and taxonomies with standardized rules for applying them, are 
        no longer the only organizers of the shared information universe. Anyone 
        in possession of data now must manage its storage, preservation and content 
        if it is to be available for future use. Organizational schemes that work 
        well for managing bibliographic information may not be valuable for information 
        generated by business processes, policing and security agencies, or any 
        other organization generating meaningful information. There must be some 
        way of integrating the multiplicity of organizational schemes currently 
        in use or in development.  
       
In Tim Berners-Lee's original proposal to CERN, he mentioned that computer
        files were managed in hierarchical structures but this did not accurately 
        reflect the way that human minds process information. Through hyperlink 
        technology he was hoping to better reflect the way in which people think, 
        creating bridges over information ravines between one hierarchical classification 
        scheme and another.  
        Webbed structures rather than tree structures, he thought, more closely 
        modeled the real world.  
      New Techniques
Although Berners-Lee's idea quickly created a new shared information
space, it was missing a critical layer of his original design. To address
this problem he distinguished between the current web, which is like Borges'
Library of Babel except that there are robots to do the searching, and
the "Semantic Web". The Semantic Web is a web in which declarations are
made about units of information so that, like the information in a library,
it can be brought together, differentiated or manipulated in various
useful ways. Once this layer of information is available, the potential
for transformation is as explosive as was that of the original web. Although
XML is essential as an exchange format, XML alone is not enough to ensure
        semantic interoperability on the web. Several technologies are being developed 
        that could help bring this new semantic layer to the web.  
      Metadata
The Dublin Core Metadata Standard is a simple set of fifteen essential
elements necessary to describe a wide range of networked resources. Because
the Dublin Core Metadata Element Set (DCMES) is derived in part from the
MARC record, which libraries have used for many years to encode semantic
information for machine retrieval, it is a very practical way for non-librarians
or other information managers to indicate "aboutness" information in documents.
The Dublin Core standard is not the only available metadata standard.
Standards exist in many fields for which the Dublin Core is not applicable.
Examples include the Geospatial Metadata Standard, HTML META tags, MARC
(the metadata standard used in libraries) and PRISM, a metadata standard
for syndicating, aggregating, post-processing and multi-purposing content
from magazines, news, catalogs, books and mainstream journals.
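As a rough illustration of how simple the element set is in practice, the sketch below renders a handful of Dublin Core elements as HTML META tags of the kind mentioned above. The record values and the helper function are hypothetical, offered only to show the shape of the markup, not as a complete or authoritative encoding.

    # A minimal sketch: a few Dublin Core elements rendered as HTML META tags.
    # The record below is invented for illustration.
    record = {
        "DC.title": "Adventures of Huckleberry Finn",
        "DC.creator": "Twain, Mark",
        "DC.subject": "Mississippi River -- Fiction",
        "DC.type": "Text",
        "DC.language": "en",
    }

    def to_meta_tags(record):
        """Render a dictionary of Dublin Core elements as HTML META tags."""
        return "\n".join(
            f'<meta name="{name}" content="{value}">'
            for name, value in record.items()
        )

    print(to_meta_tags(record))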
       
      Resource Description Framework
      The Resource Description Framework or RDF is a W3C Specification used 
        to assign metadata to electronic resources. A resource or document as 
        defined in the Resource Description Framework is anything that can have 
        an addressable location. This location information is like the call number 
of the book in the virtual library of the web. RDF then lets you assign
a property to this resource, such as "author", "title" or "publisher",
as defined by Dublin Core or some other set of conventions. RDF then
creates a statement: the author (as defined by, say, Dublin Core) of
Huckleberry Finn (as identified by a URL indicating the location of an
electronic version of this work) is Mark Twain (as defined by the Library
of Congress Personal Name file, which associates Mark Twain with Samuel
Clemens). The original architecture of the present web sacrificed link
integrity for scalability, transforming it into a new kind of information
space. In order for the Semantic Web to be equally
        successful, it must allow anybody to say anything about anything. The 
        Resource Description Framework allows any statement to be made as long 
        as there are pointers to who is saying so and in what context. 
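The Huckleberry Finn statement above can be written down as RDF triples. The sketch below uses the open-source Python library rdflib, which is not discussed in this paper, together with an invented example URL; it is meant only to show the subject, property and value shape of an RDF statement.

    # A sketch of RDF statements using the rdflib package (assumed installed).
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC

    g = Graph()
    g.bind("dc", DC)  # the Dublin Core element set

    # The resource: an invented address for an electronic text of the work.
    work = URIRef("http://example.org/texts/huckleberry-finn")

    # Statements: the title and the author (creator) of the resource.
    g.add((work, DC.title, Literal("Adventures of Huckleberry Finn")))
    g.add((work, DC.creator, Literal("Twain, Mark")))

    print(g.serialize(format="xml"))  # emit the graph as RDF/XML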
      XTM 
        
The XTM standard comes from an earlier, pre-web ISO standard for
        organizing information. XTM is used in the creation of Topic Maps. Topic 
        Maps organize large sets of information and build structured semantic 
        link networks over the resources. These networks allow easy and selective 
        navigation to requested information. Using XTM, a knowledge manager can 
specify a topic and then assign various associations to the other topics
        within the map and resources outside it. Thus in XTM one might start with 
        the Topic "Mark Twain", associate it with another Topic "Samuel Clemens" 
        and another Topic "Huckleberry Finn" and have the Topic "Huckleberry Finn" 
        point to either a URL or if it were in a library, a call number--anywhere 
        that this resource could be located. The association too can be defined 
        in various ways so that for example you can specify the exact nature of 
        the relationship between "Mark Twain" and "Samuel Clemens".  
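The Mark Twain example can be sketched as plain data structures to show what a topic map holds: topics, typed associations between them, and occurrences pointing out to resources. The classes and names below are invented for illustration and do not reproduce actual XTM syntax.

    # Illustrative topic-map-like structures in plain Python (not XTM syntax).
    from dataclasses import dataclass, field

    @dataclass
    class Topic:
        name: str
        occurrences: list = field(default_factory=list)  # URLs, call numbers, etc.

    @dataclass
    class Association:
        kind: str       # the exact nature of the relationship
        members: tuple  # the topics it connects

    twain = Topic("Mark Twain")
    clemens = Topic("Samuel Clemens")
    huck = Topic("Huckleberry Finn",
                 occurrences=["http://example.org/texts/huckleberry-finn"])  # invented URL

    associations = [
        Association("is the pen name of", (twain, clemens)),
        Association("is the author of", (twain, huck)),
    ]

    for a in associations:
        print(a.members[0].name, a.kind, a.members[1].name)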
      Although there is a degree of overlap between these two specifications, 
        they originate in two different information organizing practices. RDF 
        derives in part from book cataloging. With a unit of information "in hand" 
        one assigns various semantic tags so that, when it comes time to retrieve 
        the resource or collection of resources, the information seeker can locate 
        it using the various embedded or associated semantic hooks. XTM on the 
        other hand comes from the art of indexing. Indexing is an information 
        organizing technique in which the topics or "headings" or important elements 
        of "aboutness" are distilled from a larger "work". With indexing one moves 
        further into the conceptual realm creating networks of associations that 
        contextualize the whole. It may be possible for either of these technologies 
        to express either of these functions though the strengths and weaknesses 
        inherent in doing so have yet to be determined.  
       Machine to Human 
      The addition of semantic information to web resources would improve 
      information retrieval in ways yet unimagined. As Tim Bray said, search 
      engines "do the equivalent of going through the library, reading every 
      book, and allowing us to look things up based on the words in the text." 
If more metadata were available, one would not have to rely, as one does
when using Google, one of the better search engines, on the popularity
of a resource as assurance of its relevancy. Librarians, who often act
as human mediators between the esoteric eccentricities of structured
information and the often unformulated inquiries of information seekers,
know that information retrieval is often skewed and incomplete even when information
      is organized well. When organized badly or not at all, the consequences 
      are pitiful.  
       Machine to Machine
      Since all of this information is now digital, why have a human in the 
middle at all? Why not let "agents" mediate between the information and
the inquiry, saving people the time and bother associated with navigating
the complexities of information space? Tim Berners-Lee, in his March 2001
Scientific American article "The Semantic Web", illustrated how agents
using semantic information could be used to conduct research into everyday
      tasks such as investigating health care provider options, prescription 
      treatments, or available appointment times. Each of these tasks now is 
      usually conducted by a human researcher. If one is taking a trip, he/she 
      must investigate the best price for an airplane ticket, (even though some 
      of this information is already collocated), and match the information 
      about available flights with available times from a personal calendar. 
      This sort of research is conducted daily and one takes for granted the 
mental and representational systems needed to ask a question,
      investigate an answer, pull like information together, select from it that 
      which is relevant to the inquiry and initiate another set of actions based 
      on this selection.  
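The trip-planning example can be reduced to a small sketch of what such an agent would do once flight listings and calendar entries carry machine-readable semantics: pull like information together, select what fits, and act on the selection. All of the data and the helper function below are invented; a real agent would gather them from annotated sources.

    # Hypothetical data standing in for semantically annotated sources.
    from datetime import datetime

    flights = [
        {"flight": "XY101", "departs": datetime(2002, 3, 11, 8, 0),   "price": 420},
        {"flight": "XY205", "departs": datetime(2002, 3, 11, 14, 30), "price": 310},
        {"flight": "XY330", "departs": datetime(2002, 3, 12, 9, 15),  "price": 280},
    ]

    # Times the traveler's personal calendar shows as free (start, end).
    free_slots = [
        (datetime(2002, 3, 11, 12, 0), datetime(2002, 3, 11, 20, 0)),
        (datetime(2002, 3, 12, 7, 0),  datetime(2002, 3, 12, 12, 0)),
    ]

    def workable(flight, slots):
        """A flight is workable if it departs inside one of the free slots."""
        return any(start <= flight["departs"] <= end for start, end in slots)

    # Bring like information together, then select the cheapest workable option.
    candidates = [f for f in flights if workable(f, free_slots)]
    best = min(candidates, key=lambda f: f["price"])
    print("Book", best["flight"], "departing", best["departs"], "at $", best["price"])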
      Artificial-intelligence researchers have been working on methods to automate 
        these kinds of processes for many years. They have developed several approaches 
        that may be applicable to the Semantic Web.  
       
      Ontologies
      Ontology is a Greek word meaning "the study of being". Reality may "be" 
        but our perception of it is controlled in part by the mental mappings 
        we construct to make sense of it. As the theory goes, if a model of these 
        conceptual frameworks can be represented, then they can be queried. Thus 
        Artificial Intelligence employs Ontologies to define for the computer 
        what is "real". Although ontologies often include the taxonomic hierarchies 
        of classes found in library classification systems such as the Library 
        of Congress Subject Headings or Dewey Decimal System, ontologies can also 
        include other terms to create abstract representations of the logics of 
        a system.  
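A toy example may help fix the idea: a taxonomic hierarchy of the kind found in library classification, plus one relation that a pure hierarchy cannot express. The terms and relations below are invented for illustration and do not represent any published ontology.

    # Illustrative only: a tiny ontology as a class hierarchy plus one relation.
    subclass_of = {
        "Novel": "LiteraryWork",
        "LiteraryWork": "Work",
        "Person": "Agent",
    }

    # A non-hierarchical relation: authorship.
    authored_by = {
        "Huckleberry Finn": "Mark Twain",
    }

    def is_a(term, ancestor):
        """Walk up the hierarchy to answer a simple class-membership query."""
        while term is not None:
            if term == ancestor:
                return True
            term = subclass_of.get(term)
        return False

    print(is_a("Novel", "Work"))            # True: a Novel is a kind of Work
    print(authored_by["Huckleberry Finn"])  # Mark Twain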
       
      Inference Engines 
      Once ontologies have been established, programs can be written to apply 
      logical formulations to the represented world as defined by the ontology. 
      Inference Engines can be used to translate this "reasoning" in order to 
      "prove" or verify the validity and quality of the information it is 
      retrieving. Inference Engines can also automate the discovery of web 
      services.  
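The sketch below shows the flavor of such reasoning in miniature: a single rule applied repeatedly to known facts until nothing new can be inferred. It is a generic forward-chaining illustration with invented facts, not the mechanism of any particular inference engine.

    # A minimal forward-chaining sketch with invented facts (not a real engine).
    facts = {
        ("Huckleberry Finn", "created_by", "Mark Twain"),
        ("Mark Twain", "same_person_as", "Samuel Clemens"),
    }

    def rule_same_person(facts):
        """If X created_by A and A same_person_as B, infer X created_by B."""
        inferred = set()
        for (x, p, a) in facts:
            if p != "created_by":
                continue
            for (a2, q, b) in facts:
                if q == "same_person_as" and a2 == a:
                    inferred.add((x, "created_by", b))
        return inferred

    changed = True
    while changed:  # keep applying the rule until a fixed point is reached
        new = rule_same_person(facts) - facts
        changed = bool(new)
        facts |= new

    print(("Huckleberry Finn", "created_by", "Samuel Clemens") in facts)  # True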
      Neural Networks
      Research in Artificial Intelligence is driven by the desire to infuse 
      computers with intelligence. As Science attempts to model the human mind, 
      one moves further from information organization towards information 
      processing. How exactly the human mind constructs and then makes sense of 
      the world it has created is a complex question with no clearly defined 
      answers. As the Russian/American novelist Vladimir Nabokov said "The 
      difference between an ape's memory and human memory is the difference 
      between an ampersand and the British Museum library....The stupidest 
      person in the world is an all-round genius compared to the cleverest 
      computer. How we learn to imagine and express things is a riddle with 
      premises impossible to express and a solution impossible to imagine." 
      Scientists will continue, however, to try, and we do not yet know what the 
      results of their efforts will be. 
 
  
      Electronic Data-Interchange and E-Commerce
Automated information processing, of course, holds great commercial
      promise. Most people would love to be spared the frustration of having to 
      understand how information is organized in order to effectively find what 
      they are looking for. Traversing the labyrinths of information space is 
      often exasperating and time consuming. It is common in libraries to hear 
      some peeved user state "this library stinks" simply because the inquiry 
      resulted in an answer more complicated than he/she believed necessary. 
      Simple searches on the web often turn up exactly what one is looking for 
      but who has not spent too many hours scanning long results lists in the 
      retreating hope of finding what one is really looking for? Any company 
      capable of easing this pain stands to make a fortune.  
      Challenges for the semantic web
      To begin at the beginning, metadata must become a part of the web environment. 
Despite the efforts of the World Wide Web Consortium (W3C) and other pioneering thinkers,
it is very difficult to communicate the importance of this proposition
to a general public that has long taken for granted the information organizing
        structures already in place. On top of that, for many users, the "keyword 
        search" is sufficient; the retrieval of any information rather 
        than a complete set of information on a given topic is preferable 
        if it means less frustration or time spent on the problem. The implications 
        for this "fast food" approach to inquiry are beyond the scope of this 
paper. Convincing those in charge of the millions of web pages that make
up the current web that time and resources spent on adding semantic information
to those pages are well spent is a conundrum; the value only becomes
visible once a sufficient quantity of such information is available. For
this reason it may be that semantic web technologies will first take hold
inside "semantic islands": communities of practice, knowledge webs,
enterprise or industry intranet portals and the like, since semantic web
technologies have to sit on shared ontologies, and it does not make sense
to expect ontologies shared by all web resources and users.
         
      Likewise, there are very few mainstream tools with which web editors 
      can add metadata, RDF or Topic Maps to their website. There are few work 
      processes in place. Many corporations have libraries but libraries are not 
      always associated with web development. Information management has been, 
      for the most part underappreciated; its value only apparent when, as now, 
      society generates more data than it can keep track of.  
      Future directions
Many diverse interests converge on the Web, and the Semantic
Web can mean different things to different people. The library perspective
        is a unique one. Librarians wrestle with the tension between the right 
        of artists, writers and creators of all communications to control the 
        use of their works and the right of the public to use those works for 
        scholarship, edification and entertainment, especially those without the 
        resources to access these documents directly. Librarians care not just 
whether people get information but whether they get the right information:
accurate, appropriate, unbiased. Librarians insist that information from
all viewpoints be collected and that no single interest or perspective
has priority over another. These concerns, while not of primary importance
        to many of the "stakeholders" of the web, will become increasingly 
        important to society as the web evolves. The traditional library is already 
        a very different place than it was a decade ago. As each of the communities 
        whose current work can or will be extended by the evolution of the web 
        converge, it is inevitable that the web as we know it will transform into 
        an information space very different from the one we currently inhabit. 
        Which will it more resemble: a well organized, efficiently and humanely 
        run library or Borges' Library of Babel?  