Structuring and Indexing the Internet

Ray Denenberg, Library of Congress
[email protected]
December, 1996

This paper was presented as the keynote address at the Workshop on Earth Observation Catalogue Interoperability sponsored by the European Space Agency and EC Centre for Earth Observation, 14-15 November 1996, in Ispra, Varese, Italy. When asked to speak on this topic, Structuring and Indexing the Internet, my initial response was that I felt it pretentious to address a topic so broad and nebulous. I was graciously offered the opportunity to change the title, but never got around to it. A more suitable title for this paper (though I was not cynical enough to propose it) would be "The Futility of contemplating trying to Structure and/or Index the Internet".

Abstract

Indexing the Internet (the global index model), currently the fashionable approach to information discovery, is at the bottom of the spectrum of semantic interoperability. A viable alternative is distributed searching, where sources maintain local indexes. Structuring the Internet, in a manner lending itself to effective navigation across distributed repositories (the navigation model), is perhaps highest along the spectrum of semantic interoperability.
The global index model has limitations; the most severe are lack of support for fielded searching, and the fact that mainly web pages get indexed while most real documents and objects do not.
Distributed searching overcomes problems of the global index model but has its own limitations, primarily: The advertisement model, popular in the global index scenario, can be defeated. Secondly, when raw, ranked results are merged, the resultant rankings are unlikely to be meaningful.
The navigation model, though highest along the semantic interoperability spectrum, comes at significant cost. Effective navigation requires coherently organized collections, which requires intellectual resources for aggregation, organization, and description. In addition, cataloging resources (human or automated) are required to create necessary descriptive records.
Z39.50 profiles are developed both for the distributed searching and navigation models.

Indexing Vs. Structuring

Indexing and structuring are two different approaches to the problem of information discovery. This is not to say they cannot be used effectively in combination, but for simplicity, I will address them separately.
Indexing, as in "Indexing the Internet", is the approach to information discovery currently in vogue, but a viable alternative is structuring information, in a way that lends itself to effective navigation across distributed repositories, and is an approach that I think provides more satisfying results. Unfortunately, it is not an approach that has gained much favor within the Internet community, certainly not for web access.
For purposes of this discussion, information discovery means locating objects of interest when the population of object from which to choose is potentially widely distributed. The nature of the distribution can range from chaotic to highly-organized. At one extreme, there may be a wealth of information available on a topic of interest, but the information is scattered haphazardly and randomly, perhaps across the Internet.
On the other hand, objects may be aggregated into collections, organized thematically (by subject, creator, historical period, etc.). Although the objects in a collection may be distributed, the collection appears logically cohesive because metadata structures are associated with the collection, supporting coherent navigation.

Search Vs. Navigation

These two models of distribution, haphazard versus organized, are associated with different models of information discovery: searching and navigation.
In a typical search scenario, a user sends a query to an information source and is subsequently presented with a summary of results. The information source might be an intermediary, or meta-searcher, that selects and relays the query to several real sources, retrieves individual result sets, and merges them into a single set. Whether the query is sent to a single end-source (the global index model) or multiple sources (the distributed search model) is a crucial distinction, but it is not relevant to the typical search scenario, from an end-user point-of-view. In either case the user receives perhaps a one-line summary for each relevant document, then, perhaps based simply on the number of documents, narrows (or broadens) and re-submits the query. This process may continue iteratively until the result size is manageable. The user might then select a document for retrieval.
In a navigation scenario, a user maneuvers or is guided from node to node along a global information tree until arriving at a document of interest.
Browsing is probably a more familiar information discovery behavior than navigation, particularly on the web. Browsing refers to the unstructured, serendipitous process whereby a user, hoping to arrive at a document of interest, maneuvers from link-to-link (or page-to-page) without any informed tools to help decide what link to follow. Navigation is perhaps best described as value-added browsing, or browsing with tools.
For simplicity, I will consider the two models, search and navigation, separately, though in the real world, distribution models may be hybrids between haphazard and organized, and correspondingly, searching and navigation can be used effectively in combination.

Indexes Vs. Meta-Structures

Corresponding to these two behavioral models, search and navigation, are data models. The search model is based on indexes; the navigation model is based on meta-structures.

Indexes

Indexing is central to any search model, whether searching a single- source or distributed information. In a typical search model, a query is applied to one or more databases, and there are often indexes associated with each database, that facilitate or optimize searching. In a given implementation a physical index need not exist, but indexing is an essential modelling abstraction, so for this discussion we assume the existence of indexes.

Global Index Vs. Local Indexes

I will distill the discussion of indexing to two broad models: the single, global index, versus independently maintained local indexes. However, there are a few complications to dispense with first.
To begin, there is no single, global, index. Rather, there are several competing, redundant global indexes. For simplicity, assume just one.
Secondly, whether global or local, there may be a single, flat index supporting raw-text searching, or several indexes, corresponding to different fields, supporting fielded searching, for example to search on author, title, or subject. The current model of the web is a single, logical agglomeration of documents, with a single, logical flat index of its entire content. So for this discussion, assume no fielded searching for the global index model.
Finally, to further complicate matters, a global index might logically be a single, unified index, but phys- ically distributed. This model, distributed indexing, is a special case of the global index model. It should not be confused with distributed searching, discussed below, which is based on the local index model.

Distributed Indexing

As I noted, distributed indexing may be dispensed with, as a special case of the global index model. First though, since distributed indexing has been a fashionable topic lately, I offer a few observations.
To begin, the issue of whether and how a global, shared index is to be distributed and managed is far from solved, and might not be solvable.
Secondly, distributed indexing is not just a technical problem. A large index may be a significant corporate asset, and the business implications of distribution are significant.
And finally, despite these barriers we cannot simply dismiss distributed indexing as an unsolvable problem, unless we also dismiss the global index model. Most experts agree that "indexing the Internet", or even any significant sector, into a single, physical index, is not a viable concept, and some form of distribution must be employed.

Three Models

To synthesize, we have three discovery models: global index, distributed searching, and navigation. In all cases, information is distributed. The global index model of searching absorbs the distributed index model. Distributed searching applies when each source maintains its own index. Navigation applies when there are coherent organizing structures imposed on the data.
I now consider each of these models separately, though I continue to stress, real-world approaches will likely be hybrids of these three models.

Global-index Issues

A global index is created through a process where some computer program collects indexing information from the various sources to be indexed. The program is referred to as a crawler (or robot, or spider). It traverses the web's global hypertext structure, recursively retrieving referenced pages. Actually, there are several of these crawlers, employing different traversal strategies. They create independent indexes, each, in a sense, duplicating the work of the others. (Note: by virtue of recursive traversal, some crawlers perform other functions besides creating indexes. These include HTML validation, statistics gathering, dead link detection, and file mirroring. We limit discussion of crawlers to indexing.) There are dozens of technical issues associated with this process, for example:

How much of a given document should be processed by a crawler? At the extremes, some crawlers process the title only, and others, the whole document. Between these extremes, some index an arbitrary number of lines at the beginning of a document (for example, the first 100 lines) and then, perhaps, the first line of each paragraph.
How often should a given document be indexed? More frequent accesses do not always improve the quality of an index. Some documents need be indexed only once because they never change. Some change, but infrequently. At the other end of the spectrum, some volatile documents should never be indexed because they change so rapidly (for example, a daily newspaper).
How deep a traversal algorithm should be employed? Breadth-oriented traversal is more likely to locate a broader set of documents, though more superficially than depth-oriented traversal.

Along with technical complexities, there are clear deficiencies with the model.

Flat indexes.: Many experts feel that searching for raw text, as opposed to fielded searching, is one of the most crucial impediments both to semantic interoperability and relevance of search results. In the global index model, all of the words in a document (or all words in an arbitrary subset of the document) are indexed, without consideration to the context of the terms. In contrast, a sophisticated search engine may create many indexes, for example, individual title, author, and subject indexes, as well as an index for words in the "body of text", possibly an "abstract" index for words appearing in the abstract, and perhaps many others. The complexity is more than a matter of scale. Different search engines index differently and consequently may have completely different sets of indexes. Trying to create a global index supporting fielded searches becomes a near-futile task.
Distribution.: The problem of physical distribution of the single, logical index, may never be solved.
Web pages.: Mainly web pages get indexed, and most real documents and objects do not, because of the way the index is created (by traversing web pages).
Crawlers.: There are serious problems with the crawlers that create the indexes.

Problems with Crawlers

Crawlers use vast amounts of bandwidth. They grab entire documents, many of which should not be indexed, for reasons of efficiency, suitability, or propriety. They index identical versions of the same document (for example by visiting mirror sites); they index the same documents repeatedly, even when the documents have not changed. And worst of all, there are many of these crawlers, all trying to index the Internet, redundantly. They create independent indexes, each duplicating the work of the others.
Crawlers can be very annoying to a server. They may create multiple concurrent tasks, consuming a large portion of available processing power. They may access a server repeatedly. They skew statistics, frustrating a server's attempts to obtain accurate information about user accesses.
Crawlers, essentially, are dumb: they do not know where to look, so they look everywhere; they download inappropriate data (a crawler might download an image and try to index it).
Crawlers cannot, in general, discern the relative importance of documents. A web server might have a home pages, pages with administrative information, and then pages with dynamic and useful information. A crawler does not know how to distinguish these pages.
Crawlers have problems with context. A web page describing the "Subway System of Washington D.C." points to a page titled "Departure Stations". "Departure stations", then, is indexed with no context.
Crawlers access mirror sites, and as I noted above this wastes bandwidth, but another problem with mirror sites is that they may be very much out-of-date.
Some servers are un-accessible when the crawler is run, so important information is simply missed. And closely related to the missing information problem is the problem of latency: the most current information usually is never indexed, because of the period between the time it is put up on the web and when it is first indexed.
And finally, there are bad crawlers. One of the ethical issues is balancing the high cost of indexing against the greater good. Perhaps a global index is created at exorbitant cost to the community at large, but maybe it is a "good" index and serves the community at large. On the other hand is the index created at high cost to the community, and it is a "bad" index, or its created for private use.

Proposed Solutions

There are various proposals and suggested architectures for solving these problems, based on cooperation: among the index providers, information resources, and information creators (authors and publishers). One suggestion is that these various web crawlers share, or partition, the work of indexing the web. While this would provide great efficiency improvement, it is not a practical suggestion, for reasons of complexity and business.
According to another proposed architecture, the information servers would provide indexing information, either for indexers to capture, or alternatively, to send to the indexers (the pull and push models, respectively) based on the theory that the server is in the best position to decide what should and should not be indexed. This model has appeal because not only would it improve the quality of the indexing information (as the theory goes), but would also dramatically reduce the amount of information transferred and thus save considerable bandwidth.
The problems with this approach is that it requires altogether too much cooperation (in contrast to a model where the server plays a passive role) and it defeats the proprietary index schemes employed by the indexing companies; they derive significant revenue from their proprietary algorithms.
Another suggestion is that the information creators themselves, the authors, provide metadata for their documents. That approach is not practical either, for a number of reasons: it, too, defeats the proprietary schemes of the indexers; it requires too much work of the authors and they are not willing to do it; and even if they were willing, they probably would do it wrong, and bad metadata is worse than no metadata at all. Another problem with author-created metadata is the temptation for the author to attach terms simply for the purpose of increasing the potential ranking of the document.
There are primitive mechanisms employed on the web for a server to designate files that should not be indexed, if and how often a document should be indexed, date of creation, last update, last verification, and (expected) next update. These are passive methods, and there is research towards development of more active mechanisms, for a server to notify indexers, when, for example, the content of a document has changed and it should be (re-)indexed.
Finally, perhaps the most radical of proposals is that search engines adopt a common standard for indexing. This suggestion is always rejected, because the proprietary indexing scheme is often the essence of a search engine.
Despite the problems cited above, "Indexing the web", or at least giving the impression of doing so, remains a lucrative enterprise, since there still are a number of companies doing it. But as the number of web documents continues to grow, almost exponentially, the percentage of the documents indexed will continue to drop.

Distributed Searching

Distributed searching is often advanced as an alternative to the global index model. In the distributed search model a client sends a query to a meta- search engine which relays the query to several real information sources, integrates the results (see note), and presents a single, logical result set to the client.
Note: for simplicity, assume that integration is accomplished by retrieving individual result sets and merging them, though there are more efficient mechanisms.
Distributed searching overcomes many of the problems cited in the global index model, for example, fielded searching may be supported and searches may located real objects (not just web documents). On the other hand, distributed searching does have limitations.

Limitations of Distributed Searching

There are two major limitations to the distributed search model, one economic and the other technical.
The economic problem is the advertisement model. Often, one or more of the individual end-sources is a commercial service that searches free- of-charge but derives revenue from placing advertisements in the output results. This economic model based on advertising, popular in the global index model, can be defeated with distributed searching; the meta-searcher can strip out the advertisement.
A proposed solution to this problem is to develop a model by which advertisements are carried along with content. A more radical proposal is to abolish the advertisement model altogether (based on the sentiment that no economic force more strongly skews the free-enterprise model than advertisement). This is highly unlikely to happen because of the opportunities available for companies to reach customers at very low cost, and also because many users actually have come to expect advertising, and some feel cheated if advertisements are not present.
The technical limitation to distributed searching is the merging of results, and more importantly, ranking the merged results. Different search engines employ different ranking algorithms. When a meta-searcher merges raw, ranked results, the resultant merged rankings are unlikely to be very meaningful.

Normalizing Rankings

There are several techniques proposed or under study to allow a meta- searcher to produce normalize ranked results. Among these are the following.

The simplest technique is for the metasearcher to request that each server employ a specific, public ranking algorithm, in lieu of its own native and proprietary algorithm. (In this scenario, each search engine informs the meta-searcher what algorithms it supports, by supplying ranking-algorithm-ids. The meta-searcher selects a common id, without necessarily knowing anything about the identified algorithm.)
A major (non-technical) problem with this approach is that the proprietary algorithm employed is often much more suited to the particular search engine and its indexing structure, so by asking the search engine to use a different algorithm, much of the potential native ranking power is lost.
A more complex proposal is for the meta-engine to retrieve normalization information from each search engine, calibrate each set of ranked results, and produce a unified, ranked result set.
Techniques have been developed to allow a client to specify ranking criteria along with a query, for example, what weight to associate with a given term.
Another interesting though experimental approach is for the metasearcher to actually retrieve all of the documents located by the query, from the various servers, into a temporary (pseudo) database, and execute the original query against that database. This technique, though innovative, likely would be effective only on a small-scale.

Mechanisms (1), (2), and (3) are supported by the Z39.50 ZDSR profile described below.

Z39.50 Profile for Simple Distributed Search and Ranked Retrieval

The Stanford Protocol for Internet Search and Retrieval (STARTS), an initiative of the Stanford Digital Library Project, developed requirements for distributed searching and ranked retrieval during the Spring and Summer of 1996.
In July 1996, several Z39.50 experts collaborated with participants in the STARTS project to develop what was originally called the ZSTARTS profile (Z39.50 Profile for STARTS) renamed the Z39.50 Profile for Simple Distributed Search and Ranked Retrieval, ZDSR.
The profile assumes that queries pertain to text documents -- not just web pages but real documents (just documents though; not arbitrary objects).
It support searching by title, date-last-modified, author, language, url, and within the body of the text. It also supports relevant feedback searches, and stem and phonetic searching. Search results may be restricted by threshold score, or maximum number of documents.
The query includes a restriction component and a ranking component. The restriction component is the normal Z39.50 boolean query, used (in this profile) to specify the documents that qualify, for subsequent ranking. The ranking component is a list of terms each assigned a relative weight by the client.
Using this Z39.50 profile a client searches for documents, and retrieve document descriptors containing metadata about the documents. For a given document, the document descriptor includes title, abstract, publication and creation date and date last modified, size of the document, the score that the server assigned to the document for the query, and per-term meta data: for each term that was in the query, its frequency and weight; and finally, a pointer (URL) which may be used to retrieve the document (document retrieval is otherwise out-of-scope for the ZDSR profile).
[Note: as of December 1996, the ZDSR search access points and the elements comprising the document descriptors are not completely determined.]

Z39.50 and Metadata

This leads to a brief digression on the general topic of metadata, in particular, Z39.50 and metadata.
Z39.50 deals intimately with metadata, though subtly so -- subtle for two reasons: first, the term "metadata" itself came into vogue recently (relative to Z39.50) so Z39.50 has addressed metadata without referring to it as such.
Secondly, elements that are metadata from one point-of-view might not be metadata from another. For example, bibliographic elements -- author, title, publisher -- are integral elements of a bibliographic record, so although they may be metadata for the object described by a bibliographic record, they are not metadata for the bibliographic record.
In other words, since metadata is data about the subject data, if bibliographic elements are the subject data, by definition they are not metadata. Z39.50 was originally used primarily in a bibliographic context. These ZDSR document descriptors, which are fashionably described as document metadata, are, from a Z39.50 perspective, really bibliographic records.
Z39.50 implicitly recognizes a number of different categories and levels of metadata. Most fundamentally, Z39.50 distinguishes search elements from retrieval elements, that is, between the search access points and the elements in a retrieved Z39.50 record; these may be, but are not necessarily, the same. For example, a bibliographic record for a book might include a spine title as a retrievable element of the bibliographic record; the record might be searchable by title but not by spine title. Another example: an object (say, an image) may be searchable by a unique identifier, but that identifier is not an integral part of the object.
Other categories of Z39.50 metadata are:

Transient metadata associated with a query, for example the score of a document.
Metadata associated with the specific form of an object available in multiple forms, for example, its size, format, and possibly even the cost to retrieve it (which may vary by format).
Metadata that applies differently to different users: for example, terms and conditions (or rights and restrictions, etc.).
Server-level metadata, database-level metadata, and various other classes of metadata available via the Z39.50 Explain facility.
Collection level metadata, which Z39.50 has introduced for the navigation model.

Navigation

In contrast to the prevalent model of haphazard distribution of documents over the Internet, institutions are beginning to assemble compilations of documents, or more generally, objects into (relatively) coherently organized collections, which users may navigate to locate objects of interest.
Informally, a collection is an aggregation of related objects and subcollection. Collections are organized thematically, for example by subject, creator, or historical period; they may have diverse objects: text documents, images, audio, video, or arbitrary binary objects. These objects often have Associated Descriptions (described below). Collections are often hierarchical and may be distributed.
Support for effective navigation has two broad requirements: tools for semantic interoperability, and expenditure of resources, both intellectual and cataloging resources. Semantic interoperability in this context means standard navigational search semantics as well as meta-data structures, and clients designed to navigate, based on these semantics and structures.
Semantic interoperability is the easy part. Effective navigation requires coherently organized collections, which may require significant intellectual resources for aggregation, organization, and description. And in addition to the intellectual efforts, resources (either human or automated) are required to create records based on these meta-structures.

Collection Descriptive Record

An example of a navigational meta-structure is the Collection Descriptive Record, defined by the Z39.50 Collections profile, described below. The Collection Descriptive Record includes summary descriptive and navigational information for a collection. It includes a brief description, to help a user decide if a particular collection is of interest. Besides the brief description, there may be other descriptive items pertaining to the collection, that may be more comprehensive, or perhaps machine processible. These are termed Associated Descriptions. A user might view the brief description to decide whether to retrieve the Associated Description.

Associated Description

An Associated Description describes a collection or an object, and may take different forms for different categories of collections and objects. For example, it might be a finding aid, an SGML Encoded Archival Description, a cataloging record, an exhibition catalog (for museum collections); a GILS record, or even a web page. The Z39.50 Catalog Interoperability Protocol (CIP) profile defines structures including the CIP Item Descriptor, and CIP Browse data. These could be considered Associated Descriptions.
A collection may have several Associated Descriptions, and a client can retrieve descriptive information about them (essentially, description about description -- meta-meta data). This includes a brief description of the Associated Description itself, its type, size and format, and a pointer to the Associated Description, in case the user or client decides to retrieve it. This meta-meta information serves two purposes: it might help the end-user decide whether to retrieve the more comprehensive description, or the machine- processible type may allow the client to determine whether it is capable of processing it.

Related Collections

The descriptive record also includes a list of related collections, for example a parent (or otherwise superior collection) or a child (or otherwise subordinate collection) to which the user might wish to navigate (i.e. retrieve its descriptive record) if the collection that the record describes is too narrow or too broad.
A related collection may be a context collection, meaning it is the highest level superior collection likely to be of interest. It might be a sibling. Or it might not have any familial relationship: the descriptive re- cord may point to a collection that simply "might be of interest" if the user is interested in the given collection.
For a given collection, the descriptive record lists a set of related collections, and for each, describes the relationship and provides a pointer to enable the client to retrieve the collection level descriptive record for that collection.

Enumerated Objects

The collection descriptive record also enumerates the members of the collection. For each object, there is a brief description, to help the user decide whether to retrieve the object, and a pointer, to retrieve the object.

Collections Profile

In 1995 the Library of Congress convened a team from several institutions to develop a Z39.50 profile for access to digital libraries. The scope was narrowed to apply to navigation of digital collections, and was named the Z39.50 Profile for Access to Digital Collections, informally, the Collections Profile. The larger problem of access to digital libraries was left to the province of other profiling efforts, one of which, led by the Library of Congress, developed the Z39.50 Profile for Access to Digital Library Objects, informally, the Digital Library Objects (DL) Profile. Other groups were initiating independent efforts to develop profiles aimed at specific types of objects and collections. The intention was to coordinate these efforts and that these latter profiles would be developed as compatible extensions or subsets of the Collections profile.

The Collections Suite of Profiles

The Collections profile is an umbrella profile for navigating collections. It defines the Collection Descriptive Record (described above) and provides the framework for the development of extensions for specific domains. These extensions are called companion profiles to the Collections profile: the CIMI profile for access to museum objects (developed by the Consortium for the Computer Interchange of Museum Information as part of its Project CHIO: Cultural Heritage Information Online), a profile for access to digital library objects, initiated by the Library of Congress, for access to the LC digital library and similar collections, and the Catalog Interoperability Protocol (CIP) profile, for access to Earth Observation Data and associated data resources being developed by the Protocol Task Team within the Committee on Earth Observation Satellites (CEOS).
Work began on the CIP profile independent of the Collections profile development. Technically, CIP is not yet a companion profile to the collections profile but there are efforts underway to align the Collections and CIP profile. The CIP profile is one of the more advanced and well- developed of the Z39.50 profiles and one that exploits the power and capability of Z39.50.
Both the Z39.50 community at large and the CIP community believe there is mutual benefit in harmonizing the Collections and CIP profiles, so that CIP would be a conformant companion profile. At present, work continues to achieve this harmonization.
There is currently investigation into Collection companion profiles for access to multimedia objects, and musical objects.

Library of Congress

Comments: [email protected] (12/30/96)