The URLs (Uniform Resource Locators) used for the World Wide Web (WWW) have been, and will continue to be, very useful, but they have a major shortcoming: each URL is tied to a particular file on a particular computer. A URL will no longer find a document or image if the file has to be moved to a different computer (for example, when hardware is replaced) or if the computer has to be given a different Internet address (for example, after a reorganization or a merger of companies). Any links to that resource, whether from WWW pages, catalog records, or finding aids, would then have to be changed. The URL form is therefore not the answer for naming objects in a digital archive being built for the long term.
The Library of Congress does not have to develop a system for global lookup of names. The Internet community has been exploring ways of implementing a scheme for names that will be globally unique and persistent (i.e., names that will not need to be changed if the relevant files must be moved). The term Uniform Resource Name (URN) is used to describe the type of identifier that is needed. At the December meeting of the Internet Engineering Task Force, there was consensus among the active participants in this effort to establish a single technical framework in which the different naming schemes that have been proposed could be implemented and deployed. At some time in the future, one URN scheme may be chosen as a standard, but the Library of Congress does not have to be too concerned about picking the right "horse" now. The different candidate schemes all allow owners of resources considerable flexibility in forming names.
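The difference between a location and a persistent name can be illustrated with a minimal sketch. It is not any of the proposed URN schemes; the class name, the name string, and the host names below are all hypothetical and chosen only for illustration. Links store the name, and a lookup table supplies the current location, so moving a file means updating one table entry rather than every link.

```python
class NameResolver:
    """Toy lookup table from persistent names to current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, name, url):
        # Record (or update) where the named resource currently lives.
        self._locations[name] = url

    def resolve(self, name):
        # The location is looked up at access time, not stored in the link.
        return self._locations[name]


# Hypothetical name and hosts; none of these come from an actual scheme.
resolver = NameResolver()
name = "example-library/collection-a/item-0001"
resolver.register(name, "http://host1.example/files/item-0001.tif")

# A finding aid or catalog record would hold only the name.
link_in_finding_aid = name

# The file moves to new hardware: only the resolver entry changes.
resolver.register(name, "http://host2.example/archive/item-0001.tif")
current_location = resolver.resolve(link_in_finding_aid)
```

After the move, the link held in the finding aid is unchanged and still resolves, which is the property a URL alone cannot provide.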
LC does have to develop conventions that will be used to name LC resources within such a global framework. The NDLP Digital Naming Committee has developed a draft document suggesting naming conventions that are appropriate for the NDLP collections, bearing in mind the need for a structure that supports project control and quality assurance during production, retrieval using the existing archive implementation, and compatibility with MARC bibliographic records. The naming structure suggested by the committee incorporates components representing an aggregate (collection), an item within that aggregate, and, for some types of item, files that must be integrated to form the item. These components can be incorporated into a URN format when one is available for use.
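A three-part name of this kind could be modeled as in the sketch below. The committee's draft conventions are not reproduced here, so the field names, the "/" separator, and the sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectName:
    """Hypothetical name with aggregate, item, and optional file parts."""

    aggregate: str              # the collection
    item: str                   # an item within that collection
    file: Optional[str] = None  # a component file, for multi-file items

    def as_string(self) -> str:
        # Join the parts with "/" (an assumed separator, not LC's).
        parts = [self.aggregate, self.item]
        if self.file is not None:
            parts.append(self.file)
        return "/".join(parts)

    @classmethod
    def parse(cls, text: str) -> "ObjectName":
        # Accept exactly two or three non-empty parts.
        parts = text.split("/")
        if len(parts) not in (2, 3) or not all(parts):
            raise ValueError(f"not a valid name: {text!r}")
        return cls(*parts)


# Sample values are invented for the example.
page_image = ObjectName("collection-a", "item-0001", "page-003")
whole_item = ObjectName.parse("collection-a/item-0001")
```

Keeping the components separate, rather than storing only the joined string, lets production tools group files by item or by aggregate, and lets the same components be placed into a URN format later.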
This part of the digital archive will require research and development. No one has yet developed a system that permits storage and retrieval of objects of arbitrary type and format and controls access on the basis of copyright status (or similar provisions imposed by terms of gifts or relating to individual privacy). The first object-oriented database systems are being developed and coming to market now, but none today provides the functionality that will be required for a large long-term archive that is part of a distributed, national digital library accessible to the general public. LC will explore options through partnerships with outside organizations, while continuing to add NDLP collections using the existing implementation.
Whatever the system (or systems) selected for managing the collection of digital objects, there is a key question to resolve: what information about each object must be stored with that object in the repository? The term metadata is used (some might say overused) to refer to information about information, as opposed to the information content itself. In the print world, the distinction was clear: a catalog card (and later its MARC equivalent) was obviously different from the book it described. One was stored in a catalog drawer and the other on the shelf. In the digital world, both items are sequences of bits stored in a computer's file system. However, the logical distinction remains important, and the future tools that can be built for a user to discover and retrieve a relevant item from a digital library will depend on decisions made about which items of metadata to store with the item in the repository.
It is clear that, to perform the functions required of it, the repository must hold basic technical and administrative information about each item, such as digital format, copyright status, and the information encoded in the naming scheme in current use, which includes identification of the aggregate (collection) to which the item belongs. It is also clear, at the other extreme, that the resources do not exist to prepare full descriptive MARC records for each item in the archive; many will be accessed primarily through an online finding aid or register. An important challenge for the NDLP is to determine which items of metadata should be in the repository, and which will be held only in related indexes or catalogs (such as records in MUMS or the brief item-level MARC records available for a few of the NDLP collections) or in the items themselves (for example, in the headers of TIFF image records or SGML-encoded documents). This challenge will be addressed as part of the prototype development under way.
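The minimal repository record described above might look like the following sketch. It is not the NDLP prototype; the class names, the copyright labels, and the access rule are hypothetical. It shows only the idea that the repository keeps a small technical and administrative record with each item and consults it when an item is requested, while fuller description lives elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositoryRecord:
    name: str              # identifier under the naming scheme in use
    aggregate: str         # collection to which the item belongs
    digital_format: str    # for example "TIFF" or "SGML"
    copyright_status: str  # hypothetical label, see below


# Assumed labels under which an item may be served to the public.
PUBLIC_STATUSES = {"public-domain", "permission-granted"}


class Repository:
    """Toy in-memory store of content bytes plus one record per item."""

    def __init__(self):
        self._records = {}
        self._content = {}

    def store(self, record, content):
        self._records[record.name] = record
        self._content[record.name] = content

    def describe(self, name):
        # Return the metadata record without touching the content.
        return self._records[name]

    def retrieve(self, name, requester_is_public=True):
        # Check the stored copyright status before releasing content.
        record = self._records[name]
        if requester_is_public and record.copyright_status not in PUBLIC_STATUSES:
            raise PermissionError(f"{name} is not open to the public")
        return self._content[name]
```

Descriptive information such as titles or subjects is deliberately absent from the record; in this sketch it would be held in a catalog or finding aid that refers to the item by name.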
The hardware and software components for LC's Hierarchical Storage Management System will be acquired from commercial vendors. Making cost-effective use of storage is a common problem in the data processing industry. Systems that automatically relegate lesser-used files to cheaper storage are being developed for the commercial data processing market. The system acquired by LC will provide storage management support to files associated with other computing applications using LC's UNIX servers, such as Thomas.
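The core idea of relegating lesser-used files to cheaper storage can be sketched as a simple age-based policy. Commercial systems are considerably more elaborate; the tier names, the idle threshold, and the day-count timestamps below are assumptions made for illustration and do not describe the system LC will acquire.

```python
from dataclasses import dataclass


@dataclass
class StoredFile:
    name: str
    last_access_day: int  # day number of the most recent access
    tier: str = "disk"    # assumed fast, expensive tier


def migrate_cold_files(files, today, max_idle_days, cold_tier="tape"):
    """Move files idle longer than max_idle_days to the cheaper tier.

    Returns the names of the files that were moved.
    """
    moved = []
    for f in files:
        idle_days = today - f.last_access_day
        if f.tier != cold_tier and idle_days > max_idle_days:
            f.tier = cold_tier
            moved.append(f.name)
    return moved


def access(f, today, hot_tier="disk"):
    # Using a file brings it back to the fast tier and resets its idle time.
    f.tier = hot_tier
    f.last_access_day = today
```

Because migration and recall are driven only by access times, applications continue to refer to files by the same names regardless of the tier on which they currently reside.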
Digital Archive Structure:4 --
(12/27/95)