Collection: INTERAMER
Number: 71
Year: 2002
Author: Johann Van Reenen, Editor
Title: Digital Libraries and Virtual Workplaces. Important Initiatives for Latin America in the Information Age
4. Technical issues
Fully implementing an ETD (electronic thesis or dissertation) initiative on a campus requires applying the latest technology, since the overall aim is to prepare students and universities to function effectively in the Information Age. The following subsections paint a high-level portrait of some of the key technical issues.
Infrastructure requirements
Digital libraries focus on the content dimension of modern information technology, which also depends on two other key dimensions: computing and communication. They are made possible, and can operate at the global scale needed for NDLTD (the Networked Digital Library of Theses and Dissertations), in large part because other forces, such as the growth of the Internet and the requirements of research and education, have led to sufficient processing power and bandwidth.
On many campuses, graduate
students have their own computers, or gain access to computers in their research
groups, in their departments, in college or campus computing laboratories,
in media centers, or in library resource rooms. Most campuses have wireless
networks for laptops, or wired local area networks. Student residences may
have network connections served by the campus, an ISP, or modems allowing
access to a wide variety of local or commercial services. Local networks are
connected to regional or national networks or high-speed backbones. Countries
continuously increase the bandwidth of their connections to the rest of the
global information infrastructure, leading to further improvements in services
for students.
Since a typical ETD requires only about a megabyte of storage, it can be managed with inexpensive systems and networks. Only if large multimedia works, such as videos, are included is it necessary to use more significant amounts of storage or bandwidth. Even a large video (e.g., the several gigabytes required for a full movie compressed according to the MPEG-2 standard) is not expensive to store. Storage costs are now under US $5 per gigabyte, and will continue to shrink by roughly half each year into the foreseeable future. Thus, if, as at Virginia Tech, students pay a roughly US $20 archiving fee when submitting an ETD, that fee will more than cover the storage expense even for submissions with extensive multimedia materials. With most campuses collecting fewer than a thousand ETDs per year, even if the average ETD size increases from 1 to 100 megabytes, the total yearly storage requirement can be managed easily on a PC or small workstation. Similarly, transmitting an ETD over a network requires resources comparable to downloading a software package over the Web.
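As a rough check on these figures, here is a back-of-the-envelope calculation sketched in Python (the per-gigabyte cost, fee, and per-year volume come from the text above; the 100-megabyte average is the pessimistic, multimedia-heavy case):

    # Back-of-the-envelope yearly storage estimate for one campus.
    etds_per_year = 1000        # typical campus volume (from the text)
    avg_size_mb = 100           # pessimistic average with heavy multimedia
    cost_per_gb_usd = 5.0       # storage cost cited above
    archiving_fee_usd = 20.0    # per-ETD fee charged at Virginia Tech

    total_gb = etds_per_year * avg_size_mb / 1024
    print(f"Yearly storage: {total_gb:.0f} GB")                  # ~98 GB
    print(f"Storage cost: US${total_gb * cost_per_gb_usd:.0f}")  # ~US$488
    print(f"Fees collected: US${etds_per_year * archiving_fee_usd:,.0f}")  # US$20,000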
More demanding than hardware
or software, however, is providing services to the local campus and to other
groups involved in NDLTD. Today, a federated search service is available at
www.theses.org (Powell & Fox 1998), which provides a moderate level of
support by routing queries to the currently small number of sites that allow
searching of local collections. Fortunately, the Open Archives Initiative (Van de Sompel 2000) makes it relatively easy for a local campus to expose, through a harvesting protocol, those works for which public access is allowed. A small amount of additional software suffices for a Web server to support harvesting, so that www.theses.org or other sites can collect all available metadata.
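To show how little machinery is involved, here is a minimal harvesting sketch in Python (the base URL is hypothetical; the ListRecords verb and oai_dc metadata prefix are standard OAI-PMH, though a production harvester would also handle resumption tokens and error conditions):

    from urllib.request import urlopen
    from urllib.parse import urlencode
    import xml.etree.ElementTree as ET

    # Hypothetical OAI-PMH endpoint of a campus ETD archive.
    BASE_URL = "http://etd.example.edu/oai"

    # Request Dublin Core records via the standard ListRecords verb.
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    with urlopen(f"{BASE_URL}?{query}") as response:
        tree = ET.parse(response)

    # Print each harvested title; the namespace URI is fixed by the
    # Dublin Core specification.
    DC = "{http://purl.org/dc/elements/1.1/}"
    for title in tree.iter(f"{DC}title"):
        print(title.text)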
Even at the global scale, if, say, metadata for ten million ETDs (each record probably requiring less than 1 kilobyte of storage) were aggregated, the total storage involved would be only on the order of 10 gigabytes. Thus, providing a centralized search service built upon harvesting from what may eventually be thousands of universities is feasible. Further, a collection of that size could be replicated at a number of regional sites, increasing reliability and improving performance.
Production of ETDs
Production of ETDs in NDLTD
should be the job of students, supported by university infrastructure. Here
we consider some further details that extend the discussions in the section
on students. Preparing an ETD typically requires common hardware and software
readily available to graduate students. Only if multimedia content is included
is it necessary to use scanners, audio or video capture devices, or other
special input units when converting from analog to digital data. For such
content, it also may be necessary to employ special software packages, as
might be available and supported in a media center. Further, after producing
a desired rendering of key research concepts, it may be necessary to convert
to archival standards (e.g., JPEG, MPEG) in order to ensure future use.
To be usable with computers, content must be encoded using some type of representation scheme; fundamentally, that is what happens in any software system that allows manipulation of digital content. To shift from one representation to another, it is necessary either to import content in one form and export it in another, or to employ a conversion or translation tool. If large numbers of conversions are involved, or if the translation process is complex, scripts may be used to help automate the process. If space is an issue, conversion may involve compression, to reduce storage or network transfer costs, followed by eventual decompression when the content is rendered to its final display, sound, or print form. In any case, standards should be followed as much as possible, to facilitate interchange and preservation.
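For instance, here is a minimal sketch of lossless compression and decompression in Python (zlib is a standard-library binding to the widely used DEFLATE algorithm; the file name is illustrative):

    import zlib

    # Compress an ETD chapter for storage or transfer
    # ("chapter1.xml" is an illustrative file name).
    with open("chapter1.xml", "rb") as f:
        raw = f.read()
    compressed = zlib.compress(raw, level=9)
    print(f"{len(raw)} bytes -> {len(compressed)} bytes")

    # Decompression restores the original bytes exactly (lossless),
    # as needed when the work is rendered for display or print.
    assert zlib.decompress(compressed) == raw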
Generally, standards exist for common types of content. Only for unusual or highly interactive multimedia content is it likely that no suitable standard has yet been developed. For example, with packages like HyperCard™, AuthorWare™, or Director™, when special programs or scripts are involved, the only recourse may be to provide a vendor-specific, secret, proprietary file. In such cases it is recommended that, to allow at least partial preservation into the future, a sequence of screen dumps, exports of the text of scripts or routines, and other partial views or extracts also be produced and retained. The bottom line in all this is for students to understand key concepts of content, storage, manipulation, interchange, and reuse, so as to be prepared for future work with digital information.
Page description languages
The most popular representation of ETDs is inside word processing systems. However, these forms typically involve vendor-specific, secret, proprietary schemes, so for interchange and preservation, conversion to a more standard form is needed. In this subsection we explore further the use of PDF; the next subsection considers XML.
Many modern printers receive data ready to be printed in the PostScript language, developed in the 1980s by Adobe. To increase portability and functionality, Adobe developed PDF in the 1990s; its Distiller tool converts PostScript to PDF, a file format that includes a section containing page image descriptions. Other parts of a PDF file may include hypertext links, images, thumbnail versions of pages, digital signatures, a table of contents or bookmark structure, and other information. PDF is a published standard that other software companies have adopted as well, and it should become an international standard too. One noteworthy
feature of PDF is that it is scalable, so that those with limited visual abilities
may enlarge parts of a document as needed to enhance perception. Further,
it supports annotation, so that draft ETDs can have notes added by reviewers
to pass on corrections and suggestions. A digital signature feature allows
the work to be secured so as to ensure authenticity. Watermarking allows ownership
to be asserted so subsequent unauthorized use can be detected. Other tools
may allow searching inside a PDF file for particular words or phrases. Doubtless
additional capabilities and enhancements will extend its utility, probably
helping position it to facilitate some of the operations now feasible with
XML.
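As an example of the tool support mentioned above, here is a minimal sketch of searching inside a PDF for a phrase, using the third-party pypdf library (an assumption; any PDF library with text extraction would serve, and extraction quality varies with how the PDF was produced):

    from pypdf import PdfReader  # third-party: pip install pypdf

    # Search a PDF-format ETD for a phrase, page by page
    # ("etd.pdf" and the phrase are illustrative).
    reader = PdfReader("etd.pdf")
    phrase = "digital library"
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if phrase.lower() in text.lower():
            print(f"Found on page {number}")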
Markup languages
ETDs can be interchanged
and preserved using SGML or XML. Given current trends, it is most likely that
XML will be used, so the following discussion focuses on that scheme; working
with SGML would be similar except in some details.
One use of XML is to encode
metadata about ETDs. That concept was explored in connection with applying
Dublin Core to ETDs at the fall 1999 DC-7 Conference in Frankfurt. Discussion continued at a May 2000 Berlin meeting and at a short meeting at ECDL’2000 in Lisbon in September 2000, and further consensus was reached at a January 2001 meeting hosted by OCLC in Dublin, Ohio. XML also
can encode entire ETDs, typically according to a structuring standard or DTD
as described in the section for students.
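To make the metadata case concrete, here is a minimal sketch that produces a Dublin Core record for an ETD as XML, using Python's standard library (the element values and identifier are illustrative; the dc namespace URI is the one defined by the Dublin Core community):

    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", DC)

    # Build a small Dublin Core record for a hypothetical ETD.
    record = ET.Element("metadata")
    for element, value in [
        ("title", "A Study of Digital Library Interoperability"),
        ("creator", "Student, Jane Q."),
        ("date", "2002-05-15"),
        ("type", "Electronic Thesis or Dissertation"),
        ("identifier", "http://etd.example.edu/available/etd-0515102"),
    ]:
        ET.SubElement(record, f"{{{DC}}}{element}").text = value

    print(ET.tostring(record, encoding="unicode"))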
In 1988, the first SGML DTD for ETDs was developed for Virginia Tech by SoftQuad. Neill Kipp
developed a newer version in 1997. XML versions were later developed at Virginia
Tech, University of Iowa, University of Michigan, University of Montreal,
and other locations. Any of these structures allows an ETD to be prepared
and later searched, displayed, printed, or reused in part. Further, it may
be possible to convert most if not all of a work between the structuring described
by one DTD and that of another DTD, so at least some portability is ensured.
It is hoped that this matter will be explored further by the NDLTD standards committee, which aims to support as much standardization as is feasible given the many requirements involved in allowing graduate students in all disciplines, countries, language groups, and educational settings to participate.
Preparing XML can be done through conversion from word processing systems (e.g., Word or WordPerfect) or formatting schemes (e.g., LaTeX). From a word processor, the usual export target is some well-known interchange form, such as RTF, that can carry style and other information as well as textual content. Translators configured to convert particular RTF sequences into XML constructs then prepare an XML document that can be checked with an XML parser and refined with an XML editor.
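Here is a minimal sketch of the checking step, using Python's standard-library parser (this verifies well-formedness only; validating against a DTD would require a validating parser, such as the one in the third-party lxml library; the file name is illustrative):

    import xml.etree.ElementTree as ET

    # Check that a converted ETD is well-formed XML.
    try:
        tree = ET.parse("etd.xml")
        print("Well-formed; root element:", tree.getroot().tag)
    except ET.ParseError as err:
        print("Not well-formed:", err)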
XML editors also can be
used directly by authors to prepare the entire ETD. This style of authoring
may be particularly appropriate for some types of research where many media
objects carry the content. For example, this was done with a chemistry ETD
prepared at Virginia Tech that used SGML tools to prepare the document skeleton,
which referred to scores of VRML and other special files that used virtual
reality and other representations to carry the bulk of the message. However,
until training about XML and support for it with powerful tools expands, such
an approach is likely to require either extensive knowledge or a good deal
of assistance by campus personnel.
The final stage of working with XML involves rendering, or presenting, the research results. Standards like XSL and corresponding tools, along with definitions of how to present each XML construct, allow content to be portrayed in human-readable forms.
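Here is a minimal sketch of such rendering with an XSLT stylesheet, using the third-party lxml library (an assumption; the file names are illustrative):

    from lxml import etree  # third-party: pip install lxml

    # Apply an XSLT stylesheet to an XML-format ETD to produce HTML
    # ("etd.xml" and "etd-to-html.xsl" are illustrative file names).
    document = etree.parse("etd.xml")
    stylesheet = etree.XSLT(etree.parse("etd-to-html.xsl"))
    html = stylesheet(document)
    print(etree.tostring(html, pretty_print=True, encoding="unicode"))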
Metadata, crosswalks, packaging, naming standards
Today, most ETDs are catalogued in a local library. Typically, the data is represented using a MARC (Machine-Readable Cataloging) scheme, such as USMARC, UKMARC, or UNIMARC. “Crosswalks,” or conversion routines, exist to convert from one such form to another, from MARC to XML, or vice versa. For example, Robert France at Virginia Tech developed a MARC-to-XML converter so that Open Archives sites can export MARC-encoded metadata through XML.
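To illustrate the idea, here is a toy crosswalk in Python (the field selection and flat record shape are illustrative; real crosswalks, such as the Library of Congress MARC-to-Dublin-Core mapping, cover many more fields and subfields):

    # Illustrative MARC-to-Dublin-Core crosswalk for a few common fields.
    MARC_TO_DC = {
        "245": "title",        # title statement
        "100": "creator",      # main entry, personal name
        "260": "date",         # publication, distribution, etc.
        "520": "description",  # summary / abstract
    }

    def crosswalk(marc_record):
        """Convert a flat MARC tag->value mapping to Dublin Core."""
        return {MARC_TO_DC[tag]: value
                for tag, value in marc_record.items()
                if tag in MARC_TO_DC}

    print(crosswalk({"245": "A Study of Interoperability",
                     "100": "Student, Jane Q.",
                     "520": "This thesis examines..."}))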
ETDs sometimes are more than a single document supplemented with metadata. When there are multiple parts, it is common to store them as separate files in a single directory. It is simple to upload each of these files, and for readers to download some or all as desired. Packaging with schemes like tar or zip is a bit risky, since those formats are not highly standard or portable. In the future, however, digital library packaging schemes may emerge that are more suitable.
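If packaging is nevertheless needed, here is a minimal sketch using Python's standard library (the directory and archive names are illustrative):

    import zipfile
    from pathlib import Path

    # Package all files of a multi-part ETD directory into one zip archive
    # ("etd-0515102" is an illustrative directory name).
    with zipfile.ZipFile("etd-0515102.zip", "w", zipfile.ZIP_DEFLATED) as archive:
        for path in Path("etd-0515102").rglob("*"):
            if path.is_file():
                archive.write(path, path.relative_to("etd-0515102"))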
Naming of ETDs is another realm for standardization. OCLC’s PURL and CNRI’s handle schemes provide URN (uniform resource name) methods for attaching persistent names to ETDs, so that the works can be located by those names now and into the foreseeable future.
Post processing
The final stages of production
of digital content are usually referred to as “post processing.” On occasion,
university staff may undertake some conversion to standard forms (usually
then saving both the “raw” and converted forms). Typically, though, these
stages proceed after all checking and correction is completed, and a final
version is received. In the case of ETDs, this involves student submission
of the approved version. Only in rare cases will some important correction
or addendum be allowed thereafter, which can be handled through typical version
control schemes, with suitable approvals recorded.
Protecting ETDs involves
several types of special processing. Authenticating an ETD calls for ensuring
that it remains unchanged relative to the original submission. By computing
a number of mathematical functions over an ETD file, such as parity, checksum,
or hash codes, a record can be produced that can be compared with the results
of the same computations made over what is assumed to be a proper copy. This
type of process is used with digital signatures, which also include certification
that a trusted party vouches for the signatures. In the case of watermarks, an image chosen by the rights owner can be overlaid on the content, so that the source and recipient of a distributed digital object can be proven. In steganography, data is hidden inside a digital object, in ways that are hard to remove even after subsequent analysis or compression; arbitrary information may be recorded this way for later use in prosecuting thieves. All of these schemes may be deployed when desired by authors, or as standard practice at individual institutions, as needed to ensure the integrity of policies regarding preservation, protection, and rights management.
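Here is a minimal sketch of the checksum comparison described above, using Python's standard hashlib module (the file names are illustrative; a real digital signature additionally encrypts such a digest with the signer's private key):

    import hashlib

    def sha256_of(path):
        """Compute the SHA-256 digest of a file, reading in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Record the digest at submission time, then verify any copy later.
    recorded = sha256_of("etd-approved.pdf")
    assert sha256_of("etd-copy.pdf") == recorded, "copy differs from original"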
Further protection is required to account for physical damage, disasters, or other attacks on ETD archives. Copies should be made on various media forms, such as CD-ROM, which may have a long shelf life and may be immune to electromagnetic forces. Stronger security results from having copies at multiple locations, preferably distant from the master copy. Backups, off-site storage, and mirroring provide safety, and mirroring in particular also may help improve access for remote users.
Dissemination of ETDs
Though providing access
to ETDs is not the most crucial part of NDLTD activities, supporting dissemination
is an important responsibility. The first aspect of this involves identifying
ETDs. As mentioned, some URN scheme is needed so that each graduating student’s work receives a permanent identifier that ensures persistent access thereafter.
An ETD can be assigned an ISBN (e.g., as is done by UMI, which considers that
it is thus publishing a book) or a DOI (i.e., digital object identifier, often
given by publishers). URNs like PURLs or handles can be used (and, indeed,
DOIs build upon handle technology). If a university participates in the Open
Archives Initiative, then each work is assigned a unique identifier in that
archive, and the archive in turn has a unique identifier in the OAI registry.
A second support for dissemination
is to have a metadata record for each ETD, which carries any classification
and cataloging data available. Whether some type of MARC-based scheme or Dublin
Core form is used, some standard interchange mechanism, like MARC transport
format or XML, also is required. When possible, the metadata should follow
standards developed by NDLTD to support global resource discovery. Typically,
in addition to title and abstract, there should be author-assigned keywords,
entries according to a discipline-specific classification system, and entries
made following more general schemes, such as the Library of Congress Subject Headings, the Dewey Decimal Classification, and UNESCO or UMI categories.
Finally, the metadata records
about ETDs, possibly supplemented with the actual ETD content itself, should
be used to support resource discovery and access. Typical approaches are explained
in the following two subsections.
Databases and information retrieval systems
Managing submission of ETDs and supporting subsequent access can be aided by database management technology. Anthony Atkins at Virginia Tech has developed successive versions of such software, refined it, made it portable, and supports its use by many NDLTD members. It has in turn been adapted for multilingual use in Spain and other countries, and for large projects such as the Australian initiative. At MIT, the Dienst software (Lagoze & Davis 1995) has been adapted instead, while other software has been developed at sites like the University of Montreal in Canada and Humboldt University in Germany.
In addition to managing submission, workflow, and metadata fields with database tools, information retrieval systems are often used to support searching and browsing in ETD collections. In most cases this is done with software already used on campus for other searching efforts or for library automation services.
One generous offering by an NDLTD member, VTLS Inc., is to use its powerful library automation system, Virtua, to support the worldwide initiative free of charge. VTLS is happy to receive either MARC-type or XML-formatted metadata for all ETDs created worldwide, in any language, and to provide a centralized union catalog search service through Virtua. Since the metadata provided should include a unique identifier for each ETD described, this mechanism should provide valuable support for discovering and accessing ETDs.
Searching
Since students learning about ETDs should gain proficiency in searching through digital libraries, it is important that they develop suitable skills. They should understand data and metadata, and be able to work with metadata records that eventually lead to ETDs. In particular, they should understand the 15 elements of the Dublin Core (Dublin Core Community 1999) and how searches can be built using one or more of them. They should understand classification and categorization schemes, how to browse through thesauri, how to narrow or broaden a search, how to navigate to related concepts, how to combine elements of a description, and the principles behind set-based and ranking-based retrieval systems. They should understand full-text searching as well as content-based multimedia retrieval (e.g., of images, sounds, or videos). Further, they should feel comfortable with varying styles of interfaces, with searching by queries or examples, with schemes involving relevance feedback, and with information summarization and visualization mechanisms aimed at enhancing their ability to find relevant information. For all this to be possible, universities and others supporting NDLTD should provide powerful services and ensure that students gain the requisite skills for effective functioning in the Information Age.
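As a toy illustration of fielded searching and ranking over Dublin Core records, here is a minimal sketch in Python (the records and the scoring are illustrative; production systems use inverted indexes and far more sophisticated ranking):

    # Toy ranked search over Dublin Core metadata records (illustrative data).
    RECORDS = [
        {"title": "Digital library interoperability", "subject": "libraries"},
        {"title": "Virtual workplaces in education", "subject": "education"},
        {"title": "Library automation services", "subject": "libraries"},
    ]

    def search(query, records=RECORDS):
        """Rank records by how many query words appear in any element."""
        words = query.lower().split()
        scored = []
        for record in records:
            text = " ".join(record.values()).lower()
            score = sum(word in text for word in words)
            if score:
                scored.append((score, record))
        return [rec for _, rec in sorted(scored, key=lambda pair: -pair[0])]

    for hit in search("digital libraries"):
        print(hit["title"])   # best match first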