Collection: INTERAMER
Number: 71
Year: 2002
Author: Johann Van Reenen, Editor
Title: Digital Libraries and Virtual Workplaces. Important Initiatives for Latin America in the Information Age
4. Technical issues
Fully implementing an ETD (electronic thesis or dissertation) initiative on a campus requires applying the latest technology, since the overall aim is to prepare students and universities to function effectively in the Information Age. The following subsections paint a high-level portrait of some of the key technical issues.
Infrastructure requirements
Digital libraries focus on the content dimension of modern information technology, which also depends on two other key dimensions: computing and communication. They are made possible, and can operate at the global scale needed for NDLTD (the Networked Digital Library of Theses and Dissertations), in large part because other forces, such as the growth of the Internet and the requirements of research and education, have led to sufficient processing power and bandwidth.
On many campuses, graduate
students have their own computers, or gain access to computers in their research
groups, in their departments, in college or campus computing laboratories,
in media centers, or in library resource rooms. Most campuses have wireless
networks for laptops, or wired local area networks. Student residences may
have network connections served by the campus, an ISP, or modems allowing
access to a wide variety of local or commercial services. Local networks are
connected to regional or national networks or high-speed backbones. Countries
continuously increase the bandwidth of their connections to the rest of the
global information infrastructure, leading to further improvements in services
for students.
Since a typical ETD requires only about a megabyte of storage, it can be managed with inexpensive systems and networks. Only if large multimedia works, such as videos, are included is it necessary to use more significant amounts of storage or bandwidth. Even a large video (e.g., the several gigabytes required for a full movie compressed according to the MPEG-2 standard) is not expensive to store. Storage costs are now under US $5 per gigabyte, and will continue to shrink by roughly half each year into the foreseeable future. Thus, if, as at Virginia Tech, students pay a roughly US $20 archiving fee when submitting an ETD, that fee will more than cover the storage expense even for submissions with extensive multimedia materials. With most campuses collecting fewer than a thousand ETDs per year, even if the average ETD size increases from 1 to 100 megabytes, the total yearly storage requirement can be managed easily on a PC or small workstation. Similarly, transmitting an ETD over a network requires resources comparable to downloading a software package over the Web.
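As a rough check on these figures, here is a back-of-the-envelope calculation sketched in Python (the per-gigabyte cost, fee, and per-year volume come from the text above; the 100-megabyte average is the pessimistic, multimedia-heavy case):

    # Back-of-the-envelope yearly storage estimate for one campus.
    etds_per_year = 1000        # typical campus volume (from the text)
    avg_size_mb = 100           # pessimistic average with heavy multimedia
    cost_per_gb_usd = 5.0       # storage cost cited above
    archiving_fee_usd = 20.0    # per-ETD fee charged at Virginia Tech

    total_gb = etds_per_year * avg_size_mb / 1024
    print(f"Yearly storage: {total_gb:.0f} GB")                  # ~98 GB
    print(f"Storage cost: US${total_gb * cost_per_gb_usd:.0f}")  # ~US$488
    print(f"Fees collected: US${etds_per_year * archiving_fee_usd:,.0f}")  # US$20,000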
More demanding than hardware
or software, however, is providing services to the local campus and to other
groups involved in NDLTD. Today, a federated search service is available at
www.theses.org (Powell & Fox 1998), which provides a moderate level of
support by routing queries to the currently small number of sites that allow
searching of local collections. Fortunately, the Open Archives Initiative (Van de Sompel 2000) makes it relatively easy for a local campus to expose, through a harvesting protocol, those works for which public access is allowed. A small amount of additional software suffices for a Web server to support harvesting, so that www.theses.org or other sites can collect all available metadata.
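To show how little machinery is involved, here is a minimal harvesting sketch in Python (the base URL is hypothetical; the ListRecords verb and oai_dc metadata prefix are standard OAI-PMH, though a production harvester would also handle resumption tokens and error conditions):

    from urllib.request import urlopen
    from urllib.parse import urlencode
    import xml.etree.ElementTree as ET

    # Hypothetical OAI-PMH endpoint of a campus ETD archive.
    BASE_URL = "http://etd.example.edu/oai"

    # Request Dublin Core records via the standard ListRecords verb.
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    with urlopen(f"{BASE_URL}?{query}") as response:
        tree = ET.parse(response)

    # Print each harvested title; the namespace URI is fixed by the
    # Dublin Core specification.
    DC = "{http://purl.org/dc/elements/1.1/}"
    for title in tree.iter(f"{DC}title"):
        print(title.text)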
Even at the global scale, if, say, metadata for ten million ETDs (each record probably requiring less than 1 kilobyte of storage) were aggregated, the total storage involved would be only on the order of 10 gigabytes. Thus, providing a centralized search service built upon harvesting from what may eventually be thousands of universities is feasible. Further, a collection of that size could be replicated at a number of regional sites, increasing reliability and improving performance.
Production of ETDs
Production of ETDs in NDLTD
should be the job of students, supported by university infrastructure. Here
we consider some further details that extend the discussions in the section
on students. Preparing an ETD typically requires common hardware and software
readily available to graduate students. Only if multimedia content is included
is it necessary to use scanners, audio or video capture devices, or other
special input units when converting from analog to digital data. For such
content, it also may be necessary to employ special software packages, as
might be available and supported in a media center. Further, after producing
a desired rendering of key research concepts, it may be necessary to convert
to archival standards (e.g., JPEG, MPEG) in order to ensure future use.
To be usable with computers, content must be encoded using some type of representation scheme; fundamentally, that is what happens in any software system that allows manipulation of digital content. To shift from one representation to another, it is necessary either to import content in one form and export it in another, or to employ a conversion or translation tool. If large numbers of conversions are involved, or if the translation process is complex, scripts may be used to help automate the process. If space is an issue, conversion may involve compression, to reduce storage or network transfer costs, followed by eventual decompression when the content is rendered to its final display, sound, or print form. In any case, standards should be followed as much as possible, to facilitate interchange and preservation.
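For instance, here is a minimal sketch of lossless compression and decompression in Python (zlib is a standard-library binding to the widely used DEFLATE algorithm; the file name is illustrative):

    import zlib

    # Compress an ETD chapter for storage or transfer
    # ("chapter1.xml" is an illustrative file name).
    with open("chapter1.xml", "rb") as f:
        raw = f.read()
    compressed = zlib.compress(raw, level=9)
    print(f"{len(raw)} bytes -> {len(compressed)} bytes")

    # Decompression restores the original bytes exactly (lossless),
    # as needed when the work is rendered for display or print.
    assert zlib.decompress(compressed) == raw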
Generally, standards exist for common types of content. Only for unusual or highly interactive multimedia content is it likely that no suitable standard has yet been developed. For example, with packages like HyperCard™, AuthorWare™, or Director™, when special programs or scripts are involved, the only recourse may be to provide a vendor-specific, secret, proprietary file. In such cases it is recommended that, to allow at least partial preservation into the future, a sequence of screen dumps, exports of the text of scripts or routines, and other partial views or extracts also be produced and retained. The bottom line in all this is for students to understand key concepts of content, storage, manipulation, interchange, and reuse, so as to be prepared for future work with digital information.
Page description languages
The most popular representation of ETDs is inside word processing systems. However, these forms typically involve vendor-specific, secret, proprietary schemes, so for interchange and preservation, conversion to a more standard form is needed. In this subsection we explore further the use of PDF; the next subsection considers XML.
Many modern printers receive data ready to be printed in the PostScript language, developed in the 1980s by Adobe. To increase portability and functionality, Adobe developed PDF in the 1990s; its Distiller tool converts PostScript to PDF, a file format that includes a section containing page image descriptions. Other parts of a PDF file may include hypertext links, images, thumbnail versions of pages, digital signatures, a table of contents or bookmark structure, and other information. PDF is a published standard that other software companies have adopted as well, and it should become an international standard too. One noteworthy
feature of PDF is that it is scalable, so that those with limited visual abilities
may enlarge parts of a document as needed to enhance perception. Further,
it supports annotation, so that draft ETDs can have notes added by reviewers
to pass on corrections and suggestions. A digital signature feature allows
the work to be secured so as to ensure authenticity. Watermarking allows ownership
to be asserted so subsequent unauthorized use can be detected. Other tools
may allow searching inside a PDF file for particular words or phrases. Doubtless
additional capabilities and enhancements will extend its utility, probably
helping position it to facilitate some of the operations now feasible with
XML.
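As an example of the tool support mentioned above, here is a minimal sketch of searching inside a PDF for a phrase, using the third-party pypdf library (an assumption; any PDF library with text extraction would serve, and extraction quality varies with how the PDF was produced):

    from pypdf import PdfReader  # third-party: pip install pypdf

    # Search a PDF-format ETD for a phrase, page by page
    # ("etd.pdf" and the phrase are illustrative).
    reader = PdfReader("etd.pdf")
    phrase = "digital library"
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if phrase.lower() in text.lower():
            print(f"Found on page {number}")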
Markup languages
ETDs can be interchanged
and preserved using SGML or XML. Given current trends, it is most likely that
XML will be used, so the following discussion focuses on that scheme; working
with SGML would be similar except in some details.
One use of XML is to encode
metadata about ETDs. That concept was explored in connection with applying
Dublin Core to ETDs at the fall 1999 DC-7 Conference in Frankfurt. Discussion continued at a May 2000 Berlin meeting and at a short meeting at ECDL’2000 in Lisbon in September 2000, and further consensus was reached at a January 2001 meeting hosted by OCLC in Dublin, Ohio. XML also
can encode entire ETDs, typically according to a structuring standard or DTD
as described in the section for students.
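To make the metadata case concrete, here is a minimal sketch that produces a Dublin Core record for an ETD as XML, using Python's standard library (the element values and identifier are illustrative; the dc namespace URI is the one defined by the Dublin Core community):

    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", DC)

    # Build a small Dublin Core record for a hypothetical ETD.
    record = ET.Element("metadata")
    for element, value in [
        ("title", "A Study of Digital Library Interoperability"),
        ("creator", "Student, Jane Q."),
        ("date", "2002-05-15"),
        ("type", "Electronic Thesis or Dissertation"),
        ("identifier", "http://etd.example.edu/available/etd-0515102"),
    ]:
        ET.SubElement(record, f"{{{DC}}}{element}").text = value

    print(ET.tostring(record, encoding="unicode"))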
In 1988, the first SGML DTD for ETDs was developed for Virginia Tech by SoftQuad. Neill Kipp
developed a newer version in 1997. XML versions were later developed at Virginia
Tech, University of Iowa, University of Michigan, University of Montreal,
and other locations. Any of these structures allows an ETD to be prepared
and later searched, displayed, printed, or reused in part. Further, it may
be possible to convert most if not all of a work between the structuring described
by one DTD and that of another DTD, so at least some portability is ensured.
It is hoped that this matter will be explored further by the NDLTD standards committee, which aims to support as much standardization as is feasible given the many requirements involved in allowing graduate students in all disciplines, countries, language groups, and educational settings to participate.
Preparing XML can be done through conversion from word processing systems (e.g., Word or WordPerfect) or formatting schemes (e.g., LaTeX). From a word processor, the usual export target is some well-known interchange form, such as RTF, that can carry style and other information as well as textual content. Translators configured to convert particular RTF sequences into XML constructs then prepare an XML document that can be checked with an XML parser and refined with an XML editor.
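Here is a minimal sketch of the checking step, using Python's standard-library parser (this verifies well-formedness only; validating against a DTD would require a validating parser, such as the one in the third-party lxml library; the file name is illustrative):

    import xml.etree.ElementTree as ET

    # Check that a converted ETD is well-formed XML.
    try:
        tree = ET.parse("etd.xml")
        print("Well-formed; root element:", tree.getroot().tag)
    except ET.ParseError as err:
        print("Not well-formed:", err)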
XML editors also can be
used directly by authors to prepare the entire ETD. This style of authoring
may be particularly appropriate for some types of research where many media
objects carry the content. For example, this was done with a chemistry ETD
prepared at Virginia Tech that used SGML tools to prepare the document skeleton,
which referred to scores of VRML and other special files that used virtual
reality and other representations to carry the bulk of the message. However,
until training about XML and support for it with powerful tools expands, such
an approach is likely to require either extensive knowledge or a good deal
of assistance by campus personnel.
The final stage of working with XML involves rendering, or presenting, the research results. Standards like XSL and corresponding tools, along with definitions of how to present each XML construct, allow content to be portrayed in human-readable forms.
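Here is a minimal sketch of such rendering with an XSLT stylesheet, using the third-party lxml library (an assumption; the file names are illustrative):

    from lxml import etree  # third-party: pip install lxml

    # Apply an XSLT stylesheet to an XML-format ETD to produce HTML
    # ("etd.xml" and "etd-to-html.xsl" are illustrative file names).
    document = etree.parse("etd.xml")
    stylesheet = etree.XSLT(etree.parse("etd-to-html.xsl"))
    html = stylesheet(document)
    print(etree.tostring(html, pretty_print=True, encoding="unicode"))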
Metadata, crosswalks, packaging, naming standards
Today, most ETDs are catalogued in a local library. Typically, the data is represented using a MARC (Machine-Readable Cataloging) scheme, such as USMARC, UKMARC, or UNIMARC. “Crosswalks,” or conversion routines, exist to convert from one such form to another, from MARC to XML, or vice versa. For example, Robert France at Virginia Tech developed a MARC-to-XML converter so that Open Archives sites can export MARC-encoded metadata through XML.
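To illustrate the idea, here is a toy crosswalk in Python (the field selection and flat record shape are illustrative; real crosswalks, such as the Library of Congress MARC-to-Dublin-Core mapping, cover many more fields and subfields):

    # Illustrative MARC-to-Dublin-Core crosswalk for a few common fields.
    MARC_TO_DC = {
        "245": "title",        # title statement
        "100": "creator",      # main entry, personal name
        "260": "date",         # publication, distribution, etc.
        "520": "description",  # summary / abstract
    }

    def crosswalk(marc_record):
        """Convert a flat MARC tag->value mapping to Dublin Core."""
        return {MARC_TO_DC[tag]: value
                for tag, value in marc_record.items()
                if tag in MARC_TO_DC}

    print(crosswalk({"245": "A Study of Interoperability",
                     "100": "Student, Jane Q.",
                     "520": "This thesis examines..."}))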
ETDs sometimes are more than a single document supplemented with metadata. When there are multiple parts, it is common to store them as separate files in a single directory. It is simple to upload each of these files, and for readers to download some or all as desired. Packaging with schemes like tar or zip is a bit risky, since those formats are not highly standard or portable. In the future, however, digital library packaging schemes may emerge that are more suitable.
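If packaging is nevertheless needed, here is a minimal sketch using Python's standard library (the directory and archive names are illustrative):

    import zipfile
    from pathlib import Path

    # Package all files of a multi-part ETD directory into one zip archive
    # ("etd-0515102" is an illustrative directory name).
    with zipfile.ZipFile("etd-0515102.zip", "w", zipfile.ZIP_DEFLATED) as archive:
        for path in Path("etd-0515102").rglob("*"):
            if path.is_file():
                archive.write(path, path.relative_to("etd-0515102"))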
Naming of ETDs is another realm for standardization. OCLC’s PURL and CNRI’s handle schemes provide URN (uniform resource name) methods for attaching persistent names to ETDs, so that the works can be located by those names now and into the foreseeable future.
Post processing
The final stages of production
of digital content are usually referred to as “post processing.” On occasion,
university staff may undertake some conversion to standard forms (usually
then saving both the “raw” and converted forms). Typically, though, these
stages proceed after all checking and correction is completed, and a final
version is received. In the case of ETDs, this involves student submission
of the approved version. Only in rare cases will some important correction
or addendum be allowed thereafter, which can be handled through typical version
control schemes, with suitable approvals recorded.
Protecting ETDs involves
several types of special processing. Authenticating an ETD calls for ensuring
that it remains unchanged relative to the original submission. By computing
a number of mathematical functions over an ETD file, such as parity, checksum,
or hash codes, a record can be produced that can be compared with the results
of the same computations made over what is assumed to be a proper copy. This
type of process is used with digital signatures, which also include certification
that a trusted party vouches for the signatures. In the case of watermarks, an image chosen by the rights owner can be overlaid on the content, so that the source and recipient of a distributed digital object can be proven. In steganography, data is hidden inside a digital object, in ways that are hard to remove even after subsequent analysis or compression; arbitrary information may be recorded this way for later use in prosecuting thieves. All of these schemes may be deployed when desired by authors, or as standard practice at individual institutions, as needed to ensure the integrity of policies regarding preservation, protection, and rights management.
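Here is a minimal sketch of the checksum comparison described above, using Python's standard hashlib module (the file names are illustrative; a real digital signature additionally encrypts such a digest with the signer's private key):

    import hashlib

    def sha256_of(path):
        """Compute the SHA-256 digest of a file, reading in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Record the digest at submission time, then verify any copy later.
    recorded = sha256_of("etd-approved.pdf")
    assert sha256_of("etd-copy.pdf") == recorded, "copy differs from original"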
Further protection is required to account for physical damage, disasters, or other attacks on ETD archives. Copies should be made on various media forms, such as CD-ROM, which may have a long shelf life and may be immune to electromagnetic forces. Stronger security results from having copies at multiple locations, preferably distant from the master copy. Backups, off-site storage, and mirroring provide safety, and mirroring in particular also may help improve access for remote users.
Dissemination of ETDs
Though providing access
to ETDs is not the most crucial part of NDLTD activities, supporting dissemination
is an important responsibility. The first aspect of this involves identifying
ETDs. As mentioned, some URN scheme is needed so that each graduating student’s work receives a permanent identifier that ensures persistent access thereafter.
An ETD can be assigned an ISBN (e.g., as is done by UMI, which considers that
it is thus publishing a book) or a DOI (i.e., digital object identifier, often
given by publishers). URNs like PURLs or handles can be used (and, indeed,
DOIs build upon handle technology). If a university participates in the Open
Archives Initiative, then each work is assigned a unique identifier in that
archive, and the archive in turn has a unique identifier in the OAI registry.
A second support for dissemination
is to have a metadata record for each ETD, which carries any classification
and cataloging data available. Whether some type of MARC-based scheme or Dublin
Core form is used, some standard interchange mechanism, like MARC transport
format or XML, also is required. When possible, the metadata should follow
standards developed by NDLTD to support global resource discovery. Typically,
in addition to title and abstract, there should be author-assigned keywords,
entries according to a discipline-specific classification system, and entries
made following more general schemes, such as the Library of Congress Subject Headings, the Dewey Decimal Classification, and UNESCO or UMI categories.
Finally, the metadata records
about ETDs, possibly supplemented with the actual ETD content itself, should
be used to support resource discovery and access. Typical approaches are explained
in the following two subsections.
Databases and information retrieval systems
Managing submission of ETDs and supporting subsequent access can be aided by database management technology. Anthony Atkins at Virginia Tech has developed successive versions of such software, refined it, made it portable, and supports its use by many NDLTD members. It has in turn been adapted for multilingual use in Spain and other countries, and for large projects such as the Australian initiative. At MIT, the Dienst software (Lagoze & Davis 1995) has been adapted instead, while other software has been developed at sites like the University of Montreal in Canada and Humboldt University in Germany.
In addition to managing submission, workflow, and metadata fields with database tools, information retrieval systems are often used to support searching and browsing in ETD collections. In most cases this is done with software already used on campus for other searching efforts or for library automation services.
One generous offering by an NDLTD member, VTLS Inc., is to use its powerful library automation system, Virtua, to support the worldwide initiative free of charge. VTLS is happy to receive either MARC-type or XML-formatted metadata for all ETDs created worldwide, in any language, and to provide a centralized union catalog search service through Virtua. Since the metadata provided should include a unique identifier for each ETD described, this mechanism should provide valuable support for discovering and accessing ETDs.
Searching
Since students learning about ETDs should gain proficiency in searching through digital libraries, it is important that they develop suitable skills. They should understand data and metadata, and be able to work with metadata records that eventually lead to ETDs. In particular, they should understand the 15 elements of the Dublin Core (Dublin Core Community 1999) and how searches can be built using one or more of them. They should understand classification and categorization schemes, how to browse through thesauri, how to narrow or broaden a search, how to navigate to related concepts, how to combine elements of a description, and the principles behind set-based and ranking-based retrieval systems. They should understand full-text searching as well as content-based multimedia retrieval (e.g., of images, sounds, or videos). Further, they should feel comfortable with varying styles of interfaces, with searching by queries or examples, with schemes involving relevance feedback, and with information summarization and visualization mechanisms aimed at enhancing their ability to find relevant information. For all this to be possible, universities and others supporting NDLTD should provide powerful services and ensure that students gain the requisite skills for effective functioning in the Information Age.
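As a toy illustration of fielded searching and ranking over Dublin Core records, here is a minimal sketch in Python (the records and the scoring are illustrative; production systems use inverted indexes and far more sophisticated ranking):

    # Toy ranked search over Dublin Core metadata records (illustrative data).
    RECORDS = [
        {"title": "Digital library interoperability", "subject": "libraries"},
        {"title": "Virtual workplaces in education", "subject": "education"},
        {"title": "Library automation services", "subject": "libraries"},
    ]

    def search(query, records=RECORDS):
        """Rank records by how many query words appear in any element."""
        words = query.lower().split()
        scored = []
        for record in records:
            text = " ".join(record.values()).lower()
            score = sum(word in text for word in words)
            if score:
                scored.append((score, record))
        return [rec for _, rec in sorted(scored, key=lambda pair: -pair[0])]

    for hit in search("digital libraries"):
        print(hit["title"])   # best match first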