April 15, 2003, Volume 7, Number 2
Digitizing, Archiving, and Preserving Japanese Cultural Heritage
Activities and Initiatives at the National Diet Library
One of the primary duties of the National Diet Library (NDL) of Japan is to collect and preserve Japanese publications as the nation's cultural and intellectual assets. For this purpose NDL depends greatly on the legal deposit system for its collection of materials. Moreover, the NDL collects, through purchase and donation, books published before the legal deposit system came into existence, as well as older materials and foreign reference and academic publications. In addition to those traditional activities, the NDL needs to take care of new materials that have been increasingly created and disseminated in digital form. "The National Diet Library Electronic Library Concept," promulgated in fiscal year 1998, defines the digital library as "the provision by a library of primary information (actual materials) and secondary information (information about the materials) electronically, via communications networks, together with the infrastructure for this purpose." Since this concept was established, NDL has prepared to create its own digital library. As primary information, the library is already providing the Full-Text Database System for the Minutes of the Diet in cooperation with the House of Representatives and the House of Councillors, as well as the Rare Books Image Database and the online exhibitions called NDL Gallery created by digitizing our collections. As secondary information, bibliographic data for Japanese and Western books has been provided via the NDL-OPAC. In the autumn of 2002 we offered several new services to the public.
Digital Library from the Meiji Era
Some 140 years ago, East met West. As a result of the encounter, quite a number of cultural assets were produced that had a great impact on building modern Japanese society. NDL has approximately 102,000 titles and 169,000 volumes of books published in the Meiji era (1868-1912), the period of the westernization of Japan. Since these books are fragile, we converted them to microfiche for public use starting in 1993. Access to those materials was limited to people who were able to come to the library to use the microfiche.
In recent years NDL has been harnessing information and communication technologies to offer its digital library as a new service. One of the pillars of this service is to digitize the NDL collections and provide public access to them. As of October 2002, we have supplied digital images of our Meiji collections whose copyrights have expired under the title Digital Library from the Meiji Era.
The contents of the collection range from philosophy, history, and social sciences to art and literature. So far we have reached a greater audience than expected, people who had been very interested in seeing the materials, but who had never been able to come in person. We have also enjoyed good responses from people abroad, who say that this access will contribute to Japanese studies on a large scale. There have been around 760,000 hits in the four months since the system was implemented.
The texts and illustrations of the books are put into a digital image format, in both GIF and our own high-compression format (LINDRA), for convenient use. Using a plug-in customized for this system as an NDL viewer, users can freely navigate through the images, change the size from 25% to 300%, and print on paper at exactly the right size. In addition, the system offers efficient, detailed searches with features like searchable tables of contents and bibliographic records, as well as a function to bookmark texts.
As of now, around 20,000 titles and 30,000 volumes are available for access via the Internet. The files come to about 350 GB in size. We are planning to add another 10,000 titles and 15,000 volumes in the coming months. By the end of fiscal year 2004 most of our Meiji collections will be available to the public through the Internet.
One of the most difficult challenges in building this database system is clearing copyright. Although the system is able to manage the copyrights page by page, we have been able to identify only about one-third of the copyright holders for 169,000 volumes. Thus we have begun to ask the public through our Web site to help us get in touch with copyright holders we have not yet located. If we cannot find copyright holders and obtain their permission, in the end we will need to apply to the Director-General of the Agency for Cultural Affairs for permission in order to clear the copyrights of those books. We will also need some fine-tuning based on feedback to keep the system up-to-date and easy to use.
As Japan's only depository library, NDL has been collecting publications in Japan, including maps, phonographic discs, and microfilms, with the help of the legal deposit system mandated by the National Diet Library Law. CD-ROMs and other "packaged" electronic publications became subject to the legal deposit system in the autumn of 2000. As for digital information on telecommunications networks, in March 2002 the Librarian of the National Diet Library asked the Legal Deposit System Council, an advisory panel of outside experts, to consider whether "networked digital publications" could be put into the legal deposit system, and, if not, what kind of legal framework would make it possible for the NDL to collect online information.
Until the Legal Deposit System Council comes to a conclusion, the NDL will implement experimental projects for acquiring and storing online information by contract, as well as for the navigation of databases on the Internet. These projects have been planned as a part of the NDL’s Digital Library Project.
One of the projects is WARP (Web Archiving Project). Since much of the information on the Web is updated or deleted daily worldwide, the NDL is collecting and preserving information from the Web sites of various organizations that have agreed to participate in the project. WARP will also allow us to collect and preserve digital editions of periodicals and born-digital periodicals on the Internet. The results of this project will be submitted to the Legal Deposit System Council for reference as it considers a possible legal framework that would allow the collection of domestic networked information. We have already collected over 460 titles of online periodicals and a dozen Web sites. Although we are now taking a selective approach, we are looking for ways to collect in bulk and are investigating a couple of projects overseas.
The second project is Dnavi, the NDL Database Navigation Service. Until now, NDL has offered its wealth of library resources through a number of research and reference services. It is now crucial that we make use of the digital information resources on the Internet. While we are exploring the best systems and technologies for Web archiving with WARP, the databases still cannot be archived because they are in the so-called Deep Web.
The wealth of databases on the Internet provides indispensable information resources for academic research and other forms of study and surveys. For these databases Dnavi creates such records as title, creator, category, and content. Users can access the NDL Web site and be linked to them. Dnavi, which just started in November 2002, is a portal that has recorded a large amount of information from Web sites in Japan and that helps users to navigate a variety of databases. It already contains more than 5,000 databases.
Although the importance of preserving digital information has been recognized in intellectual communities worldwide, and so many projects and studies have been aimed at preservation in recent years, we must admit that few in Japan recognize that digital preservation is crucial for future generations. Thus few projects have been implemented especially for born-digital materials.
As already mentioned, we have been focusing on digitizing the printed materials in our collections to provide access to them as one of our services to the public via the Internet, not for long-term preservation. This point of view seems to be the same for other organizations, institutions, and businesses in Japan. We know that digitizing rare books or images is an important part of the preservation of our heritage but also recognize that it is not enough for this day and age.
Given this situation, NDL has begun research and study for long-term preservation of digital information to make the public aware of its importance. We are going to establish a group to discuss issues in this field and improve our skills, technologies, and collaborations in conjunction with the communities concerned.
Fiscal year 2002 is the first year of a three-year term for research and study on the preservation of digital information in the NDL. The main purpose of this project is to set up comprehensive guidelines to fix our long-term strategy.
The guidelines should include the following policies:
By setting up our own guidelines, we will be able to handle increasing amounts of digital information both in physical media and networked information under an established policy. In addition, announcing our guidelines will help to increase awareness of the importance of preserving digital information in our society.
We plan to apply the following timeline:
Fiscal year 2002
Fiscal year 2003
Fiscal year 2004
All the projects we have mentioned have just started. As the saying goes, "This is just the prelude."
The Paradigma Project
Digital documents of all kinds are disappearing daily, and with them the opportunity for new generations of readers to study and enjoy today's documents in the future. The preservation of our digital cultural heritage is an increasingly important and challenging issue. In response to the situation, about fifteen European countries have started some type of Web archiving activity.
Different countries have chosen different collection strategies: Denmark and Australia have taken the selective approach; Sweden, Iceland, and Finland have harvested their entire national Web spaces; and the National Library of the Netherlands has made an agreement with the Dutch Publishers' Association (NUV) for the deposit of electronic publications offline and online. Only five of the countries that are involved in Web archiving can base their work on legal deposit legislation, and Norway is one of them.
Background on Legal Deposit in Norway
Legal deposit has a long tradition in Norway. The first Legal Deposit Act for Denmark/Norway was passed in 1697, and censorship undoubtedly played an important role in its establishment. The law remained in force until the Union with Denmark was dissolved in 1814. A royal decree on legal deposit was passed in 1815, followed by a new Legal Deposit Act in 1882. This again was succeeded by the Legal Deposit Act of 9 June 1939. The common denominator of all these acts was that they included printed material only. However, as new media developed, the need to pass a new and extended law became more and more evident. The present Legal Deposit Act was thus passed on 9 June 1989, and, of course, the main intent of this law was no longer censorship, but cultural preservation.
The National Library of Norway's current Web archiving work is strongly influenced by the Norwegian Legal Deposit Act. The purpose of this act is to
Considered extremely modern when it was passed in 1989, the act covers all generally available Norwegian documents stored in any medium, including paper, microforms, photographs, combined documents, sound recordings, films, video, electronic publications, and broadcast programs. It also covers documents published abroad for Norwegian publishers and those specially adapted for a Norwegian public.
The act does not cover documents found in closed networks, computer software, documents accessible only through a company or organization's intranet, net communications (i.e., e-mail or closed discussion and chat groups of a private nature), archival material covered by other legislation, or official governmental publications.
Chapter 9 of the act's regulations (§ 30, second subsection) states:
We can easily see that the act and its regulations were written before the World Wide Web arrived. Filling the request for two copies of each generally available Norwegian Web document is simply impossible. Today, the National Library is investigating the most effective ways to fulfill the intent of the act as applied to digital documents and is considering the possibility of using a combination of different collection approaches.
The Paradigma Project began in August 2001. Its goals are to develop and establish routines for the selection, collection, description, identification, and storage of all types of digital documents and to give users access to these publications in compliance with the Legal Deposit Act. The project is scheduled to end on December 31, 2004.
Paradigma's activities fall within the bibliographic, technical, and legal areas, as reflected in its eight work packages:
The project continues the National Library's earlier work in several of these areas, and activities from several of the work packages are under way or completed. The following sections highlight the work connected to the legal deposit of Web materials.
Aspects of the Collection Strategy
Based on recommendations from the Paradigma Project, and with the Ministry of Culture and Church Affairs' approval, the National Library has decided to start the general harvesting of all generally available digital documents from the Norwegian Web space (".no"). In time, documents found on domains such as .com, .org, and .net will also be harvested.
There are several reasons for taking this general harvesting approach. First, we cannot predict which documents will be of value in future research and documentation. Second, digital storage is becoming cheaper every day. Third, unfiltered harvesting saves us from resource-consuming manual selection at harvesting time. Finally, a Web Archive user can find documents via free-text search functions, thus being able to review all documents, including those that do not qualify for manual cataloging. Selection criteria for any use, such as further bibliographic description, can be challenged and changed at any time. This would, of course, be impossible if the material were excluded at harvesting time.
Total harvesting of the Norwegian Web space does not exclude the library's use of other collection strategies as well. The Legal Deposit Division carries out event-based collecting. It has collected, for example, the Web sites belonging to political parties prior to, during, and after elections. This type of capture activity will continue to supplement future routine harvesting rounds. A selection of Web documents is currently harvested semi-manually using the HTTrack software, and these are cataloged for the National Library's catalog (BIBSYS). This activity will continue until the Paradigma Project's general harvesting activity and related procedures are fully established.
In many cases other methods must be used to collect digital documents. The Legal Deposit Division has already contacted Norwegian publishers about the deposit of e-books, and the library's Sound and Image Archive is working with the Norwegian Broadcasting Corporation on solutions for the deposit of "born-digital" radio and television programs. However, a large amount of administrative, legal, and technical work remains, and the deposit of dynamic publications (e.g., Web newspapers and electronic materials of all types that are stored in databases) is especially challenging. The Paradigma Project will address these problems as the project continues.
Today the National Library of Norway registers different types of material in various ways. Ephemeral material is given an abbreviated cataloging treatment, while books and serials are given a full bibliographic description, both in the library's catalog and in the National Bibliography.
The Paradigma Project estimates that less than 1% of the material collected from the Norwegian Web space may be subject to individual manual treatment or registration at some level. After surveying selection criteria used in other countries and in the National Library's own divisions, the project suggested selection criteria and harvesting frequencies for new types of electronic publications that are based on content (genre). We also suggested a typology based on Shepherd and Watters's work and have used three main types of digital documents: traditional, i.e., similar to printed documents (monographs, periodicals, reference works, etc.); transient, i.e., based on traditional forms but extended with new functionality (net newspapers, Internet novels, etc.); and new, i.e., previously nonexistent, such as blogs and Web portals.
Automatic Processing and Analysis
We are currently investigating the use of automatic analysis and extraction of information (metadata) from Web documents. Such analysis can be used to generate "weighted" hit lists, thus helping librarians to select documents for manual registration. The technology is not yet good enough to determine a document's type automatically, but it can help to reduce the number of documents that require human intervention. For documents that are not evaluated manually, properties of a document type that are automatically captured can be made available for structured searching in the Web Archive. The value of these properties will be limited but, in combination with other search criteria, may indeed prove useful.
Metadata and Unique Identification
The Paradigma Project is also surveying metadata standards for the description of digital documents and for the exchange of bibliographic data. These recommendations may form the basis for a service to publishers and other interested parties, allowing them to generate metadata descriptions for their digital documents before legal-deposit delivery.
The library must be able to handle a huge number of small data objects automatically, and it will need to identify each component (text file, picture file, sound file) in a single Web document. We are currently surveying standards for identification, and we will suggest how to improve the library's existing identifier allocation service. One enhancement would be the ability to handle chronological versions of a Web document.
The exact size of the Norwegian Internet domain is unknown at this time. The first harvesting round, in December 2002, resulted in some 3.1 million URLs, of which approximately 53% were images (.jpg, .gif, .png). The NEDLIB-harvester started with about 1,000 initial URLs, and harvesting was limited to the HTTP protocol, to the Norwegian national domain (".no"), and to URLs without a search query attached.
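The three harvesting limits described above (HTTP only, the ".no" domain only, and no search query) can be expressed as a simple URL filter. The following is a minimal sketch in Python, not the NEDLIB harvester's actual code; the function names and the sample URLs (which use the placeholder host example.no) are illustrative assumptions.

```python
from urllib.parse import urlsplit

# Image suffixes named in the text; any other suffix is treated as non-image.
IMAGE_SUFFIXES = (".jpg", ".gif", ".png")


def in_scope(url: str) -> bool:
    """Return True if a URL satisfies the three limits of the first round."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return (
        parts.scheme.lower() == "http"  # HTTP protocol only
        and host.endswith(".no")        # Norwegian national domain only
        and parts.query == ""           # no search query attached
    )


def is_image(url: str) -> bool:
    """Return True if the URL path ends in one of the image suffixes."""
    return urlsplit(url).path.lower().endswith(IMAGE_SUFFIXES)


# Hypothetical sample URLs, for illustration only.
candidates = [
    "http://www.example.no/index.html",
    "http://www.example.no/search?q=ibsen",
    "https://www.example.no/index.html",
    "http://www.example.com/index.html",
    "http://www.example.no/logo.gif",
]
harvestable = [u for u in candidates if in_scope(u)]
```

In this sample only the first and last URLs pass the filter. Relaxing the rules for later rounds, as discussed below, would mean widening the domain test and dropping the query test.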
Assuming a distribution similar to that found in Sweden and Finland, we expect to find 45% to 55% of the Norwegian Internet sites in domains outside .no. We expect future rounds to span roughly ten million URLs, especially when we include Norwegian sites in domains like .org, .net, and .com, as well as URLs with search queries.
The first harvesting round retrieved files requiring 140GB of space in the National Library's Long-Term Preservation Repository. File sizes will probably grow in the future. The space requirement estimates for the Norwegian Web space are based on an average of 100KB per URL. We expect the first complete harvesting round to be approximately 10 million URLs, thus filling around 1TB.
One terabyte represents roughly 1% of the total capacity of the Long-Term Preservation Repository. We expect that less than 10% of the storage capacity will be used by the Web Archive, even if both the number of objects and their average size grow drastically in the future.
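The arithmetic behind these storage estimates can be checked in a few lines. The sketch below is illustrative only. It assumes decimal units (1KB = 1,000 bytes, 1TB = 10^12 bytes), which the text does not specify, and all variable names are invented for the example.

```python
# Decimal units are an assumption; the article does not say which it uses.
KB = 1_000
GB = 1_000_000_000
TB = 1_000_000_000_000

# Planning figures from the text: 10 million URLs at 100KB each.
planned_urls = 10_000_000
planned_avg_bytes = 100 * KB
full_round_bytes = planned_urls * planned_avg_bytes  # 10**12 bytes = 1TB

# Observed first round: 140GB for about 3.1 million URLs.
first_round_avg_bytes = 140 * GB / 3_100_000  # about 45KB per URL

# If 1TB is roughly 1% of the repository, total capacity is about 100TB.
implied_capacity_bytes = full_round_bytes * 100

print(full_round_bytes / TB)        # size of a full round, in TB
print(first_round_avg_bytes / KB)   # observed average per URL, in KB
print(implied_capacity_bytes / TB)  # implied repository capacity, in TB
```

Under these assumptions the observed first-round average (about 45KB per URL) is less than half the 100KB planning figure, so the 1TB estimate leaves room for file sizes to grow.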
Issues of Access Strategy
Legal Deposit Act
Section 1 of the Legal Deposit Act restricts access to source material for purposes of "research and documentation." These terms are not defined in the act itself, so the underlying intent of the act must be studied in a bill from 1988-89. Loosely translated, that document says:
Using this document as a guide, the National Library interprets research to mean investigation or inquiry at a certain scholarly or scientific level and documentation to be investigation or study without the same status as research in the traditional meaning of the word, but based on a systematic use of source material.
The general public has never been defined as a user of the traditional legal-deposit materials, but because public libraries generally do not maintain collections of previously published digital documents in the same way they maintain collections of traditional material, the Paradigma Project has recommended that a larger user group be given access to the deposited digital collection in the future.
The National Library strives to be the nation's foremost source of knowledge about Norway, Norwegians, and Norwegian conditions at home and abroad. We should consider whether Norwegian legislation permits a researcher on the other side of the globe to gain access to the Web Archive. Such access is technically possible, but it must be considered from a legal perspective. The Copyright Act regulates a copyright owner's intellectual and economic rights. We are painfully aware of the "conflict" between the Norwegian Legal Deposit Act (saying that digital source materials must be made available for research and documentation) and the Copyright Act (strictly limiting user access to digital documents). This concern is especially relevant to digital documents available via networks.
The conflict is understandable, considering that many digital documents are associated with commercial interests. A single electronic item on the loose can quickly be distributed all over the globe, possibly resulting in economic loss for the copyright owner. Digital documents can easily be misused (copied, manipulated, etc.). For that reason the National Library can give access to the Web Archive only to users defined in the Legal Deposit Act and then only from a PC designated for such use on the library's premises.
Norway is bound by several international copyright conventions. The recently passed Common Market Directive 2001/29/EC (22 May 2001) on the harmonization of copyright law is scheduled to be implemented legally in Norway this year. We are watching this process closely, as it can influence the way in which the National Library allows access to its Web Archive.
The purpose of the Personal Data Act is to protect persons from violations of their right to privacy through the processing of personal data. The National Library must process the digital documents that have been collected from the Norwegian Web space. Because many of these documents may contain personal data, the library received permission from the Data Inspectorate before initiating the first harvesting round. We are now authorized to collect and store Web material in 2003, but before giving access to the collection, we must secure permanent permission to do so.
For user access to the Web Archive, the Paradigma Project selected the Access Module developed by the Nordic Web Archive Project (NWA). The five Nordic national libraries have now embarked on the next project, NWA-II, in which this software will be further developed. We plan to adapt the NWA Access Module to accommodate several special-user functions, including tailored interfaces for catalogers, program operators, and library patrons. This user interface will show a timeline enabling users to select different versions of the same document as captured on specific dates.
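One way to think about a timeline of captured versions is as a lookup that returns the latest capture made on or before a date the user picks. The sketch below illustrates that idea in Python. It is not based on the NWA Access Module's code, and the function name and capture dates are hypothetical.

```python
from bisect import bisect_right
from datetime import date


def version_as_of(capture_dates, wanted):
    """Return the latest capture date on or before `wanted`, or None."""
    ordered = sorted(capture_dates)
    # bisect_right gives the count of captures dated on or before `wanted`.
    index = bisect_right(ordered, wanted)
    return ordered[index - 1] if index else None


# Hypothetical capture dates for a single Web document.
captures = [date(2002, 12, 10), date(2003, 3, 1), date(2003, 6, 15)]
chosen = version_as_of(captures, date(2003, 4, 1))
```

Here `chosen` is the capture of 1 March 2003, and a request dated before the first capture returns `None`.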
The NWA Access Module may play an important part in the collaboration between the Internet Archive and several national libraries in their combined efforts to develop software in the projected National Library Web Archive Consortium.
Our Digital Cultural Heritage
The Paradigma Project's work will be finished in two years. We hope that by then the National Library of Norway will have the technology, methods, and organization necessary to enforce the Legal Deposit Act, including for the many documents that are born digital.
CAMiLEON: Emulation and BBC Domesday
In December 2002 a group convened at the University of Leeds to demonstrate and discuss CAMiLEON's work in preserving the BBC Domesday project (a social record of UK life in the 1980s), which is now in danger of being lost through technological obsolescence. BBC Domesday was created to celebrate the 900th anniversary of the Domesday book of 1086, the original record of William the Conqueror's survey of England.
The meeting brought together some of the original BBC Domesday videodisc developers, including Peter Armstrong from the BBC, Ecodisc's Roger Moore (who brought an original glass master of one of the Domesday discs), two of the editors, and some individuals who had contributed to the content as schoolchildren; experts on digital preservation; and others interested in developing modern interfaces to the original Domesday data, as well as some nostalgic computing enthusiasts.
There were three speakers. Armstrong, as the chairman of BBC Domesday, had been heavily involved in its production. He presented several interesting anecdotes and background material about the highs and lows of the making of Domesday. Then Dr. Tom Graham, from the Consortium of University Research Libraries (CURL), explained that digital preservation is our duty to future generations for both historical and technological reasons. Digital preservation needs to be understood by people at all levels, from data creators to end users, whether national, institutional, or individual. The final speaker, David Holdsworth, from CAMiLEON, has worked in Information Technology since the mid-sixties and is now an expert in digital preservation and storage. He described the choice of Domesday as a test case to demonstrate the many problems in digital preservation and how CAMiLEON has developed strategies to solve these.
There followed demonstrations of the BBC Domesday system running on the original hardware and also of CAMiLEON's modern emulation that provides an accurate reproduction of almost all the original functionality.
A Brief History
Armstrong, who had established the BBC's Interactive Unit to make educational multimedia, wondered if it would be possible to celebrate the 900th anniversary of the original Domesday book by producing a modern-day equivalent. It was an ambitious idea, but it captured the imagination. The plan was to give the first copy to Prince William, the "poetic successor to William the Conqueror."
Funding for the project—an estimated £2 million—was relatively easy to obtain. Multimedia was an exciting, upcoming technology, and people involved in education and national archiving, as well as computing, were keen to push it forward. The BBC put together a team of around sixty staff to develop the project and recruited pupils from over half the schools in the country to help produce the content. In all, around a million children were involved from 14,000 schools.
The map of the UK was divided into blocks, each measuring 4 x 3 km (it is no coincidence that this is the ratio of a television screen), and each block was adopted by a school. As the UK consists of over 25,000 blocks, it was practical to cover only about half of these. It was difficult to find schools in the more remote areas of Scotland and Wales, but the majority of England was well accounted for. Pupils investigated the land use; counted the number of doctors, post offices, and so on; and wrote articles about the people and buildings in their blocks. Each area was allotted twenty screens of BBC text and three photographs.
Back in the classroom, the school computers, many of them BBC Micros, were used for data entry. The articles were sent on floppy discs to the BBC. The text was left unedited: any spelling mistakes or typing errors remained in the final print. The only alterations were prompted by the lawyers. They found that some descriptions of local characters "could cause us some problems."
Hardware and Software
Developing the hardware took two years. BBC's Interactive Unit approached Philips, the only manufacturer of videodisc players in Europe, to produce the laserdisc player. This was actually a SCSI device—the original SCSI specification had only just been confirmed—which meant a SCSI interface had to be developed for the BBC Master. The player looked like a large, slow hard disc to the computer. The BBC Master had a special read-only version of its Disc Filing System called VFS (Videodisc Filing System) and could be controlled using similar commands.
The laser videodisc player produced PAL video, and the BBC Master also produced a PAL-like video signal. The player carried a genlock and video mixing board to combine the computer and disc pictures. Other hardware was developed for the BBC Master, including a coprocessor and a trackerball. The trackerball featured three buttons, although the BBC Domesday's graphical interface made use of only two.
Logica wrote the software using BCPL, a forerunner of C. In total, over 70,000 lines of custom code were written.
"If we'd known the problems involved, we would never have attempted it," said one of the staff. The project was completed on time and on budget, thanks to the remarkable work of the team. Over 24,000 maps and 200,000 photos were processed. Remember that there were no digital copies to work from—the paper originals of the maps were quite literally "cut and pasted" together. Each map and photo was captured as a single frame of continuous videotape. These then had to be captioned and have their copyright cleared. As well, over 8,000 data sets (traffic congestion, radiation levels, etc.) were stored.
The size of the Domesday project was overwhelming. The budget of £2 million sounds like a lot, but the real cost must have been far, far more than that when the dedicated work of all the schoolchildren and volunteers is considered. It has been estimated that if you worked a forty-hour week viewing Domesday, it would take seven years to see all the information. One source calculated that it would have cost a quarter-million pounds for institutions to access that amount of data, which made the price tag of the Domesday system sound like a bargain.
When the plans for the project were announced, the estimated price was £1,100, but when Domesday came to market, it had increased to over £4,000. As this was too expensive for most libraries and schools, Domesday became a commercial flop. The first set of discs was presented to the keeper of records at the Public Record Office, to be placed alongside the original Domesday book.
Life went on. The BBC Interactive Unit developed a few other ideas but eventually folded when the director general decided there was no future in multimedia. Armstrong and a group of colleagues bought out the department and set up the MultiMedia Corporation. Reworkings of the Domesday ideas appeared in other forms: the 3D World Atlas (Domesday on a global scale) sold over a million copies; Oneworld.net features Another Domesday, which focuses on global justice issues and is one of Kofi Annan's favorite Web sites. BBC Domesday became an icon, the granddaddy of interactive multimedia. And then it became obsolete.
How to Preserve a Time Machine
The CAMiLEON project (Creative Archiving at Michigan and Leeds Emulating the Old on the New) has spent three years developing strategies for digital preservation and testing them with materials such as the BBC Domesday system. The BBC Domesday project encapsulated many difficult problems encountered by those working in the field: a huge amount of multimedia data, technological complexities, and the intellectual property rights (IPR) issues.
There are several aspects to preserving BBC Domesday. First is the decay of the media—discs get scratched during use and become less reliable. The hardware to read the discs is rare, and the few remaining laserdisc players are prone to break down (and require very specialized repair). All the hardware is long past its shelf life. BBC computers have always been durable, but not many were produced with the special Domesday extras. The Domesday system also has a particular look and feel that requires preservation in addition to the actual content.
Rescuing the Resource
CAMiLEON obtained access to a semi-working Domesday system donated by the School of Geography at the University of Leeds. One of the first tasks of preservation was to transfer the data files from the twelve-inch laserdiscs to modern hardware, storing the bytestreams in a media-neutral form. A Linux PC could be connected to the laserdisc player using a SCSI cable, allowing the PC to read the text articles and database. Images, including still-frame video, were transferred to a PC using a standard video frame-grabber card at maximum resolution. These images were stored in an uncompressed format to avoid quality loss or the introduction of artifacts (as can occur with JPEG compression). In total, around 70GB of image data was transferred per side of each laserdisc.
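As a minimal sketch of what storing a bytestream in a media-neutral form can look like in practice, the Python below writes a block of bytes to an ordinary file unchanged and records its size and SHA-256 digest so that later copies can be checked against the original. The checksum manifest, the function names, and the file name `side_a.bin` are assumptions made for illustration; the article does not describe how CAMiLEON recorded or verified its transfers.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def archive_bytestream(data: bytes, name: str, archive_dir: Path) -> dict:
    """Write the bytes unchanged and save a small manifest describing them."""
    (archive_dir / name).write_bytes(data)
    entry = {
        "name": name,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    manifest_path = archive_dir / (name + ".manifest.json")
    manifest_path.write_text(json.dumps(entry, indent=2))
    return entry


def verify_bytestream(name: str, archive_dir: Path) -> bool:
    """Re-read the stored bytes and compare them with the saved manifest."""
    entry = json.loads((archive_dir / (name + ".manifest.json")).read_text())
    data = (archive_dir / name).read_bytes()
    return (
        len(data) == entry["size"]
        and hashlib.sha256(data).hexdigest() == entry["sha256"]
    )


# Stand-in data: a real transfer would supply bytes read from the laserdisc.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp)
    sample = bytes(range(256)) * 4
    entry = archive_bytestream(sample, "side_a.bin", archive)
    ok = verify_bytestream("side_a.bin", archive)
```

Because the file holds nothing but the original bytes, it can be copied to any future storage medium and checked again with the same manifest.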
The next step was to develop software that emulates the adapted BBC Master computer and the laserdisc player on which the original BBC Domesday system ran. An open-source emulator—BeebEm—was used as the starting point for this software. Emulation of the specific Domesday system hardware had to be incorporated by CAMiLEON, which included the coprocessor, SCSI communication, and the many functions of the laserdisc player.
CAMiLEON's philosophy is to preserve the data in its original, unmodified format (i.e., as the original abstract bytestream, though not necessarily on the same physical medium). Software can then be written to use this data: perhaps an emulation of the original system, perhaps a tool that reformats it into a modern format, or perhaps software that provides a new interface to the data. This view builds on the ideas of the CEDARS project. For BBC Domesday, CAMiLEON developed an emulation of the original system in which knowledge of how the original system worked is encapsulated. The emulation software, together with the abstracted data, provides a record of the original BBC Domesday system. A "black box" emulation of the laserdisc player was written to allow the emulated BBC Master to access the data recovered from the original laserdiscs.
To avoid the problem of emulation software becoming obsolete, it was important to ensure that the software was not chained to any specific operating system or machine architecture. Careful development with a clear focus on the goal of longevity will make it easier to run this software on a future (as yet unknown) computer, needing only a few simple (and well documented) modifications. This also means that it should be possible to port the emulation software to any current machine.
Currently the CAMiLEON BBC Domesday emulator runs only on Windows because, owing to time constraints, a Windows-based emulator, BeebEm, was used as the starting point. Because BeebEm was not written to follow guidelines for software longevity, the Domesday emulator is tied to the Windows platform. The CAMiLEON team is seeking a small amount of funding to complete the software-longevity work and prepare the emulator for archiving.
Distribution and Copyright
Sadly, it is unlikely that Domesday will become available to the general public unless the IPR problems can be solved. The contents of the discs are heavily tied up in copyright: parts are owned by the BBC, the Ordnance Survey, and possibly the Local Education Authorities and schools. However, it may be possible to give owners of original BBC Domesday laserdiscs access to the preserved data and to make the emulator software publicly available. This would allow access in library reading rooms and some schools, for example. CAMiLEON is interested in examining and solving Domesday's IPR issues. Andrew Charlesworth discusses the issues in detail in "Legal Issues Arising from the Work Aiming to Preserve Elements of the Interactive Multimedia Work Entitled 'The BBC Domesday Project'".
The recent auction of a BBC Domesday system on eBay is evidence of the revived interest in the project. There are also a couple of people working to produce modern interfaces to the original Domesday data—they met each other for the first time at the CAMiLEON meeting.
CAMiLEON is keen to hear from any participants who worked on, or contributed information toward, BBC Domesday and would be interested in their views on the issue of making it available to the public. Please contact Paul Wheatley if you can help.
Highlighted Web Site
Squeezing More Life Out of Bitonal Files: A Study of Black and White. Part II.
Your editor's interview in the December 2002 RLG DigiNews states that JPEG 2000 can save space and replace the multitude of file formats used for conversion and display of cultural heritage images but that it isn't suitable for bitonal material. We have lots of bitonal images. Is there anything similar available for them?
In part I of this FAQ we examined the rationale for bitonal scanning going back to 1990 and reaffirmed its continuing relevance for digital capture of certain types of cultural heritage materials. We also considered the potential advantages and disadvantages of migrating collections away from the popular but aging TIFF G4 bitonal imaging standard. Here in part II, we'll take a first look at some of the alternative bitonal file formats and compression schemes. Part III, to appear in the June 2003 issue of RLG DigiNews, will compare the quality and performance of some specific products on a range of document content, including text, halftones, and complex graphics.
Several image file formats and compression schemes are potential migration targets for existing TIFF G4 files. Here's a rundown of some of the most important options, presented in alphabetical order.
CPC
Overview. Patented in 1991 by Cartesian Products, Inc., CPC is a proprietary compression scheme and image file format for bitonal images. Cartesian Products claims that CPC can compress substantially better than G4 and, though particularly well suited for text, that it outperforms G4 for all kinds of document content, including halftones. Unique amongst the technologies presented here, CPC does not have a lossless mode. Cartesian Products calls its method "nondegrading," meaning that after conversion to CPC, the original file can no longer be restored, but the differences cannot be perceived by the human eye (other vendors use the terms "visually lossless" and "perceptually lossless" for the same concept).
Advantages. CPC is a proven technology that has been adopted for some large bitonal image collections, such as JSTOR, which converted all its online journal holdings to CPC in 1997. There is a list of major CPC users available online (scroll to the bottom of the Web page). Cartesian's claims of nondegraded compression have been verified by user preference tests conducted by ISO. CPC supports single- and multi-page documents. Its viewer is available for all major platforms.
Disadvantages. CPC is proprietary, though Cartesian Products makes available APIs (application programming interfaces) to facilitate development of software using the scheme. Cartesian also claims that it is “working with a number of vendors who will be releasing CPC-enabled products, encompassing a broad range of applications including Internet fax services, document distribution, educational assistance, and electronic libraries.” However, at the moment Cartesian is the sole source of CPC encoders and viewers. CPC's lack of a true lossless mode could be an issue for demanding preservation applications. CPC is only for bitonal images. The format is not Web native and requires the installation of a special viewer if the CPC files are to be used for display purposes. CPC offers limited metadata capacity.
DjVu
Overview. DjVu (pronounced like déjà vu) was developed by AT&T Labs in 1996, with the first publicly released products coming in 1998. DjVu is designed to be a comprehensive, all-in-one document solution, suitable for bitonal text as well as gray scale and color content. DjVu defines a document format and encompasses several different compression schemes. A layering scheme allows documents that combine text and continuous tone content to treat each separately for optimal compression and display. AT&T Labs sold the rights to DjVu to LizardTech, Inc. in 2000. The independent PlanetDjVu Web site is an excellent source of information on all things DjVu.
Advantages. Lossy (claimed visually lossless) and true lossless compression of bitonal images, both claiming considerably better compression than G4. Also handles gray scale and color. Viewer is available for all major platforms. Handles single- and multi-page documents. In December 2001 LizardTech released part of the v3.0 DjVu Reference Library as open source, and others have since enhanced that library.
Disadvantages. DjVu is proprietary, though LizardTech makes available SDKs (software developer kits) to facilitate development of software for encoding and decoding. LizardTech will license the DjVu Reference Library only for noncommercial use. At this time LizardTech is the sole source of commercial DjVu products that adhere to the current standard. Though DjVu clearly has some very enthusiastic supporters, its adoption has been spotty. The DjVu Zone Web site (which has not been updated in over a year) includes an outdated list of current users. Two of the largest users cited, Heritage Microfilm's Historical Newspaper Archive and UMI's Early English Books Online, have abandoned display of DjVu images in favor of PDF and GIF, respectively. DjVu also offers limited metadata capacity. The format is not Web native and requires the installation of a browser plug-in for display purposes.
JBIG2
Overview. Developed by the Joint Bi-level Image Experts Group, JBIG2 is a new compression scheme for bitonal images that became an ISO standard at the end of 2001. It is the only contender mentioned here that is an international standard. According to the introduction of the draft JBIG2 standard, "the design goal for JBIG2 was to allow for lossless compression performance better than that of the existing standards, and to allow for lossy compression at much higher compression ratios than the lossless ratios of the existing standards, with almost no visible degradation of quality." It is only now starting to appear in commercial products.
Advantages. Nonproprietary. Supports both lossy and lossless compression of bitonal images, including a special mode for halftones. Considerably better compression than G4, especially for halftone images. Can theoretically be incorporated into several existing file formats, such as TIFF and PDF.
Disadvantages. JBIG2 is strictly a compression scheme, so it is up to developers to incorporate it into existing file formats. Certain functionality, such as metadata, depends on what file format is used. Some applications can now produce JBIG2-encoded PDFs, but only Acrobat Reader 5.0 and later can decode them, potentially limiting user access. An open source decoder is being worked on but is in the very early stages of development.
PDF
Overview. Adobe's PDF is itself neither an image file format nor an image compression scheme. However, PDF can serve as a container for digital images compressed with several schemes, including G4 and JBIG2. PDF has been around since 1993 and has evolved over the years as the leading format for online distribution of complex documents.
Advantages. PDF supports single- or multi-page documents. For example, although it doesn't reduce their size significantly, individual TIFF G4s can be bundled into multi-page PDF G4s, making them directly accessible to most Web users and automatically gaining all the navigation and display control offered by Acrobat Reader. Though PDF is proprietary, Adobe has maintained it as an open specification, resulting in a substantial level of third-party support. A free viewer is available for all major computing platforms. PDF is well established and is now the subject of a fledgling effort called PDF/A to "develop an international standard that defines the use of the Portable Document Format (PDF) for archiving and preserving documents." See also "Archiving and Preserving PDF Files," by John Mark Ockerbloom, in the February 2001 issue of RLG DigiNews.
Disadvantages. Despite the open specification, PDF is still a proprietary format. Acrobat Reader can decode JBIG2-encoded images, but only since version 5.0. Users who haven't upgraded to version 5 will get an error message if they attempt to read a JBIG2-encoded PDF. There are hundreds of tools for converting to PDF, but they must be evaluated carefully since there is considerable variation in the quality of output and efficiency of display. It needs better metadata capability (PDF/A may address this). The format is not Web native and requires the installation of a special viewer, though Acrobat Reader and the Acrobat browser plug-in are widely deployed.
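The lossless modes described in this rundown share one testable property: decoding must return exactly the bits that went in. The sketch below illustrates that property with a toy run-length coder written in Python. It is not G4, JBIG2, or any product named here, and the scanline is made up for the example; it only shows why long runs of white and black pixels compress well and how a round-trip check confirms that nothing was lost.

```python
from itertools import groupby


def rle_encode(bits):
    """Encode a sequence of 0/1 pixels as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(bits)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original pixels."""
    pixels = []
    for value, length in runs:
        pixels.extend([value] * length)
    return pixels


# Hypothetical scanline: white margin, a short black stroke, more white.
scanline = [0] * 40 + [1] * 6 + [0] * 18

runs = rle_encode(scanline)      # three runs describe all 64 pixels
restored = rle_decode(runs)
is_lossless = restored == scanline
```

A lossy ("visually lossless") scheme would fail this equality test by design, which is why the original scan cannot be recovered from such a file.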
To be continued ...
Are any of these migration targets for existing TIFF G4 images appropriate for your collection? Much will depend on institutional circumstances and priorities and the nature of the documents in question. In part III of this FAQ, to be published in the next issue of RLG DigiNews, we'll look at some specific implementations and reassess migration risk in light of all our findings.
JSTOR still scans to TIFF G4 and considers those files its preservation masters. The CPC files are used to reduce storage requirements for its online collection but are converted to GIF for display and to PDF for printing.
Calendar of Events
Seminar on the Preservation of Web Material
Co-hosted by the University of Kerkira and ERPANET, the seminar intends to provide a detailed analysis of the main initiatives in the area of Web archiving and to discuss the long-term preservation of material obtained from Internet sources.
Electronic Media Group of AIC Meeting
EMG program activities will take place at the American Institute for Conservation's 31st annual meeting. Activities include the workshop "Identification and Care of Videotapes," a special joint session with the Photographic Materials Group, and a presentation of "PLAYBACK," a new interactive DVD on the subject of analog video preservation.
Sound Savings: Preserving Audio Collections
Sound Savings will feature talks by experts in audio preservation on topics ranging from assessing the preservation needs of audio collections to creating, preserving, and making publicly available digitally reformatted audio recordings. The School of Information at the University of Texas at Austin, the Library of Congress Preservation Directorate, the National Recording Preservation Board, and the Association of Research Libraries are co-sponsors.
Digital Preservation Management: Short-Term Solutions to Long-Term Problems
Cornell University Library will offer a new digital preservation training program with funding from the National Endowment for the Humanities. Institutions are encouraged to send a pair of participants to realize the maximum benefit from the managerial and technical tracks that will be incorporated into the program. This limited enrollment workshop has a registration fee of $750 per participant. Registration is now open for the August workshop. A second workshop is scheduled for October 13-17 (registration will open this summer). There will be three workshops in 2004.
Digital Resources for the Humanities Conference
The annual Digital Resources for the Humanities conference will concentrate on the following themes:
Summer Institute at the University of New Brunswick
The course will focus on creating a set of electronic texts and digital images. The program is designed primarily for librarians and archivists who are planning to develop electronic text and imaging projects, for scholars who are creating electronic texts as part of their teaching and research, and for publishers who are looking to move publications to the Web.
Museum Technology and Transformation
The Museum Computer Network annual meeting will focus on ways in which technology influences work, museum programming, and the way we think about museums and cultural heritage.
Task Force on Digital Repository Certification
Library of Congress Announces Approval of Plan to Preserve America's Digital Heritage
Report for the Pilot Project
Joint Digital Repository Project for University and MIT
Research Publishes White Paper on the Economics of Digital Preservation
University Library System Launches Archive on Formation of European Union
and Collaboration in Digital Preservation: an RLG-JISC Symposium
Leading speakers from the USA and Europe described their experiences and future plans. Through presentations, discussion, and breakout groups, participants had opportunities to contrast different approaches, consider which approaches were relevant for their own institution and interests, and further explore opportunities for collaboration in digital preservation across organizational and national boundaries.
This event was the latest in a series of collaborations between JISC and RLG begun in 1996 and resulting in conferences, research projects and publications. Full proceedings from the symposium will be available on the RLG web site in the next week.
RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.
Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus
Please send comments and questions about this or other issues to the RLG DigiNews editors.
Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Erica Olsen; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello; Assistant: Valerie Jacoski.
All links in this issue were confirmed accurate as of April 15, 2003.