Electronic Records Research 1997: Resource Materials

      Compilation Copyright, Archives & Museum Informatics 1998
      Article Copyright, Author

      Session V: Maintaining and Using Records

      Research Issues in Migration Strategies within an Electronic Archive
      Steve Binns, David V. Bowen, Alan Murdock, Jean Samuel, Tony Parsons

      Electronic Records Meeting
      Pittsburgh, PA
      May 29, 1997

      INTRODUCTION

      The primary requirement of an archive of electronic records is to allow users of the future to read and understand in their context and environment records archived in the current context and environment without loss of pedigree or sense. We use "migration" to refer to the process by which electronic records are changed to maintain access and utility. The two major types of change are changes in the computer hardware and/or software used to view or use the record, and changes in the computer hardware and/or software used to store and index the record (the electronic archive). Thus this paper will discuss issues with migration of records and of the archive in which they are kept.

      In order to maintain electronic records, they must be:

      1. identifiable,
      2. accessible, and
      3. legible or usable.

      In all three of these requirements electronic records differ significantly from physical records, and all three require significant research effort to define:

      • the techniques to achieve that requirement,
      • project management paradigms, and
      • cost-benefit equations.

      The key difference is that a physical record can be seen and handled directly by humans. If it is protected from damage, it maintains its form and content and the seeing or handling will be possible at any future time. Physical records are normally labelled by physical labels, and these too maintain their existence over time. The techniques required to read and understand many sorts of physical record persist easily over 10s of years and often persist over 100s of years. Thus I can read a letter which was written to me 40 years ago, and I can read a book which is 150 years old. Indeed, with readily available training, I could read manuscripts which were over 1000 years old.

      Electronic records are stored as patterns of 1s and 0s represented in a magnetic or electrical form. Computer hardware and software are needed to find the record, to identify it, and to present it in a legible or usable form. Computers are changing on a 9 to 18-month cycle, and the changes are so severe that a record acquired two or three years ago may not be legible on a modern computer. The software that created that record may not run on a modern computer. Indeed, the computer on which that record was created may become difficult to keep operational, albeit on a 10-20 year timescale.

      The famous picture is that of the Rosetta stone surrounded by old (over 5 years) floppy discs and magnetic tapes. The Rosetta stone is the only record that can be read. The records on the other devices must be "migrated" to retain identity, accessibility and utility.

      The problem of maintaining electronic records is made more complex by the requirement to migrate the archive in which they are held. Since the electronic archive is a computer, it will become obsolete just as quickly as the systems that created the records. This imposes a second migration requirement and a second migration cycle on an electronic archive: source software and hardware must be migrated, but the archive itself must also be migrated.

      This paper sets out some of the issues and questions raised by the requirement to maintain electronic records. In delivering CEA Phase 1, we have answered some of the questions raised by the profession over recent years and gone on to raise new questions as our operating electronic archive delivered experience. We now have a good idea of the process point where migration alerts need to be raised; it is earlier than often recognised. We have met the issues of balancing user and custodian responsibility for electronic records, and of deciding between centralised system provision and stand-alone, non-standard systems. This work has been reported at the DMM Forum (Brussels, 1996). We will share solutions where we have found them and raise new questions which we hope the archive and software communities will find challenging. We hope our practical experience will expand the profession's current repertoire of questions, which we know you have already found challenging.

      MIGRATION OF RECORDS

      Identifiable

      Since an electronic record may not have a fixed physical form, and if it does, it is not perceptible to a person without the aid of computing hardware and software, the first criterion for maintaining electronic records is that they be identifiable as records. This requires:

      1. a data model to define "record" in the environment under review
          1.1 are appropriate data models well-understood?
          1.2 can the data models be translated into templates and methods to be used in real electronic environments?
          1.3 how robust are data models over time?
          1.4 what are the implications of foreseeing unforeseen changes in the data model?

      2. some metadata to identify the record
          2.1 are present metadata definitions adequate?
          2.2 can they be translated into templates and methods and applied in actual archives?
          2.3 how robust are they over time?
          2.4 how well do the metadata stand up to new (unfamiliar) users?

      3. a system to allow users and custodians to use the metadata to "identify" the record
          3.1 what are the appropriate views of index systems for users? for custodians?
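
      The three requirements above can be made concrete with a small sketch. It is only an illustration, not a description of our archive: the class names, the field names, and the sample record are all hypothetical, and the choice of fields is an assumption rather than a proposed metadata standard. It shows a minimal metadata record, plus an index that offers one view to custodians (exact lookup by a stable identifier) and another to users (search by title or keyword).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RecordMetadata:
    """Hypothetical minimal metadata for one archived record."""
    record_id: str          # identifier intended to stay stable across migrations
    title: str
    creator: str
    created: date
    source_format: str      # the application (and version) that wrote the record
    keywords: tuple = ()


class ArchiveIndex:
    """Toy in-memory index; a real archive would need a persistent store."""

    def __init__(self):
        self._by_id = {}

    def register(self, meta):
        # Refuse duplicates so that an identifier always names one record.
        if meta.record_id in self._by_id:
            raise ValueError(f"duplicate record_id: {meta.record_id}")
        self._by_id[meta.record_id] = meta

    def identify(self, record_id):
        """Custodian view: exact lookup by stable identifier."""
        return self._by_id[record_id]

    def search(self, term):
        """User view: case-insensitive match on title or keywords."""
        term = term.lower()
        return [
            m for m in self._by_id.values()
            if term in m.title.lower()
            or any(term in k.lower() for k in m.keywords)
        ]


# Illustrative sample data only; not a real record.
index = ArchiveIndex()
index.register(RecordMetadata(
    record_id="REC-0001",
    title="Toxicity study summary",
    creator="J. Smith",
    created=date(1996, 3, 14),
    source_format="word-processor document",
    keywords=("toxicity", "biology"),
))
hits = index.search("toxicity")
```

      Even a toy like this exposes questions 1.3 and 2.3: every field chosen today is a guess about what a user in 20 years will need in order to recognise the record.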

      Accessible

      An electronic record will only be useful if it can be accessed by users and custodians. This implies that users (subject to some security model) must be able to find which records exist that are (or perhaps just 'may be') relevant to a query (a concern, a present activity). They must then (given appropriate security levels) be able to obtain a copy of, or a view of, the record. This raises some interesting questions:

      1. Security
          1.1 what security models are suitable for near-term (0-5 yr) use of records?
          1.2 what security models are appropriate for longer (10-50 yr) use of records within the organisation that created them?
          1.3 what security models are appropriate for use of historical records?

      2. Metadata
          2.1 what metadata are required for access?
          2.2 how do the metadata requirements change from near-term, to long-term, to historical use?
          2.3 do custodians need different metadata from users?
          2.4 Where does the divide come between indexing done by the creator/user and that done by the custodian centrally?
          2.5 What education is needed to support the users as we extend their responsibility beyond their paper based experience?

      3. Retrieval aids
          3.1 can the record contents be used as access aids?
          3.2 consider this for documents? for numerical records? for images? for sound?
          3.3 can retrieval aids be designed which survive over time (10 years or more)?

      Legible or Usable

      At some time before the software (or even the combination of hardware and software) that created a record becomes obsolete, it is important to consider how to maintain the record so that it can be read, or used. In this respect documents may be easier to manage than experimental data or images. The choices available to the record owners or custodians include:

      1. accept that the record cannot be maintained and destroy it
          1.1 what are the costs of this to an organisation?
          1.2 what are the costs of this to the historical record?

      2. accept that the record cannot be maintained but keep it (in case some future development allows it to be used again)
          2.1 Has this ever happened?
            [We have two examples of this: 1. Some 1970s biology and toxicity data, garnered from paper, were re-input to a 1990s computer system with new analysis technology. These data thus gave us "future" value. 2. We have scanned old SOPs for scientific work into computer stores. This has allowed "forgotten" (meaning not easily accessible and so not retrieved and used) methods to be either used in toto or updated. This was reuse of knowledge as opposed to data.]
          2.2 Is it likely? Can it be done as a demonstration?

      3. maintain a museum of software and hardware so that the record can be read or used
          3.1 what are the costs of this?
          3.2 can central museums provide a service to industry and to academic groups? [Can the PRO example of financing a commercial company to store data and migrate for them be extended?]
          3.3 Would software companies extend their business to offer archiving facilities?
          3.4 Should purchase and vendor selection criteria in the future focus on long term data (not just product) support?
          3.5 can the skills needed to maintain old hardware and software be identified and preserved over more than one generation?

      4. create a virtual environment to run the old software inside a virtual model of the old hardware running on a modern computer
          4.1 what are the costs of this?
          4.2 can service organisations provide these virtual systems to industry and to academic groups?
          4.3 can the skills needed to create and operate virtual models of old hardware and software be identified and preserved over more than one generation?
          4.4 can these virtual models really duplicate the old system in all respects? (in a validatable way?)

      5. migrate the records to an open standard format
          5.1 what are the costs of this?
          5.2 what is the actual lifetime of standards?
          5.3 do standards maintain the content of the record adequately? does this vary among documents, numerical records, images, multidimensional records?
          5.4 should standard formats be used to represent records in parallel to native formats? instead of the native format?
          5.5 what are the relative merits of 'open standards' (eg ASCII) versus 'commercial standards' (eg Microsoft Word)?
          5.6 is an image of a document record a valid standard format?

      6. migrate the records to a newer version of the software that created them
          6.1 what are the costs of this?
          6.2 how can we demonstrate that the content of the records was maintained?
          6.3 what are the costs and benefits of carrying forward (in parallel, as related but distinct records) an original, a standard, and a 'new version' form of the same original record?
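
      Questions 5.3 and 6.2 ask how we can demonstrate that content survived a migration. One partial approach, sketched below under stated assumptions, is to compare a fingerprint of the textual content before and after the change of format. The function names are hypothetical, and we assume the text has already been extracted from both the original and the migrated record by some other means. The check covers wording only; it says nothing about layout, embedded objects, or numerical precision.

```python
import hashlib
import unicodedata


def content_fingerprint(text):
    """Hash the textual content after normalising presentation details."""
    # Unify equivalent Unicode forms, then collapse all runs of whitespace
    # (including line breaks) so that re-flowed text compares equal.
    normalised = unicodedata.normalize("NFKC", text)
    normalised = " ".join(normalised.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def content_preserved(original_text, migrated_text):
    """True if both texts carry the same words in the same order."""
    return content_fingerprint(original_text) == content_fingerprint(migrated_text)


# Illustrative strings only: the same sentence with different line endings
# and spacing, and a copy in which one figure was corrupted.
original = "Method 12:  dissolve sample\r\nin 10 ml solvent."
migrated = "Method 12: dissolve sample\nin 10 ml solvent."
damaged = "Method 12: dissolve sample\nin 1 ml solvent."

same_after_migration = content_preserved(original, migrated)
same_after_damage = content_preserved(original, damaged)
```

      A byte-for-byte checksum of the files themselves would not serve here, because a change of format changes the bytes even when the content is intact. Deciding what to normalise away is itself a judgement about what counts as the content of the record.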

      COMPLEX RECORDS

      The discussion so far has considered records which are:

      • documents
      • numerical records (spreadsheets, raw data, graphs, ...)
      • images (digitised photographs, raw data, ...)

      However the computerised record world is changing too rapidly for these categories to remain useful. Systems are already creating composite records, in which (for example) a document contains a spreadsheet, some graphs, and some images. "Contains" here can mean either "holds a static representation of" or (more importantly for our discussion of issues) "is linked to a live representation of".

      The second meaning of "contains" is especially important when multidimensional records are considered. For example, a medical image might be a collection of measurements in which each point in 3-dimensional space is represented by two or more intensities at several different times. A scientific measurement might combine a time-varying measurement with three different measurements of chemical properties at each time point. Each chemical property measurement might contain three or four dimensions.

      These multidimensional records cannot be represented by a single image: they can only be represented by software which selects a view. A different user, asking a different question, may need to see a different view. The view contained in a document is not the only view, nor the best view.

      • what multidimensional records exist? what multidimensional records are planned or are coming?
      • how can multidimensional records be archived?
      • how can they be migrated?
      • what does "source software" mean and how can it be managed?
      • what do "accessibility" and "utility" mean for multidimensional records?
      • what is the minimum acceptable representation of a multidimensional record?
      • are the rates of loss of multidimensional records different from those of simple records?
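
      The point that a multidimensional record can only be seen through software which selects a view can be illustrated with a toy example. This is a minimal sketch, not a real medical or scientific format: the record layout (time, depth, row, column) and the function names are assumptions made for illustration, and the values are synthetic so that each one encodes its own position.

```python
def make_record(nx, ny, nz, nt):
    """Build a toy 4-D record indexed as data[t][z][y][x]."""
    return [[[[t * 1000 + z * 100 + y * 10 + x
               for x in range(nx)]
              for y in range(ny)]
             for z in range(nz)]
            for t in range(nt)]


def slice_view(data, t, z):
    """One 2-D image: a fixed time and a fixed depth."""
    return data[t][z]


def time_series_view(data, x, y, z):
    """One 1-D trace: a fixed point in space followed over time."""
    return [frame[z][y][x] for frame in data]


record = make_record(nx=3, ny=2, nz=2, nt=4)

# Two users asking different questions of the same record.
image = slice_view(record, t=1, z=0)
trace = time_series_view(record, x=2, y=1, z=1)
```

      Neither view is the record. Archiving only the image that happened to be pasted into a report would discard the trace, and every other view, for good.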

      A database is another sort of complex record.

      • What does it mean to archive a database?
      • How often should it be archived?
      • How can such an archived database be retrieved and used?
      • are there standards in producing and operating databases which can assist with the archive process?
      • how do we prevent the business community from confusing data warehousing with archiving?
      • Can data warehousing storage techniques (long trends needed therefore long storage times) be used for archival storage? Will data-warehoused databases be 'complex databases'?

      The World Wide Web presents another challenge to archiving. A Web page can include routines that respond to the user's commands. A record published on the Web can include links to other records, programs (applets) that act by themselves, and applets that respond to the user.

      • what does it mean to archive a WWW page?
      • what is the minimum acceptable representation?
      • what should be achieved? (our goals)
      • are there standards for producing WWW pages which will assist with good archiving practices?
      • who will migrate the Web itself?

      MIGRATION OF ARCHIVES

      Finally, the archive itself must be migrated. Both hardware and software may need to be changed, and the changes may come about slowly, with time for planning, or more unexpectedly, even as an emergency. For example, it is clear that even stable computer operating systems change versions every year or two, and most disappear from routine use after 10-20 years. Equally, computer data storage peripherals (tape drives, disk drives, ...) may become unreliable after 5-10 years, and may be unsupported by their original vendor. Sometimes a vendor will lose technical staff or cease trading, and support may collapse more suddenly.

      As a result of these pressures, plus the advent of new technologies, archives will migrate in small steps every year or so, and will undergo major migrations (to new hardware and software platforms) every 4-7 years. This raises issues, including:

      • what are the project management models for migrating an archive?
      • what are the risks, and how are they managed?
      • what are the costs of archive migrations?
      • how does migration of the archive interact with migration of the records?
      • what are the rates of record loss?
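
      The last question can be given a rough shape with some arithmetic. The sketch below counts how many major migrations a record passes through during its retention period, using the 4-7 year cycle suggested above, and then compounds a loss rate per migration. The 50-year retention period and the 1% loss rate are purely hypothetical inputs chosen to show the calculation; we have no measured figure to offer, and the assumption that losses are equal and independent at each migration is ours.

```python
def major_migrations(retention_years, cycle_years):
    """Number of complete major migrations within the retention period."""
    return retention_years // cycle_years


def surviving_fraction(migrations, loss_per_migration):
    """Fraction of records still usable, assuming equal, independent losses."""
    return (1.0 - loss_per_migration) ** migrations


retention = 50          # years; hypothetical retention period
loss = 0.01             # hypothetical: 1% of records lost per major migration

results = {}
for cycle in (4, 7):    # the two ends of the 4-7 year range
    n = major_migrations(retention, cycle)
    results[cycle] = (n, surviving_fraction(n, loss))
    print(f"cycle {cycle} yr: {n} migrations, "
          f"{results[cycle][1]:.1%} of records survive")
```

      Under these assumptions even a small loss at each step compounds into a noticeable loss over the life of the archive, which is why measured rates of record loss would be so valuable.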

      CONCLUSIONS

      Although this is only a discussion document, you may still wish to ask why these topics were not given priorities. Should we not propose an order for tackling these problems?

      In fact the answer, at one level, is simple. The growth of electronic records in business, entertainment, academic research, and private use is so great that all of these issues must be tackled urgently. The penalty, if we do not solve electronic archiving quickly, is that we will lose a large part of our cultural and social history. The forces represented by technological change (and resulting obsolescence) in computing hardware and software are more devastating than war, and as inexorable. Furthermore, they are truly worldwide in their impact. No corner of the globe is free of computers; few aspects of human life are not affected by, and recorded in, computers. If we don't archive these electronic records soon, several generations (ours!) will leave drastically impaired traces for historians. We will also find ourselves unable to transmit what we have learned to our descendants.

      At the other end of the scale, since we know some topics will be studied, and others won't, the decisions on priorities are not ours to make. They will be determined by academic forces:

      • what do you want to study?
      • what do you have the skills and facilities to study?
      • what can you get support to study?

      and by commercial forces:

      • what affects our business?
      • from what can we create a business?
      • where are benefits greater than costs?

      and by governments and the public will:

      • what are businesses required to do?
      • what will government do itself?
      • what will government pay for?

      So, we hope you enjoy the next few years of research into Electronic Records. And please, please archive your research results carefully!

