Virtual Archives

David Bearman
ICA Meeting, September 1996
Beijing



Abstract:

In an era of electronic communications, there is less rationale for gathering archives in one physical place than there has traditionally been. Indeed, many arguments of practicality and expense can be raised against planning on physical archives in an electronic age. However, before we can move to completely virtual electronic archives, all electronic records must have associated with them the metadata that gives them their recordness and enables us to manage them appropriately over time. This metadata must be grounded in an understanding of the functional requirements for recordkeeping and the meaning of evidence, and it must support all the processes, from creation to destruction or access, that will need to be applied to the documents. This metadata, called here the metadata of "Business Acceptable Communications", will serve as the foundation for virtual archives and provide us with the ability to re-invent the archives of the future along distributed lines, with great benefits to the archives profession and to researchers.

Introduction

Within the next decade, almost all organizational records created in our society will be made and communicated electronically. As a consequence, in as little as a generation, the vast majority of all organizational records ever created will be electronic. Because the location of electronic records has little to do with their accessibility, we will then be able to access any records from anywhere with equal ease. As a result, today's physical repositories of archives based on custody will become nodes in virtual archives which also include records outside archival custody but under archival control. The virtual archives of the future will be maintained not through accessioning, preservation and provision of on-site access, but through the control of information about records, and its use to ensure retention, disposition and access.

No new technology is required to achieve this end; all that is necessary is that we adopt standards and methods today that will lead to this result. Achieving this goal would simultaneously reduce the overheads involved in physical care of records and the management of their disposition. Without such standards, the expense of migrating records across software and hardware generations will greatly exceed that of managing our current paper-based archives, but with the adoption of appropriate standards, the marginal cost of on-going management of archival records will be minimal.

This paper does not explore the professional and organizational aspects of such re-engineering of the archival profession, which have been dealt with elsewhere.[1] Neither does it examine the technical case, or costs, of reconfiguring paper-based archives, because this case is being made in practice by Digital Library projects throughout the world.[2] Rather it focuses on two critical barriers to ubiquitous creation and management of electronic records:

First, the assurance that such records satisfy the requirements for evidence, and

Second, methods by which records can be made available over time without constant re-presentation and migration of their intellectual contents.

Unless we can ensure that these two requirements are satisfied in electronic environments, no electronic records systems can be archival. If these two criteria can be satisfied, and we act to implement these methods in all electronic communications systems, all electronic environments will be able to support the functional requirements for recordkeeping and no special recordkeeping systems need be implemented to redundantly hold and provide access to archival records.

Archivists are far from alone in seeking to define and implement environments that will ensure the creation and manageability of electronic records. In business arenas from commerce to health care, and from research and development to manufacturing, managers are seeking to define standards for data interchange adequate for their business purposes because such electronic systems will make their businesses more effective. The literature is replete with jargon-laden discussions of how to enable end-to-end electronic business interaction, whether to support new processes such as those enabled by electronic patient records or electronic laboratory notebooks, or to satisfy documentation requirements imposed by process-oriented quality standards such as CALS or ISO-9000.[3]

Lawyers and auditors place demands on network designers to ensure that electronic records, satisfying requirements for evidence and manageability, are created in communications environments within and beyond the boundaries of the organization. Already they are encountering requirements to identify records, control access to them, manage software dependencies which will impact on the usability of records over time, and find ways to represent the contextual significance (business meaning) of the records that have been created.[4] Indeed many critical observers have argued that unless we can satisfy requirements for "integrity", "authenticity", "reliability" and "archiving" of digital information, the National and Global Information Infrastructures will never be able to support serious work.[5]

Archivists have not entirely ignored these emerging requirements.[6] At the University of Pittsburgh School of Library and Information Science, faculty and students engaged in a research project funded by the National Historical Publications and Records Commission have been examining the "Functional Requirements for Recordkeeping" and have developed a specification of the attributes of "recordness" or evidentiality.[7] The specification defines thirteen properties which are identified in law, regulation and best practices throughout society as the fundamental properties of records. By formally restating these detailed specifications, the researchers have derived "production rules", or logical statements of simple observable attributes which, when present, ensure that a system is creating and maintaining records as evidence.[8] At this level of specificity, the production rules, and hence the functional requirements for recordkeeping, can be demonstrated to be satisfied by the presence of specific information about the time, place and business function of the record creation, the ways the data are structured and the content of the communication itself. This specific information about records is called metadata. The functional requirement is satisfied if the metadata essential to record content and structure is inextricably linked to and retained with the metadata defining the context of the business transaction it enabled. This metadata guarantees that the record will be usable over time, only accessible under the terms and conditions established by its creator, and have properties required to be fully trustworthy for purposes of executing business.[9]

The metadata requirements for evidentiality or "recordness" constitute a definition of information required for "business acceptable communications". Business acceptable communications could be enabled by a scheme for electronic envelopes containing business communications that would ensure that the envelopes could be opened by different computers in the future and their contents would still be accurate, understandable and meaningful. The metadata required to open such envelopes in the future, if always associated with the record contents, would also ensure the viability of virtual archives. Wherever the record was stored, it would always contain the information required for systems to manage its disposition and/or provide access. In order to guarantee such distributed archival functionality, standards must be adopted which prevent the creation and communication of records that are not business acceptable communications.

The need for such standards is widespread. Not only would they make communications received over networks trustworthy for the purposes of conducting business, and help to ensure accountability and protect organizations against the risks of loss of proof of their past behavior, they could, if properly specified, greatly simplify:

Since standards which capture record metadata must exist for any system or application, and be implemented at many different layers of software and hardware, it is necessary to define the relationships between these metadata standards and how the comprehensiveness of a given set of standards can be determined.[10] Such definitions of relationships between standards are called "reference models" because they serve as reference points for designers to structure the methods they develop to implement the standards. This paper introduces such a reference model, but first, to understand what data is necessary for business acceptable communications, we examine in more detail the nature of electronic evidence.

Electronic Evidence and the Functional Requirements for Recordkeeping:

Transactions (actions taken 'across' something) are by definition actions communicated from one person to another, from a person to a store of information, such as a filing cabinet or computer database, or from a store of information to a person or a computer.[11] As such, transactions must leave the human mind or the computer memory in which they are created and be spoken, written or read from elsewhere. Electronic transactions must be conveyed across at least one software layer, and typically across a number of hardware switches or connections, in order to be communicated.

Records are the carriers, products and documentation of transactions. Not all data is a record because not all data completely represents the transaction in which it was engaged. In fact, most information created by and managed in information systems is not a record and lacks the properties of evidence. Records will only be evidence if the content, structure and context information required to satisfy the functional requirements for recordkeeping is captured, maintained and usable.

The functional requirements for capture, maintenance and use of the content, context and structure of records derive their warrant from law, regulation, and professional guidelines.[12] The requirements for recordkeeping are corporate requirements, not those of a single business function, and are therefore applicable to any organizational communications. These requirements are the foundation of good business practices and are essential guidelines in reducing risks from increased liabilities or decreased opportunities that accompany poor recordkeeping practices. Records oriented professionals within organizations, such as senior management, legal counsel, auditors, Freedom of Information and Privacy officials, and archivists all require records, and not just information, for their on-going work.

Organizations want to satisfy these requirements in the normal course of business, but it has been difficult to do so in the computer-based communications environments we have installed in the past because applications software has not created or managed the metadata required by records. Information systems are generally designed to hold timely, non-redundant and manipulable information, while recordkeeping systems store timebound, inviolable and redundant records. Few, if any, in-house information managers have been able to devote the energy to rigorous definition of the distinct requirements for recordkeeping or, if they had, would have been able to envision how to satisfy these throughout all systems. Without explicit and testable specifications, these systems have failed to satisfy the requirements for recordkeeping and are, therefore, a growing liability to companies even while they are contributing directly to day-to-day corporate effectiveness.

The "Functional Requirements for Recordkeeping" (FRRs)[13] dictate the creation of records that are comprehensive, identifiable (bounded), complete (containing content, structure and context), and authentic. These four properties are defined by the FRRs in sufficient detail to permit us to specify what metadata items would be needed to describe them and to audit systems for these properties. This descriptive metadata must be managed in a way that prevents its being separated from the record or being changed after the record has been created.

We can envisage a record as a metadata encapsulated object (content in an envelope with metadata on the outside), although in fact it might not be physically stored in this manner just as we might keep a paper case file separate from the indexes which identify it. When transmitted, the contents of the record are preceded by information identifying the record, the terms for access, the way to open and read it, and the business meaning of the communication. Metadata encapsulated objects may contain other metadata encapsulated objects, because records frequently consist of other records brought together under a new "cover", as when correspondence, reports and results of database projections are forwarded to a management committee for decision.
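The envelope image can be made concrete with a small sketch. The Python below is illustrative only: the class name `EncapsulatedRecord`, its field names and the sample records are assumptions introduced here, not elements of the reference model. It shows header metadata kept apart from the content, and a "cover" record that encloses other records.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class EncapsulatedRecord:
    """Content in an envelope with metadata on the outside (hypothetical fields)."""
    record_id: str           # identifies the record
    access_terms: str        # the terms for access
    how_to_read: str         # the way to open and read the content
    business_meaning: str    # the business meaning of the communication
    content: bytes = b""
    enclosed: Tuple["EncapsulatedRecord", ...] = ()  # records under this cover

    def header(self) -> dict:
        """Return the metadata that precedes the content when transmitted."""
        return {
            "record_id": self.record_id,
            "access_terms": self.access_terms,
            "how_to_read": self.how_to_read,
            "business_meaning": self.business_meaning,
        }

    def walk(self) -> Iterator["EncapsulatedRecord"]:
        """Yield this record, then every record enclosed within it."""
        yield self
        for inner in self.enclosed:
            yield from inner.walk()


letter = EncapsulatedRecord(
    "rec-001", "staff only", "plain text", "correspondence", b"Dear committee ...")
report = EncapsulatedRecord(
    "rec-002", "staff only", "plain text", "report", b"Findings ...")
submission = EncapsulatedRecord(
    "rec-003", "committee only", "container of records",
    "submission to management committee for decision",
    enclosed=(letter, report))

ids = [record.record_id for record in submission.walk()]
```

Walking `submission` visits the cover first and then the enclosed letter and report, so each enclosed record remains identifiable in its own right.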

The metadata required to ensure that functional requirements are satisfied must be captured by the overall system through which business is conducted. The system here includes personnel and policy in addition to hardware and software. The metadata created with the record must allow the record to be preserved over time and ensure that it will continue to be usable long after the individuals, computer systems and even information standards under which it was created have ceased to be. Several additional requirements define how the data must be maintained and ultimately how it and other metadata can be used when the record is accessed in the future.

What metadata then should encapsulate a data object in our envelopes and how should it be structured? The metadata elements discovered in the analysis of the functional requirements for recordkeeping can be organized in six layers based on the functionality that needs to be supported. Each layer of metadata is composed of information relevant to specific hardware, software and organizational entities organized to do a specific job within the overall recordkeeping and communications environment. These layers are, in the order in which they are encountered: record registration, terms and conditions of use, structure, context, content, and the history of use.

We have already mentioned that records may contain other records, and contain data from a variety of sources in any possible data format, so it is evident that the metadata must identify, or "register" a record uniquely. Only in this way can we know what data was in, and out, of a given transaction, and that the data in a communication is being declared to constitute a record. This record registration function is primary, and independent of all other functions of the record, so its metadata constitutes a discrete layer.

We also must know if the record is available for reading, and if so to whom and under what authority. For this purpose the encapsulation must next contain a layer of information related to terms and conditions of use. This metadata supports the functions of security control, permission negotiation and payment and sets any tables necessary to invoke redaction of record contents based on privacy, confidentiality or secrecy. Because these functions must be handled before the record itself is released and its subsequent metadata becomes available for analysis by the requester, this metadata also occupies its own layer and comes next.

Ideally all data objects that we want to communicate, that is the contents of all records, would be "interoperable" over a substantial period of time. To be interoperable, it must be possible to open and read them using computer systems other than those which created them. We attempt to achieve such interoperability by encoding contents in standard formats to give them a degree of software independence, making them usable by software other than that which created them. However, many data objects that we create today cannot be standardized, and the actual degree of software independence which can be ensured depends on how long any given "standard" can be expected to remain a standard. Therefore, the metadata with which we encapsulate records must flag the dependencies of the data (including their dependency on standards) so that a future review of the encapsulating metadata or "record headers" can locate potential sources of expiring standards or software dependencies and segregate records requiring logical "re-presentation" (called migration to new software formats) before they become unreadable. We record these dependencies in a layer of structure metadata encountered by the software in which the record will need to be represented. This metadata should enable the remaining information to be opened and meaningfully interpreted.
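A review of record headers for expiring dependencies might look like the following sketch. Everything named here is hypothetical: the `Dependency` and `RecordHeader` classes, the `supported_until` field, the format names and the dates are invented for illustration and are not taken from the reference model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Dependency:
    name: str                        # a standard or software product (invented names)
    supported_until: Optional[date]  # None means no expiry is currently known


@dataclass
class RecordHeader:
    record_id: str
    dependencies: List[Dependency]


def needs_representation(header: RecordHeader, review_date: date) -> bool:
    """True if any flagged dependency expires on or before the review date."""
    return any(
        dep.supported_until is not None and dep.supported_until <= review_date
        for dep in header.dependencies
    )


def segregate(headers: List[RecordHeader],
              review_date: date) -> Tuple[List[str], List[str]]:
    """Split record ids into those needing re-presentation and the rest."""
    at_risk = [h.record_id for h in headers if needs_representation(h, review_date)]
    stable = [h.record_id for h in headers if not needs_representation(h, review_date)]
    return at_risk, stable


headers = [
    RecordHeader("rec-010", [Dependency("FormatA", date(1998, 6, 30))]),
    RecordHeader("rec-011", [Dependency("FormatB", None)]),
    RecordHeader("rec-012", [Dependency("FormatB", None),
                             Dependency("ViewerC", date(1999, 12, 31))]),
]
at_risk, stable = segregate(headers, date(2000, 1, 1))
```

Only the headers are read; the record contents are never opened, which is the point of recording dependencies in the structure layer.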

A specification of the metadata required to define the management requirements for evidence of business transactions arising from distinct processes can ensure a reduction of corporate risks and support formally auditing the business system. It enables us to locate where and how software, hardware, procedures and policies surrounding a system contribute, or fail to contribute, to the creation, maintenance and use of evidence. While no system of management can be self-auditing, a communications system built to ensure that appropriate metadata is captured for evidence can support a level of management accountability that it was never previously possible to implement or enforce. For these purposes we need to maintain contextual metadata, contained in a layer that precedes content so that it is always present as provenance.

Following the contextual layer comes the content itself. This is the data which was engaged in the transaction, and it may take any form. Technically this is a BLOB (binary large object) which may contain text, images, sound, and even links to other systems and data in executable form (although such dynamic elements will make the long-term preservation of such records highly problematic).

Finally, our concept of evidence makes it important to know when records were used and how, and in what ways they were filed, classified and restricted in the past. We also need to know if records have been destroyed, when and by whom, and under what disposition authority that act took place. It is also important to us to know what redacted versions of records were released over time. Our traditional recordkeeping and archival control systems enable us to document such important events in the lives of records, if at all, only on an aggregate level. Electronic records are communicated each time they engage in a transaction such as filing, classification, destruction or release; therefore transactional data reflecting the history of records use will be generated by electronic recordkeeping systems. Instead of such data residing only at aggregate levels, electronic records metadata structures enable us to record this data for each record. An important class of metadata is, therefore, historical transactions data. This data is retained in a layer at the end of the record to allow new transactions to be added to the data stream in the order of their chronological occurrence without having to open the record.
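The ordering of the layers described above can be illustrated with a short sketch. It is not an implementation of the reference model: the class `LayeredRecord`, the key `allowed_readers` and the sample values are assumptions made for this example. It shows two behaviours from the text, namely that the terms and conditions are checked before anything later in the record is released, and that use events are appended at the end without altering the content.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class LayeredRecord:
    # Fields are declared in the order in which the layers are described.
    registration: Dict[str, str]     # unique identification of the record
    terms: Dict[str, List[str]]      # terms and conditions of use
    structure: Dict[str, str]        # format and software dependencies
    context: Dict[str, str]          # business context, kept as provenance
    content: bytes                   # the data engaged in the transaction
    use_history: List[Tuple[str, str, str]] = field(default_factory=list)

    def log(self, when: str, actor: str, action: str) -> None:
        """Append a use event; earlier events and the content are untouched."""
        self.use_history.append((when, actor, action))

    def release(self, requester: str, when: str) -> Optional[bytes]:
        """Consult the terms layer before exposing the layers that follow it."""
        if requester not in self.terms.get("allowed_readers", []):
            self.log(when, requester, "access denied")
            return None
        self.log(when, requester, "released")
        return self.content


record = LayeredRecord(
    registration={"record_id": "rec-100"},
    terms={"allowed_readers": ["auditor"]},
    structure={"format": "plain text"},
    context={"business_function": "procurement"},
    content=b"Purchase order ...",
)
denied = record.release("visitor", "1996-09-02")
granted = record.release("auditor", "1996-09-03")
```

Both the refused and the successful request leave an entry in `use_history`, in chronological order, so the record carries its own history of use.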

The Metadata of the Reference Model

The individual metadata elements discovered in the analysis of the production rules for the functional requirements for recordkeeping can be clustered by their intellectual commonalities. Elements referring to the same aspects of the record are "clustered" in this way for convenience of reference, use and representation. Each cluster may consist of some mandatory and some optional metadata elements, and sometimes of a potentially extensible set of data elements specific to a business process. The clusters themselves are considered part of the reference model and must always occur.

The metadata content directly related to satisfying requirements for evidence is always mandatory. Hence evidence, required for the conduct of business and for accountability, is ensured by a Metadata Encapsulated Object conforming to the reference model for "business acceptable communications". The metadata content which contributes to recordkeeping, or management of records, but is not essential to evidence, is optional. Metadata content useful for specific business functions may be defined as mandatory for business in that domain or optional.
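The distinction between mandatory and optional metadata lends itself to a simple check at the point of creation. In the sketch below the element names are hypothetical placeholders (loosely echoing the time, place and business function of record creation mentioned earlier); the actual elements are those defined in the reference model.

```python
from typing import Dict, List

# Hypothetical element names, chosen only for illustration.
MANDATORY_ELEMENTS = (
    "record_id",
    "transaction_time",
    "transaction_place",
    "business_function",
)
OPTIONAL_ELEMENTS = ("retention_note",)


def missing_mandatory(metadata: Dict[str, str]) -> List[str]:
    """List mandatory elements that are absent or empty."""
    return [name for name in MANDATORY_ELEMENTS if not metadata.get(name)]


def is_business_acceptable(metadata: Dict[str, str]) -> bool:
    """A communication qualifies only if every mandatory element is present."""
    return not missing_mandatory(metadata)


complete = {
    "record_id": "rec-200",
    "transaction_time": "1996-09-02T10:15:00",
    "transaction_place": "Beijing",
    "business_function": "conference registration",
}
incomplete = {"record_id": "rec-201", "retention_note": "review in five years"}
```

A system applying such a check could refuse to transmit `incomplete`, while `complete` passes even though it carries no optional elements.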

The clusters we have identified are:

Within each cluster are one or more metadata elements referencing specific properties of the record. These elements are detailed in the reference model itself but not discussed here in the interest of preserving the focus of this paper on virtual archives. The reference model, at this writing, is a proposal being presented to a variety of standardization bodies and professional community forums.[13]

Metadata Encapsulated Records and Virtual Archives

The reference model sketched above ensures the creation, preservability and deliverability of evidence over time. It also contributes directly to enabling virtual archives.

First, because the reference model can guide implementations that result in appropriate metadata even before standards are adopted that guarantee these outcomes.

Second, because a large number of standardization efforts in many parts of the information technology community are creating standards for metadata encapsulated objects but are as yet uninformed by the Functional Requirements for Recordkeeping. Guided by the reference model, these efforts could converge and produce outcomes consistent with the creation of records.

It is important to note that in addition to ensuring that the data we capture is a record, and can serve as evidence, the metadata requirements established for the reference model are designed to ensure that communicated data objects are:

These highly desirable properties will be imparted by keeping records with metadata conformant to the reference model. In the absence of standards that conform to the reference model, these benefits could be guaranteed by capturing and maintaining this same metadata by design, policy or system implementation practices.

Although the reference model introduced here is still only a proposal, there is good reason to expect a model of this sort to prevail soon: conformity in metadata encapsulation would greatly simplify the engineering of networked, distributed business communications systems, and without it, it will be very difficult to use networked communications for multi-lateral business purposes. Designing networked applications within a framework in which transactions are encapsulated by appropriate definitive metadata allows us to ignore where records are stored or what specific systems created the records, and to concentrate instead on generic records management functions that locate records which are scheduled for disposition and carry out the appropriate actions, or locate records which require conversion from one standard format to another. On the other hand, even imagining a commercial transaction outside of the assurances of the reference model presumes a willingness on the part of the two parties to a transaction to enter into specific bi-lateral contracts, such as those executed in support of EDI (electronic data interchange). Such bi-lateral agreements could not be executed for every possible business transaction, hence electronic systems will ultimately need to adopt standards such as those envisioned by the reference model. Needless to say, bi-lateral arrangements would undermine archival record retention while multi-lateral agreements make electronic recordkeeping possible for the same reasons they allow electronic business.
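A generic records management function of the kind described above can be sketched without reference to where records are held. The node names, the `dispose_after` field and the dates below are invented for this example; the only assumption carried over from the text is that every record travels with metadata stating how it is to be disposed.

```python
from datetime import date
from typing import Dict, List, Tuple


def due_for_disposition(nodes: Dict[str, List[dict]],
                        today: date) -> List[Tuple[str, str]]:
    """Return (record_id, node) pairs whose disposition date has passed.

    The function reads only the metadata carried with each record; it does
    not need to know what system created the record or why it is stored
    on a particular node.
    """
    due = []
    for node_name, records in nodes.items():
        for metadata in records:
            if metadata["dispose_after"] <= today:
                due.append((metadata["record_id"], node_name))
    return sorted(due)


# Two hypothetical storage locations, one inside archival custody and one not.
nodes = {
    "agency_server": [
        {"record_id": "rec-301", "dispose_after": date(1996, 1, 1)},
        {"record_id": "rec-302", "dispose_after": date(2010, 1, 1)},
    ],
    "archives_repository": [
        {"record_id": "rec-303", "dispose_after": date(1995, 6, 30)},
    ],
}
due = due_for_disposition(nodes, date(1996, 9, 1))
```

The same loop could select records by a format dependency instead of a date, which is why a uniform metadata envelope matters more than the storage location.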

Conclusions

Virtual archives, as opposed to physical repositories, will exist when users anywhere can access records anywhere without having to be aware of the source from which they came. Virtual archives will make use of the fact that electronic storage and retrieval of records is most effective when records are accessible from sources near the majority of the users. In a distributed environment, the actual storage location of records can follow patterns of use if all records are managed in a uniform environment and carry with them the knowledge of how they are to be disposed, preserved, and accessed. This knowledge is embodied in metadata that must be kept with the records at all times. If the metadata that is kept with records ensures their recordness or evidentiality, and the metadata is represented in elements, clusters and layers that support recordkeeping management requirements, then virtual archives can succeed. The Reference Model for "business acceptable communication" proposes such a framework for metadata which is grounded in archivally sound principles. Virtual archives can, and will, thrive in an environment in which such a reference model informs standards and implementations of electronic communications systems.

Last Modified: 7/8/96 [kjb]


