C-ing ahead for digital longevity

David Holdsworth
CAMiLEON Project
University of Leeds
LS2 9JT UK

This is a second draft of Emulation: C-ing ahead.

It is a companion paper to Emulation, Preservation and Abstraction.

Executive Summary

It is proposed that a (the?) most cost-effective technique for implementing emulation for long-term preservation is to use a widely available programming language for which there are good prospects for long-term availability. There is the further suggestion that a subset of the language be used to avoid those features that are unlikely to carry forward into subsequent languages. The proposal is that the language to use for now is a subset of C. We tentatively suggest the name C – –. We also propose that this subset is chosen with a view to automatic translation of software written in C – – into a subset of another more modern language, perhaps version 7 of Java. (Java currently stands at version 2.)

The author has practical experience in the preservation of an operating system for an obsolete mainframe system of the 1970s. The techniques advocated in this short document are a direct consequence of this work, which has been done by writing emulation code in both C and Java.

Although it was emulation that led to the ideas expressed here, there is good reason to use the same language for writing migration tools.

Most of this paper is written for a computer science audience, for it is that community that is best able to judge whether the techniques recommended here do indeed have the potential to stand the test of time.

Background

We are concerned here with preservation techniques for digital objects whose original operation involved execution of program code that was a part of the object itself. Typical objects of this type are operating systems, self-start CDs (e.g. Encarta) and applications software.

Jeff Rothenberg has conducted an experiment in emulation as a preservation strategy. In this experiment he uses an Emulation Virtual Machine.

The IBM Almaden Research Center is involved in a project (see Lorie 2001) in which the intention is to design a Universal Virtual Machine (UVM), which is then used for actual emulator implementation.

My own work (with a colleague, Delwyn Holroyd) has implemented emulation of the ICL1900 system to the extent that we can run the George3 operating system, including its time-sharing feature. This also gives us access to software systems written to run under George3, including the world's first Algol68 compiler. This implementation has been designed to have a long life. However, we have chosen to use a programming language as the stable implementation platform, rather than the virtual machine approach of Rothenberg and Lorie.

In the search for continuity in an environment of rapidly changing technology where market forces push towards planned obsolescence, a few aspects of computing have shown long-term stability, largely on account of their widespread use throughout the IT industry. There then comes a point where the market forces actually want stability in order to protect investment.

The 3.5" diskette is a case in point. Although its value as a storage medium has long since ceased, its interchangeability led to wide use in data transfer. It is only recently that a 3.5" disk drive has ceased to be a standard part of every desktop system.

Some programming languages achieve even longer currency largely on account of the amount of software investment dependent upon them. Pre-eminent in this category is C.

Emulation in C and Java

The C programming language includes most of the constructs of structured programming, but still allows a very simple view of memory. This is particularly suited to emulation as we necessarily find ourselves concerned with an array of words of memory as we emulate the main store of the emulated machine. C also has a view of files as sequences of bytes, which is convenient to emulation of peripherals.
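As a minimal sketch of this view of memory, the fragment below holds an emulated main store as a plain array of words. The names (store, fetch, deposit) and the store size are illustrative assumptions, not code from the ICL1900 emulator; the mask reflects the 24-bit word of the 1900 series.

```c
/* Illustrative sketch only: names and sizes are assumptions,
   not code from the ICL1900 emulator. */

enum { STORE_WORDS = 32768 };                        /* hypothetical store size */

static const unsigned long WORD_MASK = 0xFFFFFFUL;   /* low 24 bits of a word */

static unsigned long store[STORE_WORDS];             /* emulated main store */

/* Read one word; the address is wrapped into the size of the store. */
unsigned long fetch(unsigned long address)
{
    return store[address % STORE_WORDS];
}

/* Write one word, keeping only the low 24 bits of the value. */
void deposit(unsigned long address, unsigned long value)
{
    store[address % STORE_WORDS] = value & WORD_MASK;
}
```

Nothing here depends on the layout of the host machine's memory: the store is simply an array indexed by emulated address, and a value wider than the emulated word is truncated on the way in.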

Java's view of memory is much more abstract, but still allows arrays of integers, which make a convenient representation of emulated main storage. Java has a multi-tasking model as an integral part of the language, whereas the thread facilities in C are a more recent feature of the language, and not necessarily supported on all platforms.

In our emulation of the ICL system, we have used C for emulating the main 1900 processor, and used Java for emulation of the communications processor (7903) for which the multi-tasking aspects are valuable. Although still imperfect in some respects, the system works well enough to evoke immediate recognition by those who know the original system, and to vindicate the techniques used in its construction. The emulation has run successfully on Win32, Irix, Solaris and Linux. We have not tried any other platforms.

Migration Tools

Migration costs are lowest when a migration-on-demand technique is used. This on-demand approach also has the advantage that the original is retained long-term and may be useful for analyses not conceived of when preservation was originally undertaken. The downside of migration-on-demand is the need to keep the migration tools working long-term, which is just the same requirement as for emulators. There is therefore a need for an environment in which long-lived migration tools can be implemented. Although C may not lend itself quite so well to the task of writing migration tools as it does to the writing of emulators, it is still quite viable in this area. We go so far as to suggest that by writing migration tools in our more restrictive C – –, the extra longevity achieved will more than repay any inconvenience in the programming language.

Longevity

We seek to implement emulation in such a way as to have it function successfully in the long-term future. The C programming language has undergone evolution from its Kernighan and Ritchie base to its current ANSI standard. In addition there has been the development of C++ (which has C as a subset) and Java (where many of the more hygienic aspects of C are clearly visible). There is so much software currently in important use and written in C (UNIX, for instance) that its continued availability is not under threat for some decades to come.

However, we wish to address the longer term. It seems unlikely that some alternative programming paradigm (e.g. functional programming) will completely eclipse the traditional style. When we look at C we can observe that many of its features are to be found in other languages. Here I am concerned with features at the semantic level. As an example, the assignment statement exists in C, Algol60, Algol68, Pascal, Ada83/95 and Java, to name but a few. There is a syntactic difference in that C and Java have x = y, whereas the others have x := y. On the other hand, there are features of C that have been deliberately discarded in newer languages, e.g. macros, address arithmetic, variadic parameter lists.
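To make the distinction concrete, here is a small sketch of one routine written twice. The first version leans on a macro and on address arithmetic; the second confines itself to an enumeration constant and array indexing, both of which have direct counterparts in Java and Pascal. The function and constant names are hypothetical, and which constructs finally belong in the subset is not settled by this example.

```c
/* Version 1: relies on a macro and on address arithmetic,
   two features that newer languages have dropped. */
#define TABLE_SIZE 4

long sum_by_pointer(const long *p)
{
    const long *end = p + TABLE_SIZE;   /* address arithmetic */
    long total = 0;

    while (p < end) {
        total += *p;
        p++;
    }
    return total;
}

/* Version 2: the same computation using only an enumeration
   constant and array indexing. */
enum { TABLE_LENGTH = 4 };

long sum_by_index(const long table[])
{
    long total = 0;
    int i;

    for (i = 0; i < TABLE_LENGTH; i++) {
        total += table[i];
    }
    return total;
}
```

Both functions return the same result for the same four-element table; only the second translates line for line into a language without pointers or a preprocessor.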

I propose that we recommend a subset of C in which to write emulators for long-term preservation. My personal experience is that the amount of work involved is by no means excessive. Our emulation of the ICL1900 was achieved as a spare-time activity over a period of about 18 months. We are both of us in full-time employment.

Tentative proposals for selection of the subset are in Appendix A.

The expectation is that over time it will become necessary either to modify the subset if it turns out to contain features that are removed from the language (indicating a bad choice of subset), or to move the policy to use a subset of a different language.

Implementation

We need a tool to confirm that any emulator to be used for preservation (hereafter just the emulator) uses only the preservation subset of C (hereafter referred to as C – –). We can readily achieve this at the syntactic level by taking the official LALR grammar of C and removing from it the productions that are not in C – –. If we do this using yacc, we can further add in semantic checks to detect other transgressions.
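The yacc-based parser is the tool we actually have in mind. Purely to illustrate the kind of transgression it would report, the sketch below is a far cruder stand-in: a lexical scan that counts macro definitions and ellipses in a program text. The function name is hypothetical, and the scan is deliberately naive in that it does not skip comments or string literals, which a grammar-based checker would handle properly.

```c
#include <string.h>

/* Count two constructs suggested for exclusion from the subset:
   "#define" at the start of a line (after optional blanks) and
   "..." anywhere in the text.
   Naive by design: comments and string literals are not skipped. */
int count_transgressions(const char text[])
{
    int count = 0;
    int at_line_start = 1;
    size_t i = 0;

    while (text[i] != '\0') {
        if (at_line_start && strncmp(&text[i], "#define", 7) == 0) {
            count++;
        }
        if (strncmp(&text[i], "...", 3) == 0) {
            count++;
            i += 2;                     /* move to the last dot of the ellipsis */
        }
        if (text[i] == '\n') {
            at_line_start = 1;
        } else if (text[i] != ' ' && text[i] != '\t') {
            at_line_start = 0;
        }
        i++;
    }
    return count;
}
```

A preservation archive could run such a check as a first filter, but only a parser built from the reduced grammar can give the assurance that the text really lies within the subset.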

A further opportunity is opened up by this approach. We must consider the time when C becomes computational Latin, and is replaced by another lingua franca (let us call it E). The C – – yacc parser could be the vehicle for implementation of software for the automatic translation of C – – emulators into E – –.

Of course, it is always possible that yacc may not last for ever either — but yacc is a C program, and could possibly be translated into C – – when the time came when it was no longer seen as part of the standard kit of parts. Making it generate E – – may be more problematic. On the other hand, the LALR algorithm implemented by yacc is a cornerstone of compiler implementation, and knowledge of it is unlikely to be lost.

It is inevitable that the restrictions of C – – will make some things impossible. For a start, the desire to exclude variadic parameter lists would restrict the use of printf. We thus propose that a C – – program may require linkage with a small (and we stress small) section of code written in C. It is assumed that any future migration of the emulator away from C would involve hand coding of these small C sections. It may prove possible to make this C section common to more than one emulator.
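One way such a small C section might look is sketched below. The wrappers are the only code that calls the variadic printf; the emulator proper sees only functions with fixed parameter lists. All of the names are hypothetical, and the octal format for a 24-bit word is an illustrative choice rather than a description of our emulator's output.

```c
#include <stdio.h>

/* --- Small section in full C: the only callers of printf. --- */

/* Each wrapper returns the number of characters written. */
int put_text(const char text[])
{
    return printf("%s", text);
}

int put_number(long value)
{
    return printf("%ld", value);
}

/* Print the low 24 bits of a word as eight octal digits. */
int put_octal_word(unsigned long word)
{
    return printf("%08lo", word & 0xFFFFFFUL);
}

/* --- Emulator code in the restricted style: fixed parameter lists only. --- */

int report_halt(unsigned long address, unsigned long word)
{
    int written = 0;

    written += put_text("halted at ");
    written += put_number((long) address);
    written += put_text(" word ");
    written += put_octal_word(word);
    written += put_text("\n");
    return written;
}
```

On migration to a successor language, only the three wrappers would need hand coding; report_halt and everything like it would be a candidate for automatic translation.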

Appendix A: C – C – –

Much more work remains to be done to decide the subset of C that is to be C – –, but the following aspects are meant as examples of the intended difference between C – – and C. The general principle is to exclude those aspects of C that are now regarded as examples of primitive language design, as evidenced by their exclusion from more modern languages, notably Java, but also Pascal, Ada and a few others.

Features for omission from C – –:

Restrictions on usage in C – –. In order to avoid being so restrictive that emulator implementation becomes overly burdensome, and to avoid protracted debates about contentious features, we suggest labelling some features as "deprecated", i.e. allowed, but discouraged. This approach to language development was introduced in the process that led to FORTRAN90, and is currently important in the development of Java.

Appendix B: Emulation Anecdotes

The details of the ICL1900 implementation are (or will soon be) described elsewhere, but one or two experiences bolster the ideas in this paper, and are thus worth including here. The major lesson is the value of actually getting the emulation to work at the preservation stage, or at least while related material (including human memory) is still accessible.

The George3 operating system, which runs under our emulator, was written in assembler by a team of programmers. As a result it seems to use every quirk of the machine's order code at some point. The final breakthrough into reliable operation came when we implemented a property of the overflow register that was not hinted at in the summary chart, and was detailed once in a thick four-volume manual. It seems likely that such a property might escape the specification process.

The source text of George3 was an invaluable reference from time to time. Some of the later features of the system's interfaces were not in the mainstream manuals, although they may have featured in software notices. The thought of reading through many hundreds of these was sufficient disincentive to make inspection of the source code a more fruitful way to investigate mysteries. One particular feature of the interface to the communications processor was only revealed by a comment in the source code, after which dim recollection of 25-year-old knowledge was sufficient.

During the early stages of the emulation work, there still existed a single live installation of George3. We took the precaution of getting this system to produce its diagnostic memory dump, so that we had an example of a real system in operation. Reference to this did occasionally help us to clarify aspects of interfaces whose documentation assumed knowledge that we no longer possessed.

We have deliberately steered clear of system-dependent features in our use of C. Each of the two authors routinely uses a different compiler (Visual C++ and Cygnus gcc), and from time to time checks that the code still operates on other systems. We have factored out the parts of the code that are necessarily platform specific.

David Holdsworth
August 2001