Emulation: C-ing ahead

David Holdsworth
CAMiLEON Project
University of Leeds
LS2 9JT UK

Disclaimer

This is an initial draft, and is certain to be revised early in the 21st century. For one thing, it will gain proper citations of other work mentioned. It is a companion paper to Emulation, Preservation and Abstraction http://www.leeds.ac.uk/CAMiLEON/dh/ep5.html

Executive Summary

It is proposed that a (the?) most cost-effective technique for implementing emulation for long-term preservation is to use a widely available programming language for which there are good prospects for long-term availability. There is the further suggestion that a subset of the language be used to avoid those features that are unlikely to carry forward into subsequent languages. The proposal is that the language to use for now is C.

My own experience is in the preservation of an operating system for an obsolete mainframe system of the 1970s. The techniques advocated in this short (I hope) document are a direct consequence of this work, which has been done by writing emulation code in both C and Java.

Background

We are concerned here with preservation techniques for digital objects whose original operation involved execution of program code that was a part of the object itself. Typical objects of this type are operating systems, self-start CDs (e.g. Encarta) and applications software.

Jeff Rothenberg has proposed that at the time of preservation of the actual bytes (or bits) that comprise the preserved object, we also preserve the specification of the platform upon which this object ran. My own experience suggests that it will be too difficult to capture all relevant information with confidence unless an emulation is actually achieved (see Appendix B below).

IBM and the British Library are involved in a project in which the intention is to design a Universal Virtual Machine (UVM), which is then used for actual emulator implementation. It seems courageous to me to christen something "universal" even before it has been designed, but perhaps if you are IBM you can make such nomenclature stick. The view taken is that a high-level language is too transient, and will not stand the test of time. History teaches us that some languages achieve such pre-eminence (and have such software investment dependent upon them) that they outlast virtual architectures, and the only hardware architecture that is in the same league is the IBM360/370/390, and that is younger than FORTRAN, although it can probably claim to pre-date C.

My own work (with a colleague, Delwyn Holroyd) has implemented emulation of the ICL1900 system to the extent that we can run the George3 operating system, including its time-sharing feature. This also gives us access to software systems written to run under George3, including the world's first Algol68 compiler.

Emulation in C and Java

The C programming language includes most of the constructs of structured programming, but still allows a very simple view of memory. This is particularly suited to emulation as we necessarily find ourselves concerned with an array of words of memory as we emulate the main store of the emulated machine. C also has a view of files as sequences of bytes, which is convenient to emulation of peripherals.
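
The following is a minimal sketch of this view of memory, not code taken from the ICL1900 emulator. It assumes a 24-bit word (the word length of the 1900 series, which this paper does not itself state), an arbitrary store size, and a peripheral image packed three bytes per word with the most significant byte first. The names store, store_read, store_write and load_words are illustrative only.

```c
#include <stdio.h>

#define WORD_MASK   0xFFFFFFL   /* 24-bit words (assumed word length) */
#define STORE_WORDS 32768       /* illustrative store size (assumption) */

static long store[STORE_WORDS]; /* emulated main store, one word per element */

/* Read a word; out-of-range addresses yield 0 in this sketch. */
long store_read(long addr)
{
    if (addr < 0 || addr >= STORE_WORDS)
        return 0;
    return store[addr];
}

/* Write a word, keeping only the low 24 bits.
   Returns 0 on success, -1 if the address is out of range. */
int store_write(long addr, long value)
{
    if (addr < 0 || addr >= STORE_WORDS)
        return -1;
    store[addr] = value & WORD_MASK;
    return 0;
}

/* Load whole words from a byte stream into the store, starting at
   address start. Three bytes make one word, most significant byte
   first (assumed packing). Returns the number of words loaded. */
long load_words(FILE *f, long start)
{
    long addr = start;
    int b0, b1, b2;

    while (addr >= 0 && addr < STORE_WORDS) {
        b0 = getc(f);
        if (b0 == EOF)
            break;
        b1 = getc(f);
        if (b1 == EOF)
            break;
        b2 = getc(f);
        if (b2 == EOF)
            break;
        store[addr] = ((long)b0 << 16) | ((long)b1 << 8) | (long)b2;
        addr = addr + 1;
    }
    return addr - start;
}
```

An emulated peripheral transfer then reduces to ordinary byte-level file access, and the processor emulation to indexing into store.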

Java's view of memory is much more abstract, but still allows arrays of integers, which make a convenient representation of emulated main storage. Java has a multi-tasking model as an integral part of the language, whereas the thread facilities in C are a more recent feature of the language, and not necessarily supported on all platforms.

In our emulation of the ICL system, we have used C for emulating the main 1900 processor, and used Java for emulation of the communications processor (7903) for which the multi-tasking aspects are valuable. Although still imperfect in some respects, the system works well enough to evoke immediate recognition by those who know the original system, and to vindicate the techniques used in its construction. The emulation has run successfully on Win32, Irix, Solaris and Linux. We have not tried any other platforms.

Longevity

We seek to implement emulation in such a way as to have it function successfully in the long-term future. The C programming language has undergone evolution from its Kernighan and Ritchie base to its current ANSI standard. In addition there has been the development of C++ (which has C as a subset), and Java (where many of the more hygienic aspects of C are clearly visible). There is so much software currently in important use and written in C (like UNIX for instance), that its continued availability is not under threat for some decades to come.

However, we wish to address the longer term. I think it unlikely that some alternative programming paradigm (e.g. functional programming) will completely eclipse the traditional style. When we look at C we can observe that many of its features are to be found in other languages. Here I am concerned with features at the semantic level. As an example, the assignment statement exists in C, Algol60, Algol68, Pascal, Ada83/95 and Java, to name but a few. There is a syntactic difference in that C and Java have x = y, whereas the others have x := y. On the other hand, there are features of C that have been deliberately discarded in newer languages, e.g. macros, address arithmetic, variadic parameter lists.
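
To illustrate the distinction, the hedged sketch below writes the same operation twice: once with address arithmetic, one of the features discarded by newer languages, and once with only array indexing and assignment, which carry over almost unchanged. The function names are hypothetical.

```c
/* Sum n words using address arithmetic: idiomatic C, but with no
   direct equivalent in a language such as Java. */
long sum_by_pointer(const long *p, int n)
{
    long total = 0;
    const long *end = p + n;

    while (p < end)
        total += *p++;
    return total;
}

/* The same sum using only indexing and assignment, constructs that
   have close counterparts in Algol, Pascal, Ada and Java. */
long sum_by_index(const long a[], int n)
{
    long total = 0;
    int i;

    for (i = 0; i < n; i++)
        total = total + a[i];
    return total;
}
```

Both functions return the same result for the same input; only the second relies solely on features with long-lived equivalents elsewhere.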

I propose that we recommend a subset of C in which to write emulators for long-term preservation. My personal experience is that the amount of work involved is by no means excessive. Our emulation of the ICL1900 was achieved as a spare-time activity over a period of about 18 months. We are both of us in full-time employment.

Tentative proposals for selection of the subset are in Appendix A.

The expectation is that over time it will become necessary either to modify the subset if it turns out to contain features that are removed from the language (indicating a bad choice of subset), or to move the policy to use a subset of a different language.

Implementation

We need a tool to enable confirmation that any emulator to be used for preservation (afterwards just referred to as emulator) uses only the preservation subset of C (let us call it C – –). We can readily achieve this at the syntactic level by taking the official LALR grammar of C, and removing from it the bits which are not in C – –. If we do this using yacc, we can further add in semantic checks to detect other transgressions.
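
The yacc-based checker proposed above is not reproduced here. As a much cruder stand-in, the sketch below shows the kind of transgression such a tool would report, using a plain lexical scan in C. It looks for only two of the features mentioned earlier as candidates for exclusion (macro definitions and variadic parameter lists), skips traditional comments and string or character literals, and makes no attempt to parse the program. The function name count_transgressions is hypothetical.

```c
#include <string.h>

/* Count occurrences of "#define" directives and "..." in C source
   text, ignoring comments and string/character literals.
   Only traditional comments are recognised in this sketch. */
int count_transgressions(const char *src)
{
    int count = 0;
    int at_line_start = 1;  /* only white space seen so far on this line */
    size_t i = 0;

    while (src[i] != '\0') {
        char c = src[i];

        if (c == '/' && src[i + 1] == '*') {        /* skip a comment */
            i += 2;
            while (src[i] != '\0' && !(src[i] == '*' && src[i + 1] == '/'))
                i++;
            if (src[i] != '\0')
                i += 2;
            continue;
        }
        if (c == '"' || c == '\'') {                /* skip a literal */
            char quote = c;
            i++;
            while (src[i] != '\0' && src[i] != quote) {
                if (src[i] == '\\' && src[i + 1] != '\0')
                    i++;                            /* escaped character */
                i++;
            }
            if (src[i] != '\0')
                i++;
            at_line_start = 0;
            continue;
        }
        if (c == '#' && at_line_start) {            /* preprocessor line */
            size_t j = i + 1;
            while (src[j] == ' ' || src[j] == '\t')
                j++;
            if (strncmp(src + j, "define", 6) == 0)
                count++;
        }
        if (c == '.' && src[i + 1] == '.' && src[i + 2] == '.') {
            count++;                                /* variadic list */
            i += 3;
            at_line_start = 0;
            continue;
        }
        if (c == '\n')
            at_line_start = 1;
        else if (c != ' ' && c != '\t')
            at_line_start = 0;
        i++;
    }
    return count;
}
```

A grammar-based checker would reject such constructs as syntax errors rather than count them, and could carry the semantic checks that a scan like this cannot.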

A further opportunity is opened up by this approach. We must consider the time when C becomes computational Latin, and is replaced by another lingua franca (let us call it E). The C – – yacc parser could be the vehicle for implementation of software for the automatic translation of C – – emulators into E – –.

Of course, it is always possible that yacc may not last for ever either — but yacc is a C program, and could possibly be translated into C – – when the time came when it was no longer seen as part of the standard kit of parts. Making it generate E – – may be more problematic.

It is likely that the restrictions of C – – will make some things impossible. The desire to exclude variadic parameter lists would restrict the use of printf. We thus propose that a C – – program may require linkage with a small (and we stress small) section of code written in C. It is assumed that any migration away from C would involve hand coding of these small C sections. It may prove possible to make this C section common to more than one emulator.
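
One possible shape for such a small C section is sketched below, purely as an illustration. The variadic library calls are confined to fixed-parameter wrappers, so that the emulator proper never calls printf directly. The function names and the choice of an eight-digit octal rendering of a 24-bit word are assumptions, not part of any agreed interface.

```c
#include <stdio.h>

/* Format the low 24 bits of word as eight octal digits.
   buf must have room for nine characters (eight digits and the
   terminating null). The variadic sprintf is used only here. */
void format_octal_word(char *buf, long word)
{
    sprintf(buf, "%08lo", (unsigned long)(word & 0xFFFFFFL));
}

/* Write a label followed by a word in octal and a newline,
   for example as one line of a trace listing. */
void put_labelled_word(FILE *out, const char *label, long word)
{
    char digits[9];

    format_octal_word(digits, word);
    fputs(label, out);
    fputs(digits, out);
    fputc('\n', out);
}
```

The emulator would call put_labelled_word with a fixed parameter list; on migration away from C only these few lines would need hand coding.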

Appendix A:     C   –   C – –

Much more work remains to be done to decide the subset of C that is to be C – –, but the following aspects are meant to be examples of the intended difference between C – – and C.

Features for omission from C – –:

Restrictions on usage in C – –. In order to avoid being so restrictive that emulator implementation becomes overly burdensome, and to avoid protracted debates about contentious features, we suggest labelling some features as "deprecated", i.e. allowed, but discouraged. This approach to language development was introduced in the process that led to FORTRAN90, and is currently important in the development of Java.

Appendix B:     Emulation Anecdotes

The details of the ICL1900 implementation are (or will soon be) described elsewhere, but one or two experiences bolster the ideas in this paper, and are thus worth including here. The major lesson is the value of actually getting the emulation to work at the preservation stage, or at least while related material (including human memory) is still accessible.

The George3 operating system which runs under our emulator was written in assembler by a team of programmers. As a result it seems to use every quirk of the machine's order code at some point. The final breakthrough into reliable operation came when we implemented a property of the overflow register that was not hinted at in the summary chart, and was detailed only once in a thick four-volume manual. It seems likely that such a property might escape the specification process.

The source text of George3 was an invaluable reference from time to time. Some of the later features of the system's interfaces were not in the mainstream manuals, although they may have featured in software notices. The thought of reading through many, many hundreds of these was sufficient disincentive to make inspection of the source code a more fruitful way to investigate mysteries. One particular feature of the interface to the communications processor was only revealed by a comment in the source code, after which dim recollection of 25-year-old knowledge was sufficient.

We have deliberately steered clear of system-dependent features in our use of C. Each of the two authors routinely uses a different compiler (Visual C++ and Cygnus gcc), and from time to time checks out operability on other systems. We have factored out the parts that are necessarily platform specific.

David Holdsworth
December 2000