Dependability Research at Newcastle
A formal research project to investigate system reliability started at Newcastle in 1971, sponsored by the Science Research Council (as it was then called). The proposal that led to this project had its origins in a single (very long) coffee break, in the then-subterranean Computing Laboratory common room, during which Jim Eve and I talked over how to react to an impending visit from some Science Research Council officials - who had asked to visit the Laboratory to find out why Newcastle had not been applying for funding from them. By lunchtime we had the basics of an ambitious plan to study the problem of coping with residual (software) design faults. (This was at a time when the conventional academic research attitude to software faults was that there shouldn't be any!)
The Science Research Council were understandably taken aback by the level of ambition (i.e. cost) of our project proposal, and initially gave us just a small grant to undertake a survey of existing industrial practice related to fault-tolerant computing. The report that was produced also served as the basis for an invited talk at the 1971 IFIP Congress, entitled "Operating Systems: The Problems of Performance and Reliability" [Randell 1971].
We eventually got our major funding - it enabled us to employ several young Research Associates, two of whom have been at Newcastle ever since, namely Tom Anderson and Ron Kerr, and to purchase four PDP11/45 computers. (We first had to persuade the Department of Trade and Industry that we should be allowed to purchase something other than Modular-1 computers, something they were very reluctant to accept even when we showed them a letter we had obtained from Iann Barron, the chief designer of the Modular-1, agreeing with our choice of equipment.) The project grew, and soon involved such others as Jim Horning, Hugh Lauer, and Mike Melliar-Smith. It was these three and I who in 1974 produced the first paper on recovery blocks [Horning, Lauer et al. 1974]. The Newcastle team, and others elsewhere, went on to develop these ideas extensively - a detailed account of these developments, entitled The Evolution of the Recovery Block Concept was prepared by Jie Xu and myself recently as the chapter for the book "Software Fault Tolerance" [Randell and Xu 1995].
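The recovery block structure itself - a primary routine whose result is checked by an acceptance test, with one or more alternates tried in turn, each after the state has been restored to its entry checkpoint - can be sketched roughly as follows. This is a minimal illustration in Python, not taken from the 1974 paper; the sorting routines and the acceptance test are hypothetical examples.

```python
import copy

def recovery_block(state, alternates, acceptance_test):
    """Try each alternate on a checkpointed copy of the state; return the
    first result that passes the acceptance test (backward error recovery)."""
    for attempt in alternates:
        checkpoint = copy.deepcopy(state)  # recovery point established on entry
        try:
            result = attempt(checkpoint)
            if acceptance_test(result):
                return result
        except Exception:
            pass  # discard the failed attempt's state and try the next alternate
    raise RuntimeError("no alternate passed the acceptance test")

# Hypothetical example: a primary sort routine backed by a simpler
# insertion-sort alternate; the acceptance test checks orderedness.
def primary(xs):
    return sorted(xs)

def alternate(xs):
    for i in range(1, len(xs)):
        j = i
        while j > 0 and xs[j - 1] > xs[j]:
            xs[j - 1], xs[j] = xs[j], xs[j - 1]
            j -= 1
    return xs

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

print(recovery_block([3, 1, 2], [primary, alternate], is_sorted))  # [1, 2, 3]
```

If the primary raises an exception or produces a result that fails the acceptance test, its effects are discarded and the alternate runs against the original, checkpointed state.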
Another development of lasting interest has proved to be our work on improved definitions of the basic ideas involved in fault tolerance. In the 1970s hardware engineers took various particular types of fault (stuck-at-zero, stuck-at-one, etc.) which might occur within a system as the starting point for their definitions of terms such as system reliability and system availability. We felt in need of a more general set of concepts and definitions in order to deal with design faults. And of course we wanted these definitions to be properly recursive, so that we could adequately discuss problems that might occur either within or between system components at any level of a system.
The alternative approach that we developed, starting in 1977 [Melliar-Smith and Randell 1977], took as its starting point the notion of the failure of a system, or of a system component, to provide its intended services. Depending on circumstances, the failures of interest could concern differing aspects of the services. The ensuing generality of our definitions of terms thus led us to start using the term "reliability" in what most other people thought was an unacceptably general way, for example to include "safety" and "security" as special cases! It was our French colleague, Jean-Claude Laprie of LAAS-CNRS, who came to our linguistic and conceptual rescue by proposing the use of the term "dependability" instead, so that "reliability" could retain its more conventional meaning [Laprie 1985].
Work has continued at Newcastle on dependability ever since. Tom Anderson, who had gone on to lead a major project on evaluating recovery blocks, using a Naval Command & Control application as the testbed [Anderson, Barrett et al. 1985], later set up the Centre for Software Reliability (CSR). This Centre has since then housed a growing programme of research, including Newcastle's contribution to the joint Newcastle/York British Aerospace-funded Dependable Computing Systems Centre. CSR has also been very active in technology transfer, operating two industrial "community clubs", originally sponsored by the DTI but now self-financing, with a total of 2,500 members.
Another outgrowth of the original reliability project was Newcastle's now separate, but still closely-related, research on Distributed Systems, led by Santosh Shrivastava. The reliability project had initially concerned itself with sequential programs, then with concurrent ones, and then with the problems of fault tolerance in distributed systems - almost by accident we invented the Newcastle Connection - but that story is told in the account of the work of our Dependability Research Group.
Work on security started in 1979, when we were invited by the then Royal Signals and Radar Establishment (now the Defence Evaluation and Research Agency) of the Ministry of Defence to undertake a critical survey of current work in the (American) academic research community. The result was a (very!) critical survey by John Rushby; this paper was one of the earliest to argue for taking "proof of separation" as a basic starting point in building secure systems. This idea, and our Newcastle Connection concept, led to the Distributed Secure Systems concept - an idea that RSRE initially reacted to sceptically, but then classified for several years, during which they developed the idea very considerably in the MoD's first-ever Information Technology Demonstrator Project.
In 1986 John Dobson returned to the Laboratory from MARI (the Microelectronics Applications Research Institute, now MARI Ltd., which we, the London-based software house Computer Analysts & Programmers Ltd., and Newcastle Polytechnic (as it then was) had created in 1979), to work on security funded by the MoD. One early result was a paper generalising the Distributed Secure Systems ideas, entitled (with a conscious bow towards von Neumann) "Building Reliable Secure Systems out of Unreliable Insecure Components". John's work on security modelling expanded, and has led to several European-funded research projects on the more general topic of Enterprise Modelling - a method of requirements analysis that he, Mike Martin and other colleagues have applied to several application areas, including medical informatics and telecommunications systems.
The most direct line of descent from the original reliability project, which had had funding over many years from the Science Research Council and its successors, and from the Ministry of Defence, has been our work within a succession of large collaborative projects, directed by Newcastle, and funded by the ESPRIT Basic (now Long Term) Research Programme, namely PDCS and PDCS2 (Predictably Dependable Computing Systems) and now DeVa (Design for Validation). The best single reference describing this work is what we always refer to internally as "PDCS - The Book" [Randell, Laprie et al. 1995]. However, our own present work within the DeVa project centres on Coordinated Atomic Actions [Xu, Randell et al. 1995]. These are a generalisation/combination of the notion of a "nested multithreaded transaction" (as used for protecting shared resources that are being competed for), and of "conversations" (a scheme for structuring the provision of forward, as well as backward, error recovery among cooperating concurrent processes).
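For illustration, the backward-error-recovery half of this idea - participants entering an action together, each checkpointing its state on entry, and all rolling back if any one of them fails - can be sketched as follows. This is a toy Python sketch under simplifying assumptions (no nesting, no forward recovery via coordinated exception handling, and dictionary-valued participant states), not an implementation of the scheme described in [Xu, Randell et al. 1995].

```python
import threading

class CoordinatedAction:
    """Toy sketch: participants enter an action together, checkpoint their
    local state, and if any participant fails, all of them roll back."""

    def __init__(self, n_participants):
        self.barrier = threading.Barrier(n_participants)
        self.failed = threading.Event()

    def run(self, participant, state):
        checkpoint = dict(state)   # per-participant recovery point
        self.barrier.wait()        # coordinated entry into the action
        try:
            participant(state)
        except Exception:
            self.failed.set()      # any fault dooms the whole action
        self.barrier.wait()        # coordinated exit: agree on the outcome
        if self.failed.is_set():
            state.clear()
            state.update(checkpoint)  # backward recovery to entry state
        return state

# Two participants; one raises, so both are rolled back on exit.
action = CoordinatedAction(2)
s1, s2 = {"x": 0}, {"y": 0}

def ok(state):
    state["x"] = 1

def bad(state):
    state["y"] = 1
    raise ValueError("simulated fault")

t1 = threading.Thread(target=action.run, args=(ok, s1))
t2 = threading.Thread(target=action.run, args=(bad, s2))
t1.start(); t2.start(); t1.join(); t2.join()
print(s1, s2)  # both rolled back: {'x': 0} {'y': 0}
```

A real Coordinated Atomic Action would additionally let the participants cooperate on forward error recovery (e.g. by resolving and handling a common exception inside the action) before resorting to rollback; this sketch only captures the all-or-nothing, conversation-like boundary.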
Let me finish this brief account by repeating the observation I happily made some years ago: the more computing systems are made dependable, the more dependent the world will become on them - hence there will always be a need for further research on the subject!
1. T. Anderson, P.A. Barrett, D.N. Halliwell and M.R. Moulding, "Software Fault Tolerance: An evaluation," IEEE Trans. Software Engineering, vol. SE-11, no. 12, pp. 128-134, 1985.
2. J.J. Horning, H.C. Lauer, P.M. Melliar-Smith and B. Randell, "A Program Structure for Error Detection and Recovery," Lecture Notes in Computer Science, vol. 16, pp. 177-193, 1974.
3. J.C. Laprie. "Dependable Computing and Fault Tolerance: Concepts and terminology," in Proc. 15th IEEE Int. Symp. on Fault-Tolerant Computing (FTCS-15), pp. 2-11, Ann Arbor, Michigan, 1985.
4. P.M. Melliar-Smith and B. Randell. "Software Reliability: The role of programmed exception handling," in Proc. Conf. on Language Design For Reliable Software (ACM SIGPLAN Notices, vol. 12, no. 3, March 1977), pp. 95-100, Raleigh, ACM, 1977.
5. B. Randell. "Operating Systems: The problems of performance and reliability," in Proc. IFIP Congress 71 (vol. 1), pp. 281-290, Ljubljana, Yugoslavia, North-Holland, 1971.
6. B. Randell, J.-C. Laprie, H. Kopetz and B. Littlewood, (Ed.). Predictably Dependable Computing Systems, Berlin, Springer-Verlag, 1995, 588 p.
7. B. Randell and J. Xu. "The Evolution of the Recovery Block Concept," in Software Fault Tolerance, ed. M. Lyu, pp. 1-22, John Wiley & Sons Ltd, 1995.
8. J. Xu, B. Randell, A. Romanovsky, R.J. Stroud and Z. Wu. "Fault Tolerance in Concurrent Object-Oriented Software through Coordinated Error Recovery," in Proc. 25th Int. Symp. on Fault-Tolerant Computing (FTCS-25), Los Angeles, IEEE Computer Society Press, 1995.