(Early) Distributed Computing Research at Newcastle

Santosh Shrivastava

In 1975 I joined Brian Randell's 'Reliability Research' group as a research associate (RA). I joined a crowd of RAs that was later to include Fabio Panzieri with whom I did my early work on distributed computing. We all had offices on the sixth floor of the Daysh Building. The Cambridge Ring had been just invented and thanks to SERC's Coordinated Programme on Distributed Computing (a wonderful and timely initiative from SERC - now called EPRSC - for supporting research in distributed computing) a Ring was installed on the Daysh Building. The accompanying software left much to be desired. (Of course, as we have come know, this is the normal state of affairs). It was almost possible for you to type a few characters on a workstation (PDP11/23 in our case) and run to the workstation in the next office in time to see the characters appearing. Fabio and I decided to investigate this matter, and a very fruitful partnership began. Our first joint document was on the description of the low level Ring protocol called the Basic Block Protocol (BBP) [1]. We were soon working on the design and implementation of remote procedure call (RPC) protocol [2] and produced useful software for invoking remote services. The crowd in the Daysh Building got to know of this software, and started experimenting with it.

Soon the sixth floor of Daysh was buzzing with activity with group discussions going on in the corridors and offices, with Brian managing to be present simultaneously everywhere. David Brownbridge, Lindsay Marshall experimented with the crazy idea of gluing Unix systems together to provide a single Unix abstraction [3] and John Rushby (now a senior scientist at SRI, but then an RA like me) used our RPC system to build a distributed secure system [4]. Fabio and I felt slightly miffed, as these systems were far more interesting than our RPC! However, we are very proud of having played - no matter how minor - a part in the creation of two very influential distributed systems that were invented here. Fabio and I went on to work on the problems of failure handling, orphan detection and termination for RPCs [5].

In the mid eighties, largely as a response the Japanese Fifth Generation Programme, the UK government initiated a multi-million Pound university-industry collaborative research programme in information technology (the 'Alvey Initiative'). I was able to acquire a modest grant on 'reliable distributed computing' from EPSRC under this initiative. C++ language had just been invented at Bell Labs, and a pre-processor for C was available freely. This was reason enough for us to start work on reliable distributed programming in C++ , and we chose atomic transactions as the main fault tolerance and consistency preserving mechanism. By late eighties, we had developed a tool kit for building reliable distributed applications using objects and transactions [6]. We made the software - named Arjuna - accessible via FTP and announced its availability on various news groups. We produced a string of papers and Phd dissertations, and continued refining our tool kit (see [7] for a detailed description of the system). In the mean time our group took on a life of its known, separate from the original Reliability group. Arjuna is now a mature technology; our own University uses Arjuna for the mission-critical task of electronic registration of students [8]. Two members of the original team of four that played crucial role in building Arjuna have left us; Graeme Dixon works for Transarc where he is (naturally) responsible for developing distributed transaction technology and Graham Parrington works for Sun Microsystems where he is developing fault-tolerant workstation clusters. Stuart Wheater and Mark Little are working here as senior RAs. We have developed CORBA-compliant versions of Arjuna in C++ and Java and the thrust of our research has moved on to using Arjuna for building higher level tool kits for composing and managing distributed applications. Here we are exploring construction of middleware services and frameworks and integration of our technology in Web based services and industrial platforms intended for telephone switching, electronic commerce and network management. In addition we are conducting research on bringing object technology to the Web [9], making use of our earlier work on object support system called Shadows [10]. Our research is being supported by a number of industrial grants, ESPRIT and EPSRC (for more information, visit our Arjuna web site).

Let me now describe our second (parallel) research activity called Voltan. A dominant assumption made in software-implemented fault-tolerant mechanisms, such as message logging, checkpointing, transactions, is that the processing elements will suffer only crash failures, i.e., a processing element will either perform correct state transitions or will crash by ceasing to function. To meet this assumption in a realistic manner, some form of self-checking facility will be required within an element to detect a faulty state transition and stop the element from producing any further outputs. It is nevertheless true that in most non-safety critical applications, it is common to assume that conventional processors, without any self-checking capabilities, will suffer only crash failures. In the Voltan group, we are interested in applications where self-checking is deemed necessary. It is prudent to design and implement such systems/applications under a highly unrestricted fault assumption, namely, that a failed processor and its processes, in principle, can behave in a fail-uncontrolled manner (in the literature this failure mode is often referred to as the Byzantine failure mode). While certainly not common, experience has shown that Byzantine failures cannot be ruled out in the design of fault-tolerant systems.

In 1982 I was on sabbatical at Indian Institute of Science, Bangalore, where I met Paul Ezhilchelvan who was completing his Masters there and showed great interest in working on protocols for tolerating Byzantine failures. Paul was able to join the Reliability group in 1983 as an RA and he and I began work on Byzantine failure-tolerant processor architectures. Later on we were joined by Neil Speirs and with the help of a few Phd students we developed three processor failure-masking and two processor fail-silent Voltan nodes [11,12]. We believe that we have cracked the problem of building such nodes entirely in software - a major advantage of our approach as the design of hardware-assisted self-checking capability is a very expensive and time-consuming activity. We are now exploring the application of Voltan technology in realistic industrial settings.

Looking back I realise now that my early years as an RA on the Daysh building were amongst the best. I formed lasting frendships with a number of scientist who happened to be passing by the Daysh. In addition to Fabio, I would particularly like to mention Flaviu Cristian, Jean-Pierre Banatre, Ike Best and John Rushby.

References:

[1] A Structured Description of Protocols for the Cambridge Ring (46K Postscript Document)

[2] S.K. Shrivastava and F. Panzieri, "The design of a reliable remote procedure call mechanism", IEEE Trans. on Computers, 31,7, pp. 692-697, July 1982.

[3] D. Brownbridge, L. F. Marshall, F. Panzieri, S. K. Shrivastava, C. R. Snow, B. Randell and H. Whitfield, "Naming In A Set Of Linked Hierarchical Systems or Unixes of the World, Unite!"

[4] Rushby/Randell: distributed secure systems..

[5] F. Panzieri and S. K. Shrivastava, "Rajdoot: a remote procedure call mechanism supporting orphan detection and killing", IEEE Trans. on Software Eng. 14, 1, pp. 30-37, January 1988.

[6]S.K. Shrivastava, G.N. Dixon, F. Hedayati, G.D. Parrington and S.M. Wheater, "A technical overview of Arjuna: a system for reliable distributed computing", Proc. of UK IT Conference, Swansea, July 1988, pp. 601-605.

[7] G.D. Parrington, S.K. Shrivastava, S.M. Wheater and M. Little, "The design and implementation of Arjuna", USENIX Computing Systems Journal, vol. 8 (3), pp. 255-308, Summer 1995.

[8] M. Little et al, "Using Arjuna to Implement the University Student Registration System" Tech. Report

[9] S.J. Caughey and S.K. Shrivastava, "Architectural support for mobile objects in large scale distributed systems", Proc. of 4th IEEE IWOOOS, pp. 38-47, Lund, August 1995.

[10] D. Ingham, M. Little, S.J. Caughey and S.K. Shrivastava, "W3Objects: bringing object-oriented technology to the Web", 4th Intl. WWW Conf., Boston, Dec 1995. The World Wide Web Journal, pp. 89-105.

[11] S.K. Shrivastava, P.Ezhilchelvan, N.A. Speirs, S. Tao and A. Tully, "Principal features of the VOLTAN family of node architectures for distributed systems", IEEE Trans. on Computers, 41, 5, May 1992, pp. 542-549.

[12] F. Brasileiro, P D Ezhilchelvan, S. K. Shrivastava, N. Speirs and S. Tao, "Implementing fail-silent nodes for distributed systems", IEEE Trans. on Computers vol. 45, no. 11, November 1996, pp. 1226-1238.

Contents Page - 40 years of Computing at Newcastle

Distributed Systems, 26 June 1997