THERAC-25 ACCIDENTS
 
Adapted from a proprietary report from the AECL Research Co., May 1994
 

The N. G. Leveson et al. reference is a very complete account of these accidents, although it is rather heavily slanted towards procedure and legality rather than concentrating on the technological lessons to be learned. It is significant that the account is written by a Professor of Computer Science and Engineering and a Ph.D. candidate in Information and Computer Science who is also a lawyer, despite the fact that the disciplines most involved are control system engineering, reliability engineering [particularly redundancy planning], and operations, rather than software engineering.

The Therac-25 is a machine that uses radiation from an electrically powered accelerator to destroy malignant human tissue. Overall, it is in a high class of safety criticality, because of the obvious possibility that it might destroy the wrong tissue. On the other hand, it is unconditionally safe to cut off the beam, and this can be done instantaneously, so that the fail-safe principle can be readily applied.

In the use of a total of eleven machines at different places, six serious accidental overexposures of patients occurred over a period of about 19 months. It seems fairly certain that five of these were the result of one obscure error in the computer program. It affected only one of many alternate paths through a considerable number of subroutines, and timing was of the essence. It only caused trouble when the operator happened to correct a wrong keyboard entry quickly at a particular stage in the operating sequence. The sixth overexposure resulted from a more straightforward programming error: the failure to prevent or cope with a counting overflow in an assembly language program. It only caused trouble when a key was pressed at a particular instant in the sequence of operations.
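The counting overflow can be illustrated with a minimal sketch. It assumes, purely for illustration, a one-byte flag that is incremented on every pass of a setup loop and tested for non-zero to decide whether a check runs. The names `increment_flag`, `set_flag` and `passes_that_skip_check` are hypothetical and are not taken from the Therac-25 program. Python integers do not overflow, so the wrap-around is simulated by masking to 8 bits.

```python
BYTE_MASK = 0xFF  # simulate a one-byte memory cell


def increment_flag(flag):
    """Hazardous pattern: mark the flag as 'set' by incrementing a one-byte cell."""
    return (flag + 1) & BYTE_MASK


def set_flag(_flag):
    """Safer pattern: assign a fixed non-zero value, so no overflow can occur."""
    return 1


def passes_that_skip_check(update, n_passes):
    """Return the pass numbers on which the flag reads zero and the check is skipped."""
    flag = 0
    skipped = []
    for i in range(1, n_passes + 1):
        flag = update(flag)
        if flag == 0:  # zero is read as "no check needed"
            skipped.append(i)
    return skipped


print(passes_that_skip_check(increment_flag, 600))  # [256, 512]
print(passes_that_skip_check(set_flag, 600))        # []
```

With the incrementing version the check is silently skipped on every 256th pass, so the fault only shows itself if some other event, such as a key press, coincides with that pass. This is why such an error is a rare event in operation and hard to reproduce.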

The harmful manifestations of both errors were therefore rare events, and this made them very difficult to diagnose. Also, troublesome limit switches were at first suspected of being the cause. Both these situations are familiar and understandable in control engineering.

Both errors were clearly definable errors written into the computer program and were thus pure software errors in the narrow sense. Such slips occur continually in the early stages of the design of complex systems, in documents, drawings, hardware or software. The important issues are how the system as a whole failed to be designed to guard against their life-threatening consequences and how the errors came to escape detection in the overall engineering process.

The long and involved description, on pages 29 and 30 of the account, of the first error and how it came to cause trouble indicates a complex program. It was probably unnecessarily complex; the account makes it clear that the program was fragmented into dozens of separately called procedures or subroutines. This is exactly in accordance with conventional programming. The account makes it seem likely that an unconditional linear progression of the main line of action could have been followed, with any resumption after a pause or entry correction going right back to the beginning. If that had been done, the program pointer could not have found "sneak paths" that bypassed vital functions, which was what actually happened in the first five accidents.
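The "unconditional linear progression" suggested above can be sketched as follows. This is an illustration under stated assumptions, not a reconstruction of the Therac-25 program: the step names and the `run_setup` function are hypothetical. Every step runs in a fixed order, and any operator correction sends the sequence back to the first step, so that no step can be bypassed by re-entering the sequence partway through.

```python
def run_setup(steps, edits_at):
    """Run every step in order; an edit during any step restarts from the first step.

    steps    -- ordered list of step names (hypothetical)
    edits_at -- set of (pass_number, step_index) pairs at which the operator
                corrects an entry; used here only to simulate operator input
    Returns a log of the steps executed on each pass.
    """
    log = []
    pass_number = 0
    while True:
        pass_number += 1
        executed = []
        restarted = False
        for index, name in enumerate(steps):
            executed.append(name)
            if (pass_number, index) in edits_at:
                restarted = True  # discard partial progress, start over
                break
        log.append(executed)
        if not restarted:
            return log


steps = ["read_entry", "set_mode", "position_turntable", "verify_position"]
log = run_setup(steps, {(1, 2)})  # operator edits during step 2 of the first pass
for executed in log:
    print(executed)
```

Whatever edits occur, the last pass in the log always contains every step in its original order. The cost is some repeated work after each correction, which is a small price in a sequence where cutting off and restarting is always safe.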

The account indicates that a single programmer produced the program as a development of an earlier hardwired control system. It is stated that "In the Therac-25, software checks were substituted for many traditional hardware interlocks". This is clearly a matter of system design engineering, with particular reference to the use of redundancy. There is a clear inference that the programmer took over direction of the system design.

The organizational situation has much in common both with the Darlington problems. In all these cases elements of a previous satisfactory hardwired control system design were replaced by computers. In all three cases, control and direction of the design of the control system passed from the realm of control system engineering into that of "computer science", with a clear interruption of continuity from previous successful technology.

According to the account, the correct diagnosis of the main program error was delayed because it was at first suspected that limit switches on the turntable had failed. The proper design, adjustment and maintenance of limit switches is a more difficult branch of control engineering than any software engineering that they are likely to be coupled with, and has been the subject of very much less theoretical study. The suggestion was also made that a position transmitter [potentiometer] should be added to give a redundant indication of the position of the turntable. Both of these are important and familiar issues in control system engineering.
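The redundant position indication mentioned above can be sketched as a cross-check between two independent sensors. Everything specific here is an assumption made for illustration: the position names, the nominal potentiometer readings, the tolerance and the function `beam_permitted` do not come from the account. The sketch permits the beam only when the limit switch and the potentiometer both confirm the requested position. Any disagreement or missing reading leaves the beam off, which is the fail-safe choice because cutting off the beam is always safe.

```python
# Hypothetical nominal potentiometer readings (arbitrary units) for each position.
NOMINAL_READING = {"electron": 10.0, "xray": 50.0, "field_light": 90.0}
TOLERANCE = 2.0  # assumed allowable deviation from the nominal reading


def beam_permitted(requested, switch_position, pot_reading):
    """Return True only if both independent indications confirm `requested`."""
    if requested not in NOMINAL_READING:
        return False  # unknown request: stay off
    if switch_position != requested:
        return False  # limit switch disagrees with the request
    if pot_reading is None:
        return False  # potentiometer reading missing: stay off
    return abs(pot_reading - NOMINAL_READING[requested]) <= TOLERANCE


print(beam_permitted("xray", "xray", 50.5))         # True
print(beam_permitted("xray", "xray", 89.0))         # False
print(beam_permitted("xray", "field_light", 50.0))  # False
print(beam_permitted("xray", "xray", None))         # False
```

The function has a single path that returns True and every other path returns False, so a failure of either sensor can only withhold the beam, never release it.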

Both software errors were made much more difficult to detect in the ordinary engineering design process because the programs were written in assembly language. This seriously impairs the wide and deep review that is such an essential requirement in safety-critical applications. Quite probably, because of this, nobody but the single programmer referred to, and possibly a few other programmers concerned with maintenance, would have studied the program at all thoroughly.

 

Conclusions:

1. The system design should have included a redundancy plan, which would have included redundant testable guards against the clear danger of starting up the beam with the turntable or beam attenuators in the wrong position. This is a matter of system design and reliability engineering. Whether these requirements were implemented by hardware or software was not a primary issue.

2. The software errors should have been picked up in the normal engineering process, but this was made difficult because the combination of the programs' complexity and the use of assembly language meant that they could not be reviewed by most of the people who should have been able to do so.

3. Although the account does not give enough information for a definite judgment, it seems likely that the program followed conventional practices, leading to much unnecessary complication. The authors of the account list "Designs should be kept simple" as a "basic software engineering principle", but, characteristically, it ranks third after two preceding bureaucratic items.

 
File:therac.doc
Date:12jan98/wfsp
 
