
When Faulty Software Turns Deadly

Posted by Rob Ludwick

Last year my wife and I returned from a week-long conference to our home in Boise, and then I immediately flew back out to Kansas City, practically hopping from one plane to the next, living the dream of a software contractor.

When did we decide that a software failure was okay for a refrigerator?


My wife called me the next day and told me that the fridge had stopped working. The buttons and display were inoperative and the fridge was warm. Literally, she had to 'reboot' it by sliding it out to get to the outlet, unplugging it and plugging it back in.

And magically the fridge started working again.

As it turns out, the refrigerator had failed to cool anything for the entire week we were gone. The cause was not mechanical, nor was it electrical. No, the problem was with software.

So as I was listening to my wife tell me all about the horrors that were growing in the crisper drawer, while I was safe and sound in my hotel room 2,000 miles away, two thoughts came to mind. The first was how lucky I was that I was not there to clean up that unholy mess. Because if you’ve ever had this happen to you, I don’t have to tell you. You know.

And the second was, when did we decide that a software failure was okay for a refrigerator? I know it may be hard to believe, but there was a time when software did not exist in a fridge. Refrigeration was purely mechanical at one point. A motor, a compressor, and a temperature-sensitive spring — that’s it.

But over time, things got more complicated. We added electronics, and microprocessors, and software, under the belief that software somehow magically adds value to the consumer’s experience.

And it turns out that it does when the software works. But when it doesn’t work, the value added is the large furry black mold growing in the jar of mayonnaise. Of all the ways a fridge could die, I never thought it would be software-related.

It’s one thing when it’s just a fridge. It’s an entirely different story when people are hurt or killed.

Therac-25

The Therac-25 was a radiation therapy machine built in 1982, a successor to the Therac-20.

In the Therac-25, the engineers decided that everything should be managed by computer, and that decision came with two major flaws. First, they replaced the hardware interlocking safety measures with software interlocks. And second, they reused the software from the Therac-20, the previous model.

Malfunction 54 meant there was either an overdose or an underdose, and the software couldn’t figure out which.


As it turns out, the software interlocks could fail because of race conditions, exposing bugs in the old Therac-20 software whose effects would have been prevented by the hardware interlocks. The turntable could get into an unknown state, and the electron beam could fire in x-ray mode without the x-ray target in place, giving patients massive doses of radiation.
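
None of the code below is from the Therac-25; the real software has been analyzed at length elsewhere. It is only a minimal Python sketch, with invented names, of the general failure mode described above: a check-then-act race, where a safety check reads shared state that another thread changes before the action completes. A hardware interlock is immune to this, because it gates on the physical configuration at the moment of firing rather than on a value read earlier.

```python
import threading
import time
from dataclasses import dataclass

# Hypothetical state for a simplified treatment machine (illustration only).
@dataclass
class MachineState:
    mode: str = "electron"         # "electron" (low power) or "xray" (high power)
    target_in_place: bool = False  # the x-ray target attenuates the beam

state = MachineState()

def software_interlock_ok(s: MachineState) -> bool:
    # Interlock rule: the high-power beam may only fire through the target.
    return s.mode != "xray" or s.target_in_place

def setup_and_fire():
    if software_interlock_ok(state):   # the check passes...
        time.sleep(0.05)               # ...then slow turntable/setup work runs...
        print(f"FIRING: mode={state.mode}, target_in_place={state.target_in_place}")
    else:
        print("Interlock tripped, not firing.")

def rapid_edit():
    time.sleep(0.01)
    state.mode = "xray"                # ...while a rapid edit changes the mode.

t1 = threading.Thread(target=setup_and_fire)
t2 = threading.Thread(target=rapid_edit)
t1.start(); t2.start()
t1.join(); t2.join()
# Typical output: "FIRING: mode=xray, target_in_place=False" -- an unsafe
# configuration that the check was supposed to rule out.
```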

In one case, a patient was diagnosed with skin cancer on the side of his face. The technician entered a prescription into the Therac-25 for a low dose of electron radiation and enabled the beam. The patient screamed as the machine made a loud sound that was heard over the intercom. The Therac-25 then showed an error message, 'Malfunction 54,' and stopped.

The tech ran into the room and asked what had happened, and the patient said that he saw a bright flash of light and his face felt like it was on fire. When the tech reported the error message to the manufacturer, the manufacturer said that Malfunction 54 meant there was either an overdose or an underdose, and the software couldn’t figure out which.

It took some effort for the technician to reproduce the error, but if the prescription information was entered rapidly enough, Malfunction 54 could be triggered on demand. When this was reported to the manufacturer, the manufacturer eventually reproduced it as well and measured the dose at the center of the beam to be about 25,000 rads, roughly two orders of magnitude higher than the prescription.

The patient died about 3 weeks after the treatment. Autopsy records showed significant radiation damage to the right front temporal lobe and to the brain stem.

Overall, four people died and two more were seriously injured in six incidents between 1985 and 1987.

March 2002, Fort Drum, NY

In 2002, I worked for a large corporation named Raytheon. And personally, I found it a fascinating place to work at the time. It was just six months after the 9/11 attacks, and the U.S. was on a war footing. It was an anxious time, and the vast majority of US citizens were supportive of the president’s use of military force.

In Fort Drum that fateful day, the field artillery unit was training for war.


At that time, especially right after 9/11, there was a clear mission. Everyone knew what it was, and everyone was marching to the same beat. On the morning of September 11th, we all knew war was coming, and the program I was working on, AFATDS, would be used for the first time in a major combat scenario.

AFATDS was developed to do two things: increase the effectiveness of the artillery force, and prevent friendly fire. Friendly fire is one of the worst aspects of war. It’s one thing to have one side shooting at you; it’s another when you have both sides shooting at you.

In Fort Drum that fateful day, the field artillery unit was training for war, and the commanding officer was demanding a lot of his soldiers. He wanted a round fired immediately. The AFATDS operator chose a target on the system, and a screen came up with some details about where the target was. The commanding officer was getting impatient; he wanted his artillery to fire now. The operator clicked on a window, confirming the details about the target, and the target was sent to the artillery gun.

A minute or so later, the round landed on a mess tent. One soldier was killed immediately. Another died of his injuries several weeks later. An additional 13 were injured.

In 2003, the Army cleared Raytheon and the AFATDS program of fault, and so I continued working there, blissfully ignorant of the true underlying details. It took until 2008 for the Fort Wayne Journal Gazette to publish the details of what actually happened.

As it turned out, the target altitude in the AFATDS window was set to 0 meters, which was the default if an elevation wasn’t provided with the target. But the actual altitude of the target was 200 meters. For a ballistic trajectory, that difference meant the point of impact could be off by more than 1,000 meters.
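
To see why a 200-meter altitude error turns into a kilometer-scale miss, a rough bit of geometry helps. This is not a ballistics model, and the descent angles below are assumptions chosen only for illustration: near the target, the round is descending at some angle, so an error in the assumed ground altitude shifts the impact point horizontally by roughly the altitude error divided by the tangent of that angle.

```python
import math

# Back-of-the-envelope only: horizontal shift of the impact point when the
# assumed ground altitude is wrong, for a few assumed descent angles.
altitude_error_m = 200.0
for descent_angle_deg in (10, 20, 30):
    shift_m = altitude_error_m / math.tan(math.radians(descent_angle_deg))
    print(f"{descent_angle_deg:2d} deg descent -> ~{shift_m:.0f} m of range error")
# ~1134 m at 10 degrees, ~549 m at 20 degrees, ~346 m at 30 degrees
```

For a shallow, low-angle trajectory, that simple relationship alone accounts for an error of well over 1,000 meters.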

AFATDS had opened a form for the operator, but instead of requiring the operator to enter the elevation, the altitude field was pre-filled with 0. But 0 is a valid altitude. And if the operator clicked the OK button, the software accepted the value.
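
Here is the trap in miniature. This is not AFATDS code, and the field names, bounds, and defaults are invented for illustration; the point is that once the pre-filled sentinel is itself a legal value, validation cannot distinguish "the operator confirmed an altitude of 0" from "the operator never entered an altitude at all."

```python
# Hypothetical fire-mission form (illustration only; names and bounds invented).
DEFAULT_ALTITUDE_M = 0   # 0 is the pre-filled default -- and also a valid altitude

def validate_mission(form: dict) -> bool:
    altitude = form.get("target_altitude_m", DEFAULT_ALTITUDE_M)
    # This check passes whether the operator typed 0 on purpose or simply
    # clicked OK past the pre-filled default.
    return isinstance(altitude, (int, float)) and -400 <= altitude <= 9000

# The operator never touches the altitude field; the true altitude is 200 m.
form = {"target_id": "hypothetical", "target_altitude_m": DEFAULT_ALTITUDE_M}
print(validate_mission(form))   # True -- the bad value sails through
```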

It might have felt better had it been the first time this issue occurred — but it wasn’t. In 1998, four years earlier, the same thing happened at Fort Bragg, except they caught the problem before the rounds were fired.

Further, within a week of the accident, the bug was fixed.

After the dust raised by the newspaper article settled, I asked a test engineer about the decision to use 0 as the default altitude, and he said, "When the operator sees that zero, that tells him to enter an altitude." I was stunned, because that was completely the wrong UI experience for the operator. Put simply, if the program needs something from the user, it needs to ask for it.
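
"Ask for it" is easy to express in code. Again, this is only a sketch with invented names, not anything from the real system: model the altitude as absent until the operator supplies it, so that confirming a mission with no altitude is a hard stop rather than a silent 0.

```python
from typing import Optional

# Sketch only: None means "not yet entered," which is distinct from a real 0.
def confirm_altitude(target_altitude_m: Optional[float]) -> float:
    if target_altitude_m is None:
        # Push back instead of defaulting: the operator must supply a value.
        raise ValueError("Target altitude is required; please enter a value.")
    return target_altitude_m

print(confirm_altitude(0.0))     # fine: the operator explicitly entered 0 meters
try:
    confirm_altitude(None)       # no silent default; the form refuses to proceed
except ValueError as err:
    print(err)
```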

Lessons Learned

First, we need to realize when a software failure creates an unsafe condition. And in those cases, software development should shift from a trial-and-error style of development to strong mathematical proofs of correctness, known as formal verification. Unfortunately, such methods are complex and expensive to implement, and they can’t be adopted overnight. But they do provide hard proof of a program’s correctness.
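
To give a flavor of what "hard proof" means, here is a toy sketch in Lean, far simpler than any real radiotherapy controller and entirely hypothetical: the interlock rule is written down as a theorem about the program’s state, and the compiler refuses the proof unless the property actually holds.

```lean
-- Toy model only: a two-mode machine and one safety property.
inductive Mode where
  | electron
  | xray

structure MachineState where
  mode          : Mode
  targetInPlace : Bool

-- The safety property: firing in x-ray mode requires the target to be in place.
def safeToFire (s : MachineState) : Prop :=
  s.mode = Mode.xray → s.targetInPlace = true

-- A setup routine that inserts the target whenever it selects x-ray mode.
def setupXray : MachineState :=
  { mode := Mode.xray, targetInPlace := true }

-- This theorem only compiles because the property really does hold for the
-- state that setupXray produces; there is nothing left to go wrong at runtime.
theorem setupXray_safe : safeToFire setupXray := by
  intro _
  rfl
```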

For every benefit that software gives us, there is also usually a failure case.


Second, we need to understand that the human interface is a layer of communication between the user and the programmer. If that communication is not clear, the confusion caused by the interface can lead to an unsafe state. While these kinds of issues may be easier to fix, they require careful testing to make sure the people who use them can operate them accurately, especially under stress.

And in every case, we need to understand that software is not a magic bullet. For every benefit that software gives us, there is also usually a failure case.

Lastly, and unfortunately, with software we more often learn from our failures than from our successes. And that will continue to be true for the foreseeable future.