Now that we know some of the causes, what exactly was the software crisis? It is a period of time marked by bungled software projects, whose common problems included:

  • Projects that ran over-budget
  • Projects that ran over-time
  • Software that made inefficient use of calculations and memory
  • Software was of low quality
  • Software that failed to meet the requirements it was developed to meet
  • Projects that became unmanageable and code difficult to maintain
  • Software that never finished development

Let’s review some representative examples.

The IBM OS/360 Project (1963-1965)

This ambitious project was undertaken by IBM to create a single operating system that would run on all mainframes manufactured by IBM. While operating systems that can run on different models of computers are commonplace today, at one time this was a very novel idea. IBM was offering half a dozen mainframe products at the time, and each had is own, incompatible, operating systems. On April 7th, 1964 IBM announced a new lineup of products - 6 models of computers, with 44 peripherals, and all were to be compatible with each other! The new OS/360 would make this feat possible.

But the software development staff struggled to accomplish the central goal of OS/360 - the ability to run more than one program at a time, and other issues began to crop up. IBM responded by hiring 1,000 more developers to assist the project, spending more in a single year than had been planned for the entire product development process! The operating system was eventually shipped, but a month late and far over budget.1

The Therac-25 (1985-1987)

Perhaps the most chilling software failure in the history of Computer Science is the Therac-25, a machine to deliver radiation therapy to cancer patients. The software used was repurposed from the earlier Therac-6 and Therac-20 model, which had hardware interlocks to prevent certain kinds of failure conditions. The Therac-25 removed these in favor of software-based safeties. A race condition in the controller software, along with poor user interface design and assorted other issues could lead to incorrectly delivered doses of radiation - as much as 100 times the intended dose. Between 1985 and 1987 at least six patients were exposed to dangerous doses of radiation and developed symptoms of radiation sickness; three of these patients died 2.

The Denver Airport Baggage System (1995)

The Denver International Airport Baggage System was designed to be the most advanced baggage system in the world, automating the handling of baggage throughout the airport. The project was a spectacular failure, leaving the airport inoperable for 16 months after the rest of the airport was ready, running $560 million dollars over budget, and the final system achieved only a fraction of the intended functionality. 3

The Ariane 5 Rocket (1996)

The European Space Agency’s unmanned Ariane 5 Rocket exploded just seconds after launch. This was the sad culmination of a decade-long project costing $7 billion. An investigation into the failure showed that the issue was an integer overrun error in the control software that occurred when a 64-bit floating point was converted into a 16-bit signed integer - the resulting integer value was too large to be represented by a 16-bit integer, leading to a crash of the primary system. The backup system encountered the same problem, leading to a destabilized flight and subsequent explosion. 4

The German Toll Collect system (2003)

Germany contracted with the Toll Collect, a syndicate involving large corporations including DaimlerChrysler, Deutsche Telekob, and Cofiroute, to establish an automated toll-collection system that would use GPS receivers mounted in trucks to collect Autobahn usage data and automatically bill truckers’ accounts for the usage. The project was slated to open in August 2003, but was eventually cancelled by the German government when the project failed to be ready by February 2004, after Toll Collect admitted the system would not likely be ready until 2006. Over 156 million euro of revenue was lost due to uncollected tolls before the contract was canceled. 5

The Healthcare.gov Launch (2003)

The Healthcare.gov marketplace was a cornerstone of the Affordable Care Act - a website where Americans without access to employer-based health insurance could enroll in a competitively-priced insurance plan. At launch the site was plagued with technical issues, and only an estimated 1% of its users were able to complete an enrollment, and many of these applications were failed for missing information.

FAA NextGen (2003-)

In 2003, Congress authorized the Federal Aviation Administration to replace the now 45 year old Air Traffic Control System used to control all flights over the US, Host. The new system, named, NextGen, is a modernized approach using GPS and able to hand-off control to multiple redundant systems. Its development has been plagued with issues, with many features pushed back five or more years and costs ballooning an estimated $2.6 billion over the initial projected budget. 6

US National Grid Gas Company (2012)

The New York based utility, National Grid, sought to replace its old Oracle system with a cloud-based system built using SAP technology, at an initial cost of $383.8 million. The resulting system had serious errors, including making incorrect payments to workers that culminated in lawsuits against the utility. National Grid had to hire and additional 450 contractors to fix the software flaws, and an additional 400 workers to manually perform the work the system was supposed to be doing while it was fixed. The additional costs were estimated to be an additional $561.30 million, a cost passed on to their customers. 7

Heartbleed (2014)

In 2014, the bug heartbleed was discovered in the open-source library OpenSSL, used to create secure connections across the internet. The bug allows any internet-connected adversary to read the memory of systems using unpatched OpenSSL-implementing software. OpenSSL is one of the most widely used TLS libraries, and is integrated into both Apache and nginx open-source server software, which together host over 66% of the web sites on the Internet. The bug was the result of a simple programming error in the library. 8

Meltdown and Spectre Exploits (2018)

The duo of Meltdown and Spectre are processor-based vulnerabilities found in 2018. The Meltdown attack exploits out-of-order execution, a performance enhancing feature of modern processors, to allow a malicious program to access memory outside of that assigned to it by the operating system - i.e. it can read memory belonging to other applications, and even the OS. Spectre exploits branch prediction and speculative execution, a feature of modern processors that allow them to “guess” what the program might need to do next and execute it ahead of time. The result is that some actions that would not have been carried out by the program are nonetheless executed; a spectre attack takes advantage of this to “trick” the processor into speculatively executing a properly-designed program in a way that leaves some data open for the exploit to read. Both are the results of optimizing features that break the traditional program execution model, and therefore the security expectations programs are designed against. 9

Boeing 737 Max MCAS (2018)

For their 737 Max model jetliner, Boeing increased the engine size on a 737 airframe to increase fuel efficiency. To fit the oversize engines on the wings without changing other aspects of the airframe (which would require significant retooling of their factories), they shifted these large engines forward, which changed the aerodynamic properties of the jet. This change could result in engines stalling if the nose of the plane was pitched too far forward. To prevent this, a computer system was added, the MCAS, which would bring the nose back down when the angle of attack was too steep, effectively cutting the pilot out of the control loop. The software on this system focused on the output of a single angle-of-attack sensor, rather than redundant sensors, and this interaction with faulty sensors led to two deadly crashes shortly after it began operations, and the eventual grounding of all Boeing 737 Max planes. 10