Is it possible to take a set of individually unreliable units and form them into a system “with any arbitrarily high reliability”? Can we, in other words, build an organization that is more reliable than any of its parts?
The answer, mirabile dictu, is yes. In what is now a truly classical paper, Von Neumann demonstrated that it could be done by adding sufficient redundancy… the probability of failure in a system decreases exponentially as redundancy factors are increased. Increasing reliability in this manner, of course, raises the price to be paid and if fail-safe conditions are to be reached, the cost may be prohibitive. But an immediate corollary of the theorem eases this problem for it requires only arithmetic increases in redundancy to yield geometric increases in reliability. Costs may then be quite manageable.
That was the excellent Martin Landau’s classic paper, “Redundancy, Rationality, and the Problem of Duplication and Overlap.” What Landau is pointing to with high-reliability organizations is the other side of the coin from O-Ring theory.
In O-Ring theory, named after the Challenger accident, a system has a large number of single points of failure, and every one of them has to work for the whole to succeed. On the Challenger, if a single O-ring fails, the whole shuttle explodes. On a production line where workers add parts to a product, any one worker’s mistake can ruin the whole thing.
The likelihood of failure compounds as the number of steps in the production process grows. Even if every step is 99% reliable, a 100-step process succeeds only about 36.6% of the time (0.99^100). In these highly interconnected processes with zero redundancy, you have to make each step extremely reliable to have a good chance of succeeding.
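To make that arithmetic concrete, here is a minimal sketch in Python, using the same numbers as the example above, of how fast a serial, zero-redundancy process degrades:

```python
# Probability that a zero-redundancy, 100-step process succeeds
# when each step is 99% reliable (the O-Ring example above).
step_reliability = 0.99
num_steps = 100

p_success = step_reliability ** num_steps
print(f"Chance the whole process succeeds: {p_success:.1%}")  # ~36.6%
```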
Von Neumann’s original application was computing hardware built from unreliable components. If any single component failed, the error could cascade through the machine. You could either pour enormous effort into making sure no component ever fails, or you could add a whole bunch of redundant components. There’s a reliability vs. redundancy tradeoff, and redundancy is usually the far cheaper way to hit a given reliability target.
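A rough illustration of that tradeoff, with made-up numbers: to push a 100-step serial process to roughly 99% overall success, you can either drive every step to 99.99% reliability, or keep ordinary 99%-reliable steps and duplicate each one so that only a double failure breaks the chain.

```python
# Illustrative numbers only: two ways to get a 100-step process
# to roughly 99% overall success.
num_steps = 100

# Option A: make every single step nearly perfect (99.99% reliable).
near_perfect_step = 0.9999
option_a = near_perfect_step ** num_steps  # ~99.0%

# Option B: keep ordinary 99%-reliable steps, but duplicate each one,
# so a step only fails if both redundant units fail.
ordinary_step = 0.99
redundant_step = 1 - (1 - ordinary_step) ** 2  # 0.9999
option_b = redundant_step ** num_steps  # ~99.0%

print(f"Option A (near-perfect parts): {option_a:.1%}")
print(f"Option B (redundant parts):    {option_b:.1%}")
```

Both options land in the same place; the question is whether it is cheaper to make every part nearly perfect or to buy two ordinary ones.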
Landau’s point was that in bureaucracies, the ideal is often to have zero duplication and overlap between organizations. Completely streamline everything so there’s no waste. This is exactly the point of McNamara’s Planning-Programming-Budgeting System. Expose every weapons project in DoD, spot duplication (e.g., Navy and Air Force building their own fighter aircraft), and then eliminate that duplication by picking a single best solution. Landau says:
For the public administration rationalist, the optimal organization consists of units that are wholly compatible, precisely connected, fully determined, and, therefore, perfectly reliable.
The problem, as the O-Ring example makes clear, is that if you have only one project for each requirement, then every project has to be super reliable. The effort required to make every project team that successful, with zero errors, is incredibly high. By adding redundancy, you can afford not to over-staff and over-manage each project. The likelihood that two project teams will both fail to deliver value is lower, and the likelihood of failure drops precipitously as you increase to three, four, or more project teams.
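As a quick sketch with an assumed failure rate: suppose each independent project team has, say, a 30% chance of failing to deliver. The chance that nobody delivers falls geometrically as teams are added, which is exactly Landau’s corollary that arithmetic increases in redundancy buy geometric increases in reliability.

```python
# Assumed, illustrative failure rate for a single project team.
team_failure_rate = 0.30

for num_teams in range(1, 5):
    all_fail = team_failure_rate ** num_teams
    print(f"{num_teams} team(s): chance nobody delivers = {all_fail:.2%}")
# 1 team: 30.00%, 2 teams: 9.00%, 3 teams: 2.70%, 4 teams: 0.81%
```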
Moreover, when you have overlapping projects, you can spot errors. Redundancy is an error detection and correction mechanism. It sounds crazy to say, but a single project won’t necessarily know whether it made an error. For example, could you have known that United Launch Alliance was selling over-priced launches before SpaceX decided to build its own reusable rocket? Without SpaceX’s Falcon 9, we would never have known that a launch to low Earth orbit could be had for under $100 million. DoD was paying over $400 million on average!
I’ll close with Landau. Consider this closely because it gets to the heart of innovation problems in DoD:
And self-organizing systems exhibit a degree of reliability that is so far superior to anything we can build as to prompt theorists to suggest “that the richly redundant networks of biological organisms must have capabilities beyond anything our theories can yet explain.” In Von Neumann’s phrasing, they “contain the necessary arrangements to diagnose errors as they occur, to readjust the organism so as to minimize the effect of errors, and finally to correct or to block permanently the faulty component.”
As usual, Eric, I agree with everything you say right up to the point where you then blame it on PPBE. It’s simply not true that McNamara’s approach is innately anti-redundancy. McNamara gave us the nuclear triad, which is about as expensive as you can get in terms of redundant capability, because Systems Analysis recognized that redundancy was needed to achieve the level of assurance demanded by the threat of Soviet weapons. PPBE also got us that SpaceX rocket, so clearly that kind of competition (and redundancy) is possible within PPBE. Your own example of the Navy and Air Force building their own fighter aircraft is something that has happened again and again under PPBE, so I’m not sure it’s an example that works in your favor…