Resilience, the other side of risk

We all want our systems to exhibit resiliency, whether they are organizations, airplanes, or our human bodies. Here’s a good definition of resilience:

… the ability of the system to react to perturbations, internal failures, and environmental events by absorbing the disturbance and/or reorganizing to maintain its functions.

Notice how this definition of resilience is intimately tied to risk. All the things we want our system to be resilient to are risks. If there were no risk, there would be no need for the concept of resilience.

But the definition adds an important point. We don’t just want our systems to be robust to environmental changes. We want them to reorganize their functions to adapt to the environment. This brings in Darwin’s evolution, Bertalanffy’s equifinality, and Nassim Taleb’s antifragility.

Some background:

Before John von Neumann, most people thought about resilience in terms of better planning. In other words, the best way to produce resilient systems was to analyze the system, reduce it to its functioning parts, identify every possible perturbation, and fix any problem areas.

Resilience was also about prediction and control. It was procedural. It assumed the following:

(1) Defined boundaries — the system can be isolated from its environment and “observed” from the outside.

(2) Linear interactions — we can understand all aspects of the system by an account of its parts.

Certainly, there are instances in the world that conform to these assumptions. Take an aircraft, for example. We can pretty cleanly separate the airplane from the airspace around it. And we know exactly how its components work. We know how much turbulence an airframe can manage, or how much debris the engine can handle. Resilience can then be achieved by measuring the risks involved and engineering the parts to withstand them.
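To make that logic concrete, here is a minimal sketch of “measure the risk, engineer the part to withstand it.” The load value and the simple safety-factor rule are assumptions invented for illustration, not real aerospace practice.

```python
# Toy sketch of "measure the risk, then engineer the part to withstand it".
# The numbers and the safety factor are illustrative assumptions only.

def required_strength(max_measured_load: float, safety_factor: float = 1.5) -> float:
    """Design strength that covers the worst measured load plus a margin."""
    return max_measured_load * safety_factor

max_gust_load = 120.0  # worst turbulence load we have measured (kN), assumed
print(f"Design the component to withstand at least {required_strength(max_gust_load):.0f} kN")

# If the measured risks really are the whole story, a part built to this
# specification is "resilient" in the narrow prediction-and-control sense.
```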

When a system has these properties, the reductionist approach of predicting and controlling the parts is a pretty good answer to resilience. But sometimes that resilience comes at a severe cost. Sometimes you can’t create a component with the performance needed to overcome 100% of the known environmental risk, or it is prohibitively costly to do so.

John von Neumann considered how you could create a system that is more reliable than its parts. One particular issue was with transistors, which were known to fail at a certain rate. He produced a theory of redundancy: the reliability of a system could be ensured through sufficient redundancy of its component parts. His 1956 paper set the foundation for high-reliability engineering (note: Claude Shannon had reached a similar understanding of redundancy in his still under-rated 1948 paper on information theory).

Great! If we know how the system’s parts function and we can measure the external risks, we have a sure-fire way to build in resilience. We can make systems as reliable as we like by adding redundancy. Sure, we lose efficiency because of the cost of maintaining and accessing spare parts, but overall the system is resilient.
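A minimal sketch of the arithmetic behind that claim, assuming independent failures and an invented per-component failure probability of 1%: if any one of n redundant parts is enough to keep a function alive, the function fails only when all n fail at once.

```python
# von Neumann-style redundancy arithmetic (illustrative assumptions only):
# independent failures, and any single surviving spare keeps the function alive.

def failure_with_redundancy(p_fail: float, n_redundant: int) -> float:
    """Probability that all n independent redundant parts fail together."""
    return p_fail ** n_redundant

p = 0.01  # assumed per-component failure probability
for n in (1, 2, 3):
    print(f"{n} redundant part(s): failure probability = {failure_with_redundancy(p, n):.6f}")

# 1 part : 0.010000
# 2 parts: 0.000100
# 3 parts: 0.000001
# Reliability climbs quickly but never literally reaches 100%, and the spares
# still have to be stored, monitored, and switched in - the logistics cost
# mentioned above.
```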

But notice this engineering framework does not meet the definition of resilience above. It does not “reorganize” itself to maintain its functions. In fact, it requires a more advanced management and logistics system to handle the redundancy, which itself becomes a new point of vulnerability to failure! Von Neumann recognized as much when he observed that biological systems

… contain the necessary arrangements to diagnose errors as they occur, to readjust the organism so as to minimize the effect of errors, and finally to correct or to block permanently the faulty component.

This kind of resilience is more complicated than redundancy of parts alone.

So what does this have to do with risks?

The larger view of resilience is that the system can absorb unanticipated shocks, fail gracefully, and learn to adapt and overcome those shocks — even anticipating new risks.

The basic point comes from Hume’s problem of induction, which was picked up and expanded on by Nassim Taleb. Briefly, we often speak of risk as if we know with 100% certainty the probability distribution and the size of the impacts; we just don’t know which of those outcomes we will get. For example, we know a fair coin will land on heads 50% of the time. But what about “black swans”, events for which we have little or no data from which to generate a probability distribution? And it’s not just their probability that is unknown, but their impact.

What happens under “radical uncertainty”? What about the things we didn’t know we didn’t know about? (For example, I know that I don’t know the Spanish language, and I know that I don’t know how an iPhone is built. But I’m sure that if I tried to learn how to build an iPhone, I would discover all sorts of science and technology that I simply didn’t know existed, even though somebody knew about them. And we can be sure there is “valid” science that no one knows about yet, just as we can be sure there are risks to our system that no one knows about. We cannot “map” the risks to our systems in a complex world.)

The question then is, how do these risks manifest in our system? Well, perhaps there is nothing you can do about them for systems with defined boundaries and linear interactions. But those assumptions were only ever valid for a small class of systems, ones that were never intended to spontaneously adapt, grow, and learn. In fact, we might not want our airplanes “learning” how to fly and behaving unpredictably. So these systems will always be relatively fragile.

But we can appeal to higher levels of analysis. We want the air-traffic system itself to behave more like a complex adaptive system, perhaps, able to identify errors and update aircraft to ensure passengers’ safety. In order to generate self-organizing behavior, these systems must break the assumptions of defined boundaries and linear interactions.

(1) “Fuzzy” boundaries.

Complex systems cannot be perfectly isolated from their environments. For example, we may be able to distinguish an airplane from the airspace, but it is difficult to determine the boundaries of a project team within an organization, or of the organization from the broader political economy. The same is true of all natural systems. The physicist David Bohm found that

The notion of a separate organism is clearly an abstraction, as is also its boundary. Underlying all this is unbroken wholeness even though our civilization has developed in such a way as to strongly emphasize the separation into parts.

Such systems co-evolve with their environments over long periods of time. That is what makes them resilient: they have either survived or been filtered out. It is not completely possible to take yourself “outside” of the system and test it for errors, to predict which specification would lead to survival. Of course you want to do that to the extent possible. But the environments that give rise to errors may not have presented themselves yet. They cannot be predicted and controlled for as in a tidy experiment.

(2) System interactions are nonlinear.

The parts of systems necessary for the kind of resilience we are talking about — reorganization of functions to meet changing environments — are not homogeneous and additive. Small effects in one area can create outsized effects throughout the system.

Nonlinear effects generally propagate through feedback loops. Three important kinds are:

(1) Iterative feedback, where outputs are routed back as inputs (present throughout natural phenomena, as when nucleotides encode the proteins that in turn replicate nucleotides);

(2) Downward causation, where emergent behavior at higher levels constrains lower-level behavior, as when a convection flow constrains the motion of the fluid’s molecules, or valence conditions constrain the behavior of individual electrons; and

(3) Backward causation, where expectations about the future impact actions today, most obvious in asset markets.

Because of feedback loops and other resonant phenomena, precise long-range prediction in nonlinear systems is impossible. If you cannot predict how the system will react in any given case, then you cannot say you know the risks involved. A minor fluctuation may give rise to a “butterfly effect.”
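Here is a minimal sketch of that sensitivity, using the logistic map as a stand-in for a nonlinear feedback loop; the map and its parameters are my illustrative choice, not anything from the sources above. Each output is fed straight back in as the next input, and a tiny perturbation to the starting value ends up dominating the trajectory.

```python
# Sensitive dependence on initial conditions ("butterfly effect") in a toy
# nonlinear feedback loop: the logistic map, x_next = r * x * (1 - x).

def logistic_trajectory(x0: float, r: float = 3.9, steps: int = 50) -> list:
    """Iterate the logistic map, feeding each output back in as the next input."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.200000)
b = logistic_trajectory(0.200001)  # a "minor fluctuation" in the starting point

for t in (0, 10, 25, 50):
    print(f"step {t:2d}: {a[t]:.6f} vs {b[t]:.6f}")

# The two runs start essentially identical, but the tiny difference is
# amplified until the trajectories decorrelate completely, so point forecasts
# far into the future tell you nothing about the risk.
```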

(3) No one person understands the system’s complete functioning.

This is mostly a consequence of the previous two issues. If the system can absorb shocks from unforeseen risks and organize its functioning around them, then there is no definite procedure for developing it. That means we cannot necessarily use prediction and control in engineering. It also means that the redundancy point isn’t so helpful, because our employment of redundancy is largely dependent on our ability to calculate the probability of failure.

But redundancy continues to be important, if in a different way. For example, W. Ross Ashby argued that internal regulation of complex systems requires a “requisite variety” of mechanisms to deal with an environment characterized by continual flow and change. As environmental challenges grow, the system needs a larger number of stable states to cope. Such variety requires a large number of parts and numerous paths of communication. Donella Meadows explained how

Resilience arises from a rich structure of many feedback loops that can work in different ways to restore a system even after a large perturbation. A single balancing loop brings a system stock back to its desired state. Resilience is provided by several such loops, operating through different mechanisms, at different time scales, and with redundancy—one kicking in if another one fails.
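Here is a minimal toy simulation of that idea. The stock, the two loop gains, the time scales, and the “fast loop fails” scenario are all assumptions invented for illustration, not anything from Meadows.

```python
# Two balancing feedback loops, operating through different mechanisms and at
# different time scales, both pull a stock back toward its desired state.

TARGET = 100.0

def simulate(fast_loop_alive: bool, steps: int = 60) -> float:
    """Run the stock through a shock at t=5 and return its final value."""
    stock = TARGET
    for t in range(steps):
        if t == 5:
            stock -= 40.0              # a large external perturbation
        gap = TARGET - stock
        if fast_loop_alive:
            stock += 0.5 * gap         # fast balancing loop
        stock += 0.05 * gap            # slow backup loop (a different mechanism)
    return stock

print(f"both loops working: stock ends at {simulate(True):.1f}")
print(f"fast loop failed  : stock ends at {simulate(False):.1f}")

# Either way the stock is pulled back toward 100. When the fast loop fails,
# the slow loop is already there to do the work - it just takes longer.
```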

Similarly, Ludwig von Bertalanffy described the idea of equifinality. Any outcome in an open system can be reached through multiple nonlinear pathways of causality, creating resilience. By contrast, outcomes in closed systems have only a single path of cause-and-effect. And I’ll leave you with a Nassim Taleb quote:

Redundancy equals insurance… The organism with the largest number of secondary uses is the one that will gain the most from environmental randomness and epistemic opacity!

So redundancy in complex systems cannot simply mean spare parts plus a single system of management and logistics. Secondary uses imply different causal pathways, new ways of doing things, and innovation between redundant features. They are distinct and overlapping, substitutes and complements. That isn’t something designed from the start, but something that emerges in development.

Conclusion.

I think we do not yet know how to build resilient systems in the adaptive sense. But firms clearly recognize the need for agility and speed. One way of thinking about it is to ask whether machine learning can identify and fix errors by analyzing historical data. Another is that human organizations are now coextensive with the products they create. I think the latter is reflected in “Agile” software development, and it does not preclude use of the former.

If we can’t achieve resilience through prediction and control of a particular process, then we need to move to a higher level of analysis. What is the larger set of processes, and how do we build resilience there in order to achieve our ends? This moves our attention from the individual to the population, from the project to the organization, and from the product to the value-chain.

In future posts, I’ll discuss how human systems identify errors and how competition is necessary for discovering knowledge. The critical point is that redundancy is also the exploration of alternative perspectives.
