Improving system resilience through falsification

In a previous post, I discussed system resilience and its relationship to risk. Here, I want to provide a way of building systems that are resilient to a broad range of risks, even those we don’t know exist yet.

Let’s start with an example. For developers of computer systems, security has long been an integral part of operations. We’ve all heard about the damage done to a company’s brand when its enterprise is hacked. This only becomes more important when these companies are managing our personal data (e.g., social media), our personal safety (e.g., self-driving cars), or our national security (e.g., satcom).

In the early days, a system hack was always a bad thing. The goal was to see zero hacks, ever. Through prediction and control of the system’s functions, developers hoped they could anticipate and guard against all the vulnerabilities — the risks — to their system.

Today, we have ethical hacking, where companies and government agencies pay experts to try to hack their systems. Any vulnerabilities they find can be reported and closed. While there is a market for these “white-hat” hackers, some argue that they haven’t been used enough.

With ethical hacking, the expectation is that the systems have risks that can be exploited by hackers. It’s just that the system developers don’t themselves know how the risks will manifest. This is a different emphasis.

Another example comes from computer server design. Ten or twenty years ago, the best server designs came from the financial industry, which vertically integrated its processes to maintain control. The financial industry could not afford to have servers go down, and that control came at a high cost. Google, however, took a different approach under the expectation of failure:

… clusters of commodity server nodes are built that deliver services as a distributed system. In such a distributed system, failures of nodes are immanent and it is important to understand the mechanics of hardware failures to successfully prepare such events.

I think of this second example with servers as something closer to the principle of von Neumann redundancy. You add redundant elements which are expected to fail, and you create management mechanisms to keep the service running. Reliability is created at the level of the system. This can be done because we have a good idea of the types of risks present to our system as well as their probability distributions.
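To make that concrete, here is a minimal sketch of the redundancy arithmetic. It assumes, purely for illustration, that node failures are independent and that each node is up with a known probability; the function and the numbers are hypothetical.

```python
# A minimal, hypothetical sketch: availability of a cluster that only needs a
# quorum of nodes, assuming independent node failures with known probability.
from math import comb

def cluster_availability(n_nodes: int, node_availability: float, needed: int) -> float:
    """Probability that at least `needed` of `n_nodes` nodes are up,
    when each node is independently up with probability `node_availability`."""
    return sum(
        comb(n_nodes, k)
        * node_availability ** k
        * (1 - node_availability) ** (n_nodes - k)
        for k in range(needed, n_nodes + 1)
    )

# One carefully controlled server that is up 99% of the time:
print(cluster_availability(1, 0.99, 1))   # 0.99

# Ten cheap commodity nodes, each up only 95% of the time,
# where any 7 are enough to carry the load:
print(cluster_availability(10, 0.95, 7))  # roughly 0.999
```

That arithmetic is only available because the failure probabilities are roughly known, which is exactly what the next example lacks.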

In the first example, the general problem of system security from hackers, we encounter unquantified risks. We often see operational systems that are more complicated than any one person comprehends. There are many millions of lines of code. There is risk in poor communication of overlapping knowledge between participants. There is also risk from changing technology, software updates, and innovative hackers. The system developers may simply not know where the relevant risks to system security will come from.

I think where unquantified risks are present — “black swans” — we have to appeal to a higher level of analysis in order to improve resiliency. Our example, ethical hacking, highlights two things.

First, we move our attention from the state of the system’s security itself — its architecture and vulnerabilities — to the broader organizational process with all its human interactions. Our computer systems don’t have one everlasting state; they change with technologies, requirements, and tastes. We must think about the developers creating the system and their processes. We must consider how the developers think about security. What are their hypotheses about vulnerabilities, and how do they test them?

Second, we move our attention from the coextensive system of humans and software within an organization to an even broader level of analysis. When we introduce “white hat” hackers, we have moved to thinking about the market environment. The “white hat” hackers are themselves innovative, and they interact with a number of different organizations and their unique systems. It is through the interactions of the various players that we extend the range of perspectives. With that greater exploration of the alternatives, vulnerabilities are identified faster, their solutions are developed, information disperses across the network, and systems are made more resilient.

I think the most important part about moving to higher levels of analysis is that it allows for the introduction and propagation of a diversity of perspectives, many of them conflicting. If the system developers had a complete concept of their system and its risks, then there would be no need for “white hat” hackers. It is precisely because developers are “too close” to the problem, with their own biases, that “white hat” hackers can be useful. They have different sets of experiences and knowledge, which lets them discover system vulnerabilities overlooked by others.

I think this principle is at play in science. We cannot say that a scientific theory is “true,” but when we have people always attempting to falsify a theory and they can’t seem to succeed, then we can give more credence to the theory as a scientific fact.

Similarly, we cannot say that our system is “resilient.” We can calculate resiliency for certain classes of risk. But for “unknown unknowns,” we can only try to falsify the system’s resiliency by realizing or simulating risks and seeing whether the system withstands them. We have to keep trying to falsify our hypothesis that the system is resilient, over and over. Often, this requires appealing to a larger system — the community — that can take advantage of as many perspectives as possible to falsify existing theories and generate new conjectures. These ideas have long been understood by thinkers like Hayek, Popper, Polanyi, and Alchian.
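To illustrate the logic, here is a toy sketch of what “trying to falsify” a resilience hypothesis might look like, in the spirit of fault-injection testing. Everything in it, the cluster model and the failure scenarios, is hypothetical; the point is only that a single counterexample refutes the hypothesis, while surviving every attempt adds credence without ever proving resilience.

```python
# A toy, hypothetical sketch of falsification by simulated risk (fault injection).
import random

class ToyCluster:
    """A stand-in service that stays up as long as a quorum of nodes survives."""
    def __init__(self, n_nodes: int, quorum: int):
        self.n_nodes = n_nodes
        self.quorum = quorum

    def withstands(self, failed_nodes: int) -> bool:
        return (self.n_nodes - failed_nodes) >= self.quorum

def inject_failures(cluster: ToyCluster, node_failure_prob: float) -> int:
    """Simulate one realized risk: each node independently fails with some probability."""
    return sum(random.random() < node_failure_prob for _ in range(cluster.n_nodes))

def try_to_falsify(cluster: ToyCluster, trials: int = 10_000) -> bool:
    """Throw randomized failure scenarios at the cluster.
    Returns True if any scenario refutes the claim 'this cluster is resilient'."""
    for _ in range(trials):
        failed = inject_failures(cluster, node_failure_prob=0.01)
        if not cluster.withstands(failed):
            return True   # one counterexample falsifies the hypothesis
    return False          # surviving every attempt adds credence, but proves nothing

cluster = ToyCluster(n_nodes=5, quorum=3)
print("resilience hypothesis falsified:", try_to_falsify(cluster))
```

The limitation is built in: the simulated risks are only the ones we already know how to draw. That is why the appeal is to a larger community of testers and conjectures rather than to any single test harness.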

Armen Alchian applied these ideas to weapon systems acquisition. He found that when there is sufficient uncertainty in weapon systems development, you don’t try to design for more contingencies; you appeal to a higher level of analysis. You create a diversity of systems that explores alternative concepts, then you allow real-world tests to falsify the poor designs and validate the good ones. More centralized decisions at the production stage keep the force structure coherent.

The resiliency of the weapons acquisition process is not created by building in flexibility to uncertain environments in a single system. Resiliency comes from the larger system of conjecture, development, test, and selection, which allows for rapid response to changing environments all while still maintaining internal coherence.
