Podcast: Mission resilience with Trey Herr and Simon Handler

Trey Herr and Simon Handler from the Atlantic Council’s Cyber Statecraft Initiative joined me on the Acquisition Talk podcast to discuss how the Department of Defense can improve the mission resilience of its systems. The three pillars of resilience are robustness, responsiveness, and adaptability. In that description, resilience is about more than responding to adversity; it is also about capitalizing on opportunity. Oversight agencies should take note that adherence to plan is nowhere in that definition. During the episode, we discuss:

  • How mission resilience metrics differ from CMMC
  • The costs of excessive classification to security
  • How Netflix uses the chaos monkey to find failure modes
  • Comparing CIA’s Corona satellite development to that of F-35 ALIS
  • How the BattleLab idea can increase recombinatorial innovation

During the episode, we dive into a recent paper Trey and Simon wrote in conjunction with folks from MIT Lincoln Labs and Boston Cybernetics called “How do you fix a flying computer? Seeking resilience in software-intensive mission systems.” They recommend a new Center of Excellence for Mission Resilience in the DoD. The purpose would not be to duplicate cybersecurity initiatives, but rather to create metrics which can be put on contract to better verify that firms are using modern development processes like DevSecOps.

In order to have adequate status, such a Center of Excellence would require a Senate-confirmed position, a dedicated budget account, and quick access to the DepSecDef. But ultimately, it shouldn’t be a Top Secret project creating DoD-unique rules and processes. Instead, the Center should adopt the thought leadership from the commercial and academic sectors as to what makes organizations resilient.

Metrics for Resilience

As a continuous process for deploying software to production, DevSecOps allows organizations to adapt far more quickly than traditional deployments measured on the order of months. While successful implementation of DevSecOps improves system resilience, it has often been used as a branding term in proposals where the contractors have not undertaken the significant change in organizational structure required. Here’s Trey:

And we’ve absolutely heard complaints from some of our partners that they see a lot of discussion and a lot of phrasing and a lot of framing. And then things are delivered to them in a quarterly waterfall and nobody ever talks to the user.

While there are a number of good metrics out there, many of them depend on the organization or project. In order to be useful for contract requirements, the Center of Excellence for Mission Resilience would have to do some refining:

One measure of CICD [continuous integration, continuous delivery] adherence is the number of commits you do on a code base in a day. That’s interesting, but it’s pretty raw. And when we talk about different kinds of programs, different levels of sophistication, that may not be a great measure.

 

And so part of what we’re looking for that center of excellence to do is to actually define ways to measure some of these constructs, and that’s leveraging work being done in academia and in industry and the FFRDC and national lab community.

My general view is that in order for a metric to be useful, the analyst has to have a decent understanding of the context in which it was generated, including the team, tools, and project. Perhaps there are rules of thumb for translating metrics between contexts, but I’m doubtful they can be reported out and rank-ordered by the business manager from afar. I’ll be interested to see what comes of it and whether we start seeing anything on translation.
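To make the "commits per day" measure concrete, here is a minimal sketch of how it might be computed. It is not from the paper or the episode; the function names and the sample timestamps are hypothetical, and the input is assumed to be a list of ISO-8601 commit times exported from a version control log.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def commits_per_day(timestamps):
    """Count commits per calendar day from ISO-8601 timestamp strings."""
    return dict(Counter(datetime.fromisoformat(ts).date() for ts in timestamps))


def mean_commits_per_active_day(timestamps):
    """Average the daily counts over days that had at least one commit."""
    counts = commits_per_day(timestamps)
    return mean(counts.values()) if counts else 0.0


# Hypothetical sample data, for illustration only.
sample = [
    "2021-03-01T09:15:00",
    "2021-03-01T14:40:00",
    "2021-03-01T17:05:00",
    "2021-03-02T10:30:00",
]

print(commits_per_day(sample))
print(mean_commits_per_active_day(sample))
```

The number that comes out says nothing about commit size, team size, or tooling, which is exactly why a raw count is hard to compare across programs.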

Chaos Engineering

The way DoD usually tests weapon systems is by first defining a Test and Evaluation Master Plan that outlines the criteria by which the system will be tested over the lifecycle. The role of test and evaluation is then to see how well the developmental test articles met the program’s pre-planned list of objectives before fielding.

Netflix, however, takes a somewhat different approach where it subjects its production system to a variety of stresses at the extremes of what can be expected to occur. As Trey explained:

It gives them the opportunity to see that system operating under unique failure modes and unusual conditions as a way of learning about not only your own organization, but actually the system that you’re trying to maintain.

And here’s Simon with some evidence of the results:

There was a big AWS outage at one of their data centers a few years ago. I think it was in 2015. And Netflix didn’t experience much, if any, service-related interruption because they had gone through this chaos engineering process and they took those lessons to overcome that outage.
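The basic idea of a chaos experiment can be sketched in a few lines. This is a toy illustration under my own assumptions, not Netflix’s Chaos Monkey or its API: the Replica and Service classes and the run_experiment function are hypothetical, and the "failure" is just a flag flipped on a randomly chosen replica.

```python
import random


class Replica:
    """One copy of a service; it refuses requests when unhealthy."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Toy service that fails over to the next healthy replica."""

    def __init__(self, replicas):
        self.replicas = replicas

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy replicas")


def run_experiment(service, rng, requests=5):
    """Disable one random replica, send traffic, then restore it."""
    victim = rng.choice(service.replicas)
    victim.healthy = False
    served = 0
    for i in range(requests):
        try:
            service.handle(f"req-{i}")
            served += 1
        except RuntimeError:
            pass
    victim.healthy = True
    return victim.name, served


service = Service([Replica("a"), Replica("b"), Replica("c")])
victim, served = run_experiment(service, random.Random(0))
print(f"disabled {victim}; served {served} of 5 requests")
```

With three replicas the service keeps answering when any one is disabled; with a single replica the same experiment serves nothing, which is the kind of failure mode such a test is meant to surface before a real outage does.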

Listen to the whole episode for more.

Development Styles

Here’s a brief way of describing what separates the failed development of the F-35’s logistics information system, ALIS, from what characterizes successful projects:

The intensity with which that development effort has pursued a specific defined outcome, as opposed to the kind of rapid experimentation and messy attempts to satisfy a user community that we profile in the report.

Trey continues:

When we’re talking about, it’s not that there’s no outcome, right? We’re not asking DOD to start spending hundreds of billions of dollars on a journey without a destination. It would be impractical at best. I think what we’re really doing is to echo a couple of voices… saying the way that we’re acquiring systems right now, where all that I care about is the satisfaction of a requirements sheet, irrespective of costs, irrespective of usability, irrespective of security and flexibility down the line, is not a good model. It’s a somewhat hyperbolic way of describing the current acquisitions process. But unfortunately it’s not as far off as it should be with a lot of the systems that we’re talking about.

Thanks Trey and Simon!

I’d like to thank Trey Herr and Simon Handler for coming on the podcast. Be sure to read their paper, “How do you fix a flying computer?” and find out more about their Cyber Statecraft Initiative. Here’s Trey on the Supply Chain Security show and at USENIX Enigma 2021. Here’s an article from Simon on Questioning basic assumptions in the cyber domain and watch him on C-Span discussing the future of the NATO alliance. Follow Cyber Statecraft and Simon on Twitter @CyberStatecraft and @SimonPHandler.

Full-Text Transcripts
