Negative engineering and how DoD avoids risk by putting its head in the sand

Here’s an interesting article from Jeremiah Lowin, “What is Negative Engineering?” He gives an example in which a company suffers a catastrophic failure: a flood of error messages at every layer of the stack. The team works backwards for days and eventually finds that an expired credit card had cascaded errors throughout the system.

The problem here, therefore, wasn’t the fact that an error occurred. There will always be errors, even in the most sophisticated infrastructures. The real problem was how limited the team’s options were to address it. Faced with a critical business issue and a deceptive cause, they were forced to waste time, effort, and talent in an effort to make sure this one unexpected quirk wouldn’t rear its head again.

 

So, what would be a better solution? I think it’s something akin to risk management for code or, more succinctly, negative engineering…If positive engineering is taken to mean the day-to-day work that engineers do to deliver productive, expected outcomes, then negative engineering is the insurance that protects those outcomes by defending them from an infinity of possible failures.

The author says that engineers spend “up to an astounding 90 percent of their working hours” triaging negative engineering issues. This stems from writing more and more code while building up technical debt through tiny patches and band-aids. IBM’s Redbooks in the 1990s held that only 20 percent of code should be functional; the rest is error handling and resilience.

Certainly there’s some rhyme with defense programs. Development inevitably runs into problems executing against the long list of requirements. But Operational Test & Evaluation is staring the engineers in the face, and the whole business incentive is to get through it on time and start collecting the big production and sustainment dollars on the other side. The production money is already lined up, so taking more time to do the development right, or changing the requirements, could cascade into a major restructuring. So quick little fixes are made in development that create very high production and sustainment costs. And the whole thing moves along, stretching the program out at a higher unit cost and requiring tons of later modifications to make it work. The F-35 is a classic case study, where several hundred production units are not even combat capable without upgrades to Block IV.

Here’s more from the author:

But here’s the rub: The tasks associated with negative engineering often arise from outside the software’s primary purpose, or in relation to external systems: rate-limited APIs, malformed data, unexpected nulls, worker crashes, missing dependencies, queries that time out, version mismatches, missed schedules, and so on. In fact, since engineers almost always account for the most obvious sources of error in their own code, these problems are more likely to come from an unexpected or external source.

Again, this is exactly what afflicts defense programs. Even though the program manager is supposed to execute his baseline plan, and the contractors their proposal, unexpected problems will often arise from external sources. New commercial tech provides an opportunity, political factors lead to reduced funding, subcontractors refuse to cooperate, a specification turns out to be infeasible, user feedback demands change, the enemy threat evolves… and on and on.

Yet the defense acquisition system puts its head in the sand, pretending immaculate multi-year plans will unfold as predicted.
