Problems with measures of effectiveness for test & evaluation

I would be perfectly willing operationally to test anything anybody can name without having measures of effectiveness or criteria for the test. I am also perfectly willing to use them, but they can easily produce the wrong answer if you over-concentrate on them.


Let’s look at a few examples. If we are talking about testing an eight-inch gun aboard ship, the kind of criteria that I will find, if I go to the specification documents, are that it must fire 10 rounds a minute, or that it must fire 30,000 yards using full charges, or 22,000 yards using reduced charges. I don’t really care whether it meets these requirements or not. Eight rounds a minute or 12 rounds a minute isn’t going to make that big a difference. 30,000, 25,000, 35,000: these aren’t the important things to the operational tester. What he will use are the basic criteria by which gunnery has been measured over the years: hits per gun per minute, early hits, who’s afloat after the engagement is over, that kind of thing.


In R&M (Reliability and Maintainability) the same thing applies. I simply can’t use technical measures. If you ask me to test an air-to-ground missile and tell me its MTBF (Mean Time Between Failures) should be 30 hours, I don’t even know where to start. Are you talking about hours that the missile is under power on the bench, or hours that it is aloft, or hours that it is aloft under power, or hours that it’s on the aircraft under power on the flight line during pre-flight and post-flight testing, or what? How do these various kinds of times relate to each other? I don’t think anybody knows. Certainly I don’t. I doubt if the Program Manager does. I doubt if the contractor does.


Is a missile under standby power on an aircraft pulling 4 G’s at Mach 0.9 stressed more or less heavily than one under full power undergoing testing at an air-conditioned test site? I don’t know what the relationship between these times is. But I don’t have to know, because I should be using an operational measure, like “80% of the missiles should be in a fully up condition after 10 captive flights of typical duration (catapult takeoff, 1.8 hours’ flight time at high and low altitudes, arrested landing).” This is an R&M criterion that means something to the decision-maker in the Pentagon and the Task Group Commander alike, and something that I can test to. So DT&E and OT&E thresholds are different.

That was RADM Robert R. Monroe, US Navy, Commander, Operational Test and Evaluation Force, speaking at the National Security Industrial Association Conference in Washington, D.C., on 23 September 1976.
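
To make the admiral’s MTBF point concrete, here is a minimal sketch in Python. The missile records, field names, and failure counts are all invented for illustration; the point is only that one test campaign can yield several defensible “MTBF” figures depending on which clock you count, while the operational criterion needs no time-base convention at all.

```python
from dataclasses import dataclass

@dataclass
class MissileRecord:
    bench_hours: float      # powered hours on the test bench
    captive_hours: float    # hours aloft on the aircraft, under standby power
    preflight_hours: float  # powered hours on the flight line
    fully_up: bool          # status after 10 captive flights
    failures: int           # failures observed during the period

# Hypothetical test campaign: five missiles, ten captive flights each.
records = [
    MissileRecord(40.0, 18.0, 6.0, True, 1),
    MissileRecord(35.0, 17.5, 5.5, True, 0),
    MissileRecord(42.0, 16.0, 7.0, False, 2),
    MissileRecord(38.0, 18.5, 6.5, True, 1),
    MissileRecord(41.0, 17.0, 6.0, False, 3),
]

total_failures = sum(r.failures for r in records)

# Three equally defensible "MTBF" figures from the same campaign,
# differing only in which hours are counted as operating time:
for label, hours in [
    ("bench hours only", sum(r.bench_hours for r in records)),
    ("captive-flight hours only", sum(r.captive_hours for r in records)),
    ("all powered hours", sum(r.bench_hours + r.captive_hours + r.preflight_hours
                              for r in records)),
]:
    print(f"MTBF ({label}): {hours / total_failures:.1f} h")

# The operational criterion, by contrast, is unambiguous:
up = sum(r.fully_up for r in records)
print(f"Fully up after 10 captive flights: {up}/{len(records)} "
      f"({100 * up / len(records):.0f}% vs. 80% threshold)")
```

With these made-up numbers, the three MTBF figures range from about 12 hours to about 45 hours, and any of them could honestly be reported against a 30-hour threshold. That is exactly the ambiguity the admiral is describing.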

I think there is a lot of wisdom in this statement. The problem of metric selection is hard. If a product performs just a single function, then using effectiveness measures makes a lot of sense. But many systems perform a number of different functions under a number of alternative scenarios, and each measurement is incommensurable when stacked up against any other, i.e., there is no objective way to reconcile the value scales of the differing units.
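
As a toy illustration of that incommensurability, here is a sketch with hypothetical systems, functions, scores, and weightings. A weighted sum looks objective, but the ranking it produces depends entirely on the weights, and choosing the weights is itself a value judgment:

```python
scores = {
    # function:      (System A, System B), each normalized to 0-1
    "air-to-air":    (0.9, 0.6),
    "ground attack": (0.5, 0.9),
    "reliability":   (0.7, 0.8),
}

def weighted_total(weights: dict[str, float], idx: int) -> float:
    """Weighted-sum score for one system across all functions."""
    return sum(weights[f] * scores[f][idx] for f in scores)

# Two defensible weightings reverse the ranking:
for name, w in [
    ("fleet-defense emphasis", {"air-to-air": 0.6, "ground attack": 0.2, "reliability": 0.2}),
    ("strike emphasis",        {"air-to-air": 0.2, "ground attack": 0.6, "reliability": 0.2}),
]:
    a, b = weighted_total(w, 0), weighted_total(w, 1)
    winner = "A" if a > b else "B"
    print(f"{name}: A={a:.2f}, B={b:.2f} -> System {winner}")
```

Under the fleet-defense weighting System A wins (0.78 to 0.70); under the strike weighting System B wins (0.82 to 0.62). Nothing in the scores themselves settles which weighting is right.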

Choosing a specific list of evaluation criteria ahead of development in the Test & Evaluation Master Plan may create a number of issues, including: (1) the development process may discover the need for tradeoffs between criteria, but there is no means for judging their relative importance or quickly amending the list; (2) there may be factors not on the list that have important impacts on mission sets; (3) there may be a reluctance to field a system unless it meets every criterion, even if it performs better than existing equipment in a number of areas and users want it; and (4) new threats or opportunities may emerge that change the priority of the criteria.

Even though testers should use measures of effectiveness, it seems they should be subordinate to the collective judgment of experienced operators and testers. When those people get their hands on a product, it often inspires new ways of operating, or creative tests that couldn’t have been foreseen up front. And there may be something to be said for expanding on the chaos monkey concept.
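
What that might look like in a test program is sketched below; the components, failure probability, and mission-success rule are all hypothetical. The idea is to inject random degradations during an exercise and score mission completion, rather than walking through a fixed criteria list:

```python
import random

COMPONENTS = ["radar", "datalink", "GPS", "primary hydraulics"]

def run_mission(degraded: set[str]) -> bool:
    # Stand-in for an actual exercise; here, the mission succeeds as long
    # as the crew retains either radar or datalink for targeting.
    return "radar" not in degraded or "datalink" not in degraded

random.seed(1)  # deterministic for illustration
trials = 20
successes = 0
for _ in range(trials):
    # Each trial, knock out a random subset of components mid-mission.
    degraded = {c for c in COMPONENTS if random.random() < 0.3}
    successes += run_mission(degraded)

print(f"Missions completed despite random degradation: {successes}/{trials}")
```

The interesting output here is not pass/fail against a spec, but the pattern of which degradations operators could and couldn’t work around, which is exactly the kind of thing a fixed criteria list tends to miss.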


Source: Department of Defense Weapons Testing: Consultants, Contractors, and Policy. Hearing before the Subcommittee on Federal Services, Post Office, and Civil Service of the Committee on Governmental Affairs, United States Senate, One Hundred First Congress, First Session, June 16, 1989.
