The remarkable story of Future Combat Systems

Not many people can legitimately claim to have helped their client to make a $270bn cost saving.

The remarkable story of the Future Combat Systems (FCS) project is an important one in the history of data networking. It was the trigger for the development of the new science of network performance. Until now, the story has gone untold. I have interviewed Fred Hammond and Dr Neil Davies of Predictable Network Solutions about their experiences in the period 2003-05.

What was the purpose of FCS?

FCS was a $300bn initiative sponsored by the US Department of Defense (DoD), with Boeing as the prime contractor. The aim was to enable soldiers in the field to be well-informed when making combat decisions. In the new generation of warfare, combat no longer takes place on a battlefield, but in urban landscapes and villages. Soldiers continue to need the same kind of information as on a battlefield, but now on an individual basis.

This is a challenging networking environment, with very short-term and rapid network deployments. You start with nothing, and yet within 48 hours you must have a complete communications infrastructure established. Three days later, it is all gone. Compare that to your local phone company, which can’t provision a new line in under six weeks!

You also don’t have a fixed battlefield, so that infrastructure is constantly changing: from inside buildings to outside, and from stationary to moving. In this ad-hoc environment, soldiers need to have both constant connectivity and access to a set of key services.

The network is also potentially under attack by an enemy, so failure is expected. A “graceful degradation schedule” describes the order in which things must fail. Voice was the last to go. Indeed, losing the word “don’t” in the phrase “don’t shoot!” was not allowed to happen.
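To make the idea concrete, a graceful degradation schedule can be thought of as an ordered shedding list. The sketch below is my own illustration with invented service names and a simple threshold rule, not the actual FCS schedule; only the principle that voice survives longest comes from the source:

```python
# Hypothetical sketch of a graceful degradation schedule: services are
# shed in a fixed order as capacity falls, with voice always last to go.
# Service names and the capacity rule are illustrative, not from FCS.

DEGRADATION_ORDER = [
    "bulk_imagery",      # first to be shed under load
    "logistics_sync",
    "video_feeds",
    "situational_map",
    "text_orders",
    "voice",             # last to go: "don't shoot!" must get through
]

def services_to_keep(capacity_fraction: float) -> list[str]:
    """Keep the tail of the schedule that fits the remaining capacity."""
    n = len(DEGRADATION_ORDER)
    # At full capacity keep everything; shed from the front as it shrinks.
    keep = max(1, round(capacity_fraction * n))
    return DEGRADATION_ORDER[n - keep:]

print(services_to_keep(1.0))  # everything survives
print(services_to_keep(0.2))  # only voice survives
```

The point of writing the schedule down explicitly is that failure order becomes a designed, reviewable property rather than an accident of load.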

Warfare is, by its very nature, a safety-of-life activity. The engineering requirement was to make the safety case for deployment.

Why did Boeing come to you?

As experts in safety-critical systems, we had been speaking at events on how to deliver predictable performance for networks in saturation, as was the case with FCS. We had also been having discussions with the US Navy on how they could have reliable offshore communications. Similarly, we had been talking to the Government Emergency Telephone System (GETS).

These conversations were all about maintaining critical infrastructure during times of crisis, and delivering services over a shared and constrained resource.

Boeing had been searching for industry experts to help them with an unsolved problem, namely how to make this safety case. We were approached by the management of the Lead Systems Integrator (LSI) team. This team was a joint operation of Boeing and SAIC, and was the internal group that represented the interests of the DoD.

The sensitivity and importance of the work meant we were required to become a Tier-1 subcontractor to Boeing. Whilst everyone else was a multinational and multi-billion dollar corporation, we were a team of three people. Yet we had to go through the same set of bureaucratic hoops.

To the best of our knowledge, we were the only foreign nationals on a 6,000 person project. Special arms regulation approvals were required for security clearance. We had to set up a heavyweight secure teleworking system, with the key data staying in the US. Everything had to be encrypted in flight in case any foreign governments took an undue interest in our work.

Why would Boeing go through such an onerous task to get us on board? Because it was that important, and they couldn’t find anyone in the US who could do the job.

What was the problem Boeing asked you to solve?

The key requirement was to create a system to reason about the FCS’s performance. This was to resolve a conflict between two groups of engineers.

The first group comprised systems engineers drawn from those who had built the space shuttle and international space station. They were really bright folk, many with PhDs in engineering. They were responsible for the overall system’s performance and fitness-for-purpose.

The other group comprised equally bright network engineers who were working on the Joint Tactical Radio System. This was a pre-specified, but (as yet) undeployed communications system. Their attitude was “we have the best science, we know what we’re doing, so don’t worry.” They had adopted a very “purpose-for-fitness” approach, even stating to an alarmed four-star general that “the network will give you what the network will give you.”

Unsurprisingly, this attitude made the systems engineers feel uneasy. Merely depending on the network “just working” didn’t make sense according to their discipline. Their intuition was that there was a high and unquantified risk.

The systems engineers didn’t feel confident they could deploy the applications and know they were going to work at all times when needed. Such a degree of unpredictability was (and is) unacceptable for a safety-of-life system.

This conflict had created a crisis situation. The design had to go through acceptance steps where they needed to make a reasonable case for its suitability. Knowing warfare is a dangerous game, they were willing to dial back the safety level, with one proviso. Whatever safety margin they picked, they had to look the DoD in the eye and say they had confidence in it.

But they couldn’t see how to get the information necessary to do that. They went looking for a network design which applied well-practised principles of engineering. What they found instead was, at best, black art.

The industry best practice then was “build it and simulate it for 6 months to see how it works and fails”. (It hasn’t improved since.) This involved a whole facility with a massive network simulator using custom hardware that swallowed tons of money. Worse, this simulator tried to reapply old techniques that had worked in the past on much more limited projects. This generated self-evidently false and misleading data.

The audit office was saying this didn’t stand up to scrutiny. A different approach was needed, one which was more science and less black art.

How did you go about tackling the science gap?

The systems engineering team had the professional ethos and confidence to learn, especially because they realised the degree of exposure that they faced. As a result, their ability to model “priority and precedence” failure modes advanced a great deal under our tutelage.

Our first step was to have a series of conversations with the key stakeholders. We quickly realised what was missing was a common language in which to discuss the problem domain of performance hazards. So we created one, and a corresponding training course to deliver it.

Having delivered this education, we sat down with a variety of teams, looking at different sets of problems. Our job as the “QoS experts” was to help them to quantify the demand requirement. This led us to develop the idea of a formal performance specification. What supply would minimally meet any given demand requirement?

Meanwhile, the network engineers were adamant that they didn’t want us to touch the network. They were very protective of their turf and not willing to engage in the same performance science process that the systems engineers embraced. Indeed, some of the meetings became highly confrontational – to the point where “home room monitors” were asked to attend. From our perspective, we were exposing the fact that the professional ground that they walked on, and thought was solid, was in reality merely quicksand.

Amongst ourselves we referred to this set of networking engineers as “The Sirens”: they sang beautiful technical songs, but were luring the FCS ship onto the rocks. They had the idea that they could just configure the network using DiffServ code points, and the network would be all set. After all, Cisco had solved the quality of service problem, hadn’t it?

As we educated the systems engineers in performance science, as opposed to network engineering black art, this kind of “you’ll get what you get” perspective was failing their “sniff test” miserably.

At last: a science of performance

The systems engineers knew they were heading into uncharted territory in terms of network design. This led us to develop for them a calculus of network performance.

This was a mathematical language that captured each performance demand requirement as a “Quantitative Timeliness Agreement” (QTA). These probability density functions modelled how long things could take. We were then able to aggregate and map these into a supply requirement.
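One way to picture a QTA is as a set of percentile bounds on delay, checked against a modelled delay distribution. The sketch below is my own illustration with invented numbers, not the actual FCS calculus:

```python
# Illustrative sketch (not the actual FCS calculus): a QTA expressed as
# percentile bounds on delivery delay, checked against a modelled
# delay distribution. All numbers here are invented for illustration.
import random

# "95% of deliveries within 50 ms, 99.9% within 200 ms, all within 500 ms."
qta = [(0.95, 0.050), (0.999, 0.200), (1.0, 0.500)]

def meets_qta(delays, qta):
    """True if the empirical delay distribution satisfies every bound."""
    delays = sorted(delays)
    n = len(delays)
    for quantile, bound in qta:
        # The delay at the given quantile must not exceed the bound.
        idx = min(n - 1, int(quantile * n))
        if delays[idx] > bound:
            return False
    return True

# Model one hop's delay as a fixed 10 ms transit plus exponential queueing
# with an 8 ms mean -- a crude stand-in for a real delay distribution.
random.seed(1)
delays = [0.010 + random.expovariate(1 / 0.008) for _ in range(10_000)]
print(meets_qta(delays, qta))
```

Stated this way, demand becomes something you can aggregate across applications and compare against what a candidate supply can actually deliver.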

In every other engineering discipline this process is taken for granted, and would be part of a basic undergraduate course. However in packet networking, we were off the map of common knowledge!

This was a big step up from the initial approach of using a “bandwidth requirement”. Bandwidth is problematic because it ignores how long you have to wait for any resource. In fact, we realised that the same performance engineering reasoning we introduced for the network could also be applied to compute and storage resources.

As a result we created a general means of reasoning about supply and demand of distributed computing systems working in overload. How so? We started from our understanding of how to construct real-time systems for other safety-critical environments.

At long last, the systems engineers could enjoy the same rigour from computing and networking that they regarded as normal for aerospace.

So could they make the safety case?

We would learn things about the network off the record, and saw how they needed to make supply, demand and delivery changes. We understood how the overall system needed to behave in order to allow for the graceful degradation of their applications under load.

The problem was, we could never get close enough to the ad-hoc wireless network to effect any changes to it; the network engineers reacted viscerally to any outside interference.

What we were able to do was to unambiguously show that (for some of the use cases) the operational requirements of FCS could not be met, at least not on the infrastructure they were actually building. Within that constraint, the project was never going to succeed.

The systems engineers were now able to quantify their unease. They collectively began to say “there is no way this is going to work”. Enough people began to realise that the core performance integration risk had been exposed, and could not be mitigated.

How to save $270bn (Ts&Cs apply)

With hindsight, we can now see that FCS went to software-defined radios too early. To work at high throughput you also need long pipelines, and thus there is latency. You can’t coordinate things like a group of people opening fire together unless you can tightly manage that latency. As a result, we had become a bearer of bad news: incontrovertible evidence that FCS as it was evolving was infeasible.

When it came time for our contract renewal, there was high level pushback. We had exposed some of the FCS design’s fatal flaws and, as a result, had become personae non gratae to a sufficient number of decision makers. And, as is frequently the case in large corporations, our internal champions had been promoted or moved on to other projects – in this case, fully aware of where we predicted the FCS project was headed.

It took another year, but eventually FCS was cancelled, having spent $30bn of its $300bn budget. Our findings were not the only reason FCS didn’t move forward, but we suspect that they were a key one. There’s no doubt we contributed to saving the US taxpayer $270bn by preventing an infeasible project from being developed and deployed.

