Why we need antifragile applications and polyservice networks

The radical idea of ‘antifragility’, proposed by the polymath scholar Nassim Taleb, has significant implications for telecommunications. Last December I tweeted a profound thought, originally expressed by my colleague Peter Thompson, who is CTO of Predictable Network Solutions Ltd. It joins the idea of antifragility back to packet networking: “A polyservice network enables the ‘optionality’ that every antifragile system requires.”

I would like to explore this deep idea, since it points the development of broadband networks in precisely the opposite direction to the one in which they are currently being built.

What is antifragility?

The essence of an antifragile system is relatively simple: the system gains strength (and hence longevity) from experiencing variability and disorder. Continued small stressors to the system create a ‘learned state’, which makes it more adaptive to future large stressors. As a consequence, large ‘tail risk’ shocks are no longer catastrophic.

This learning process requires ‘optionality’: the availability of choices when stressed. The constant environmental variability causes those entities that make bad choices to be ‘killed off’ early, leaving the fitter ones behind. The end result is a system with a sub-linear (decelerating) response to stress, and (at a meta level) predictable systemic outcomes that emerge out of the randomness of individual events.

Nature appears to be filled with antifragile systems. Bones and brains require use and stress to stay strong. The planet’s biosphere is dominated by antifragile feedback mechanisms that maintain stability at all scales. The few mass extinctions have all been driven by very extreme outlier stressors.

If this all sounds neo-Darwinian, it probably is: Taleb can be seen as a 21st-century intellectual descendant of Darwin, with a more rigorous mathematical toolkit. Taleb’s work, like Darwin’s, fundamentally challenges how we think about variability and its effect on systemic change and risk.

Fragility and robustness

Antifragile systems contrast with ‘fragile’ ones. In fragile systems, there is no strength gain from small stressors. This can be because there is no stress to begin with; or there is no optionality when presented with stress; or there is no mechanism to differentiate good responses to stress from bad ones. As a result, there is an accelerating response to stress, so catastrophe can and does happen at relatively moderate levels of stress. The strength and longevity of such systems is low.

Taleb emphasises that ‘antifragile’ and ‘robust’ are distinct and disjoint ideas. ‘Antifragile’ makes a feature out of the inevitability of stress; ‘robustness’ treats it as damage which has to be expensively mitigated. Each has its place. Individual plane wings are robust, because they are strengthened with costly titanium bars. Aviation safety as a system is antifragile, because every crash is fully investigated, and thus contributes to the future safety of flying. Making the whole aviation system robust against crashes, however, would be an impossible and unaffordable goal.

An example: the antifragile traveller

As we’re talking about flying, let’s consider the recent Christmas Eve power outage at London’s Gatwick airport. To borrow from broadband terminology, this tipped their ‘passenger-based statistical multiplexing’ system outside of its ‘predictable region of operation’ into chaos. How would this affect different kinds of traveller? [Inspiration source here, and worth an hour of your life to watch.]

The anxious Ms Fragile, Valium in her handbag, goes into a frenzy. Based on her previous travel hiccups, she has an inner fantasy that all travel is capricious and intrinsically persecutory. This unplanned and disturbing event merely confirms her belief, so she heads straight home and vows never to travel by air again. A minor panic turns into a total catastrophe.

The well-travelled Mr Robust, Kindle in his laptop bag, goes into a stoical state in response to the power cut. He waits for the problem to pass over whilst reading a good (battery-backlit) book. If he gets too hungry or thirsty, he can always walk out of the airport, take an expensive Christmas Eve taxi home, and re-schedule the travel for another date.

The adventurous Mrs Antifragile, Kendal Mint Cake in her backpack, sees this as an opportunity. It is one that she couldn’t have hoped for, or anticipated, or even bought. She is unattached to her original plan of flying off, and can stay in adequate comfort in the darkened airport. So she calmly pulls out a camera, to observe and document the situation for a possible future book on adventure travel. She also makes a mental note always to charge the batteries before leaving home, and not wait until she gets to her destination.

The moral of the story is that our past experiences, and our responses to them, condition our ability to cope with future stressors. Their equipment gave them optionality, their attitude the ability to exploit it, and their past learning contributed to both.

Today’s broadband networks are fragile

The fragile nature of broadband networks today is a result of the lack of optionality in the mechanisms being used, and the subsequent failure of applications to learn appropriate responses to small stresses that would protect them from serious harm caused by large ones.

Why is the optionality missing? The inherent choices the network offers as a resource ‘trading space’ are being suppressed. We have a ‘pipe’ mental model, and pipes just ‘pump’. They don’t give you choices, other than ‘how fast?’.

Hence in this default ‘monoservice’ model of broadband today, there is (by definition) only a single class of service on offer to applications. This reserves all the choices at short timescales to the network operator. Applications do not have visibility of the full range of resource allocation choices, or of the (good and bad) alternatives. Worse, the drive for maximum headline speeds at the least cost has the unintended side-effect of reducing optionality, making sudden collapses more likely. Why so?

In so-called ‘best effort’ broadband, all applications simultaneously attempt to grab the best quality ‘slice’ of the network. They are effectively all attempting denial of service attacks on each other to get their job done. Broadband users are predators, who all prey on their neighbours.

Incentives matter, and these are not good ones. No application developer or user faces any blowback for contributing to rivals’ application failures, or catastrophic network collapses. This competition for resources results in a lot of variability and stress to applications. Higher speeds lead to more variability, and hence stress.

This form of stress is technically called ‘non-stationarity’. It is a shift in the probability distribution of packet loss and delay. Applications either fail in response to the stress, or purportedly ‘adaptive’ protocols and codecs attempt to shed or raise load in response to changing resource conditions. (The reality to the end user is ‘degradation’, not ‘adaptation’.)
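To make ‘a shift in the probability distribution’ concrete, here is a minimal sketch that compares two windows of delay measurements using the two-sample Kolmogorov-Smirnov statistic (the largest gap between their empirical distribution functions). The sample values, the names `earlier` and `later`, and the choice of this particular statistic are my illustrative assumptions; the argument above does not depend on any specific test.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b (0.0 to 1.0)."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap between two step functions occurs at a sample point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical packet delay samples in milliseconds (not real measurements).
earlier = [10, 11, 12, 11, 10, 12]
later = [10, 11, 25, 30, 28, 27]

same = ks_statistic(earlier, earlier)   # identical windows: no shift
shift = ks_statistic(earlier, later)    # later window has drifted upwards
```

A value near zero suggests the two windows look alike; a value near one suggests the distribution has moved. What threshold should count as ‘non-stationary’ is a judgment call that this sketch leaves open.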

The nature of non-stationarity means these adaptive protocols can and must make bad ‘guesses’ as to what load to apply, since non-stationarity by definition means the past ceases to be a guide to the future. The small stressors also suffer from a ‘condensation effect’, which makes for a non-linear accelerating failure under load.

This is precisely the ‘concave’ accelerating response curve that Taleb warns us so strongly against.
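A textbook way to see an accelerating response under load is the single-server M/M/1 queue, whose mean time in system is 1 / (capacity - load). The sketch below is only an illustration under that assumption: real broadband networks are not M/M/1 queues, and the function names and load levels are mine.

```python
def mean_delay(load: float, capacity: float = 1.0) -> float:
    """Mean time in system for an M/M/1 queue: 1 / (capacity - load)."""
    if not 0 <= load < capacity:
        raise ValueError("load must be in [0, capacity)")
    return 1.0 / (capacity - load)


def extra_delay_per_step(loads):
    """Increase in mean delay between consecutive load levels."""
    delays = [mean_delay(x) for x in loads]
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


# Equal increments of load produce ever larger increments of delay.
loads = [0.5, 0.6, 0.7, 0.8, 0.9]
steps = extra_delay_per_step(loads)

# Variability hurts: a load swinging between 0.5 and 0.9 suffers more
# average delay than a steady load of 0.7 with the same mean.
steady = mean_delay(0.7)
fluctuating = (mean_delay(0.5) + mean_delay(0.9)) / 2
```

Each equal step in load costs more delay than the last, and the fluctuating load fares worse than the steady one despite having the same average. That is the shape of response that turns moderate stress into catastrophe.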

Robustness is not the answer

As applications are offered no real optionality by the network, there can be little or no learning. Prior stress events do nothing to help the applications and the network make better future resource allocation decisions. Yet failures still occur, and users want them to be mitigated.

The current approach to managing application failure is to increase robustness. The failure is caused by contention for shared resources, and isolation is restored with separate physical access, or statically-allocated logical resources. This doesn’t scale, because it requires allocating for the peak demand of every application, and the cost of doing so is unsustainable. Worse, it causes ‘superfragility’ if demand ever exceeds that fixed supply limit.

In fact, according to Taleb, we are doing the worst thing possible: adding capacity to monoservice networks (‘debt load’) in order to reduce the effects of non-stationarity stress (‘boom and bust’). This makes extreme ‘black swan’ type failures much more likely: ‘network stability crises’ and ‘investor confidence crises’, just like the 2008 financial crisis.

So given that we’re heading down a path to statistical ruin, what can we do differently?

Polyservicism introduces optionality

In a true polyservice network, the ‘quality’ on offer is separated out into multiple classes that are exposed to the applications. They have little or no overlap in quality, to avoid the ‘quality inversion’ effect. There must be at least three classes (economy, standard and superior), which are differentially priced to reflect their cost. Crucially, the ‘economy’ and ‘superior’ classes allow for a high level of choice between price and performance.

A polyservice network pushes the key resource allocation decisions back onto the applications. Applications request a ‘quantity of quality’, and the more they are willing to time-shift their demand (by using a lower class), the less the application provider or user pays. This re-creates the optionality, since applications have choices to make and (good or bad) alternatives to select from. In this model, networks resemble options trading platforms, not pipes.
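As a rough sketch of what such a choice might look like to an application, consider a menu of the three classes and a function that picks the cheapest one that still meets the application’s needs. The delay bounds, the prices, and the names `ServiceClass` and `cheapest_class` are all hypothetical; a real polyservice network would define quality in richer terms than a single delay figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceClass:
    name: str
    max_delay_ms: float   # hypothetical quality bound for the class
    price_per_gb: float   # hypothetical price, higher for better quality


# Three classes with no overlap in quality, differentially priced.
CLASSES = [
    ServiceClass("economy", 5000.0, 0.01),
    ServiceClass("standard", 200.0, 0.05),
    ServiceClass("superior", 20.0, 0.25),
]


def cheapest_class(tolerable_delay_ms, classes=CLASSES):
    """Return the cheapest class whose delay bound fits, or None if none does."""
    fits = [c for c in classes if c.max_delay_ms <= tolerable_delay_ms]
    return min(fits, key=lambda c: c.price_per_gb) if fits else None


backup = cheapest_class(10_000)   # can wait: economy is good enough
video = cheapest_class(300)       # moderately demanding: standard
voice = cheapest_class(150)       # standard is too slow: superior
```

The more delay an application can tolerate, the lower the class it can use and the less it pays, which is the time-shifting trade described above.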

The ‘options’ being traded are the right to offer a load to the network, and a specific future claim on the resources as a result of offering that load. Today’s (monoservice) contract is ‘offer any load you like, whenever you like, and anything might happen as a result’. In contrast, a polyservice network offers many kinds of ‘options contracts’, and brokers the different demands to allocate the resources to where they create most value.

But how do we know what are the ‘right’ resource trades to make? This is an emergent side-effect of the incentives created by the operation of antifragile applications on a polyservice network.

How antifragile applications learn from stress

When there is stress and resources are scarce, an application has two available strategies: it can either modify its behaviour to become more cooperative with other users and applications, or it can pay a ‘penalty’ to promote its traffic to a higher class to out-compete rivals and maintain its performance. If it ‘pays for promotion’, it could be because the developer is being ‘lazy’, or alternatively it could be because the application is critically important and very inflexible in its demands.
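The two strategies can be sketched as a simple comparison of costs at each stress event. Everything here is an assumption for illustration: the cost units, the fixed penalty, and the idea that an application can put a number on what cooperating would cost it.

```python
def respond(cost_of_cooperating, promotion_penalty):
    """Pick the cheaper of the two strategies for one stress event."""
    if cost_of_cooperating <= promotion_penalty:
        return "cooperate"
    return "pay_for_promotion"


def total_cost(event_costs, promotion_penalty, always_promote=False):
    """Sum what an application pays over a series of stress events."""
    total = 0
    for cost in event_costs:
        if always_promote or respond(cost, promotion_penalty) == "pay_for_promotion":
            total += promotion_penalty
        else:
            total += cost
    return total


# Hypothetical cost of cooperating at each of four stress events.
events = [1, 8, 2, 20]
penalty = 5

adaptive = total_cost(events, penalty)                   # 1 + 5 + 2 + 5
lazy = total_cost(events, penalty, always_promote=True)  # 5 + 5 + 5 + 5
```

The ‘lazy’ application that always pays for promotion ends up paying more than one that cooperates when cooperation is cheap. An application with very inflexible demands would see a high cost of cooperating at every event, and so would rationally pay the penalty each time.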

This behaviour is different from today’s ‘adaptive’ applications in a crucial way: there has to be a pre-existing ‘options contract’ to be able to draw upon. (Repeat: networks are really options trading platforms.) Thus the application adaptivity is not done dynamically at short timescales, since it is the job of the network to perform the ‘high-frequency trades’. Rather, adaptivity is done at medium to long timescales. True cooperation between applications is (only) possible at longer timescales, because the control system interactions must work faster than the effects being managed.

Antifragility optimises for long-term systemic stability

This process of ‘options trading’ engenders stability at all those longer timescales. Think of it like this: the strategies you use to visit the ATM to keep cash in your wallet, and those you use to avoid an overdraft each month before payday, are not the same ones you need to plan for possible serious illness, or to make provision for your retirement. A highly advanced polyservice network can offer long-term assured ‘options contracts’, such as ‘I want to do overnight backups every night for the next 3 years, starting midnight, for 1Gb of data, with 95% completion rate before 4am’. This gives network planners the information they need to accurately size the network, with appropriate slack to deal with peak loads and uncertainty.
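To show the kind of planning information such a contract carries, here is a minimal sketch that writes the backup example down as a record and derives a few figures from it. The class and field names are hypothetical, ‘1Gb’ is read as one gigabit, and the 95% completion rate is interpreted as the fraction of nights on which the backup must finish before 4am; both readings are my assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupContract:
    start_hour: int = 0             # midnight
    deadline_hour: int = 4          # 4am
    volume_bits: float = 1e9        # "1Gb" read as one gigabit (assumption)
    completion_rate: float = 0.95   # fraction of nights that must finish on time
    nights: int = 3 * 365           # three years, ignoring leap days

    def window_seconds(self) -> int:
        """Length of the nightly transfer window in seconds."""
        return (self.deadline_hour - self.start_hour) * 3600

    def min_mean_rate_bps(self) -> float:
        """Lowest average rate that still finishes within the window."""
        return self.volume_bits / self.window_seconds()

    def allowed_misses(self) -> int:
        """Nights that may overrun without breaching the contract."""
        required = math.ceil(self.nights * self.completion_rate)
        return self.nights - required


contract = BackupContract()
window = contract.window_seconds()
rate = contract.min_mean_rate_bps()
misses = contract.allowed_misses()
```

Under these assumptions the backup needs only a modest average rate spread across a four-hour window, and the planner knows how many nights are allowed to overrun. That is far more useful for sizing a network than a headline peak speed.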

Another way to think of this antifragile adaptive behaviour is to imagine walking to the office without a raincoat, and then it starts to rain. The ‘adaptive’ behaviour on offer today gives you a choice of catching pneumonia, or sleeping in the office; any other action is too late to avoid the harm. The ‘robust’ alternative means you have to carry a coat in all weathers. The ‘antifragile’ behaviour is leaving home knowing that you’ve left an umbrella at the office, in case you need it for your journey home, because you got soaked once before and don’t want it to happen again.

Antifragile applications need a polyservice network

In a polyservice network, applications now have to compete for resources on the basis of their ability to choose good long-run strategies for which classes to use for their traffic. Too low, and users abandon the application due to poor performance. Too high, and the application becomes uneconomic. The Darwinian (Talebian?) process weeds out the fragile and stupid strategies.

However, the playing field has also been tilted in a more fundamental way. Applications now have a strong incentive to alter their behaviour too, in favour of long-term cooperation in resource sharing. ‘Lazy’ application developers get competed away, as do ‘greedy’ ones. The end result is a networking system where the applications have been trained to cope with failure; fail in an order that makes sense as an emergent result of their normal operation; and only claim the resources they really need.

These ideas are not arcane academic issues of narrow minority interest. They strike to the heart of the economic viability of the telecommunications business. We are building networks with the low returns from fragile customer experiences, yet with the cost structure of robustness. Only an antifragile approach, using polyservice ‘option trading’ technology, can get us the best of all worlds: a rational incentive structure and sustainable economic model.