Why Active Queue Management should worry telco investors

You may be interested to know that the IETF is pushing a technology that potentially undermines the economic basis for the Internet. This is called “Active Queue Management”.

It is the response to a technical problem whose (mis)diagnosis has been labelled ‘bufferbloat’. When long queues build up in routers, real-time and interactive applications are prone to failure. The superficial observation is that the buffers are too big. You can’t just shorten them, as that makes the packet loss rate rocket, and applications still fail. Hence you need to schedule the traffic differently.
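
To get a feel for the scale of the problem, here is a rough worked example (the numbers are purely illustrative): a 1 MB buffer draining onto a 10 Mbit/s link adds, when full,

$$ \frac{1\,\text{MB} \times 8\,\text{bits/byte}}{10\,\text{Mbit/s}} = 0.8\,\text{s} $$

of queueing delay – far more than voice, video calling or gaming can tolerate.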

Active Queue Management (AQM) is the proposed means of doing this, and there is a suite of competing AQM algorithms with obscure names. AQM continues a long Internet tradition: first we hit unanticipated failure modes, because the theoretical foundations of networking were never sorted out at the beginning. Then we heap hacks upon hacks to construct new success modes, whilst simultaneously arming new (and often more severe) failure hazards in the process. Rinse and repeat.

AQM treats flows differently depending on their inferred need. Those which look like real-time flows (relatively sparse and constant bitrate) get better quality than those which look like file transfers (dense and elastic). Sounds like a great idea – Marxism for multiplexing. “To each flow according to its need (as divined by the central network planners).”
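
To illustrate the mechanism, here is a minimal sketch of a scheduler that infers “need” from how sparse a flow looks and serves sparse flows first. It is not any specific IETF algorithm (CoDel, PIE, FQ-CoDel and friends are far more subtle); the class, the threshold and the names are invented for this example.

```python
from collections import defaultdict, deque

# Toy "AQM-like" scheduler: flows with only a few packets queued are inferred
# to be sparse/real-time and are dequeued ahead of dense/elastic flows.
# SPARSE_THRESHOLD is an invented parameter, purely for illustration.
SPARSE_THRESHOLD = 2  # packets queued per flow to still count as "sparse"

class ToyScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)   # one FIFO queue per flow

    def enqueue(self, flow_id, packet):
        self.queues[flow_id].append(packet)

    def dequeue(self):
        # Serve flows that *look* sparse first, then everyone else.
        sparse = [f for f, q in self.queues.items() if 0 < len(q) <= SPARSE_THRESHOLD]
        dense  = [f for f, q in self.queues.items() if len(q) > SPARSE_THRESHOLD]
        for flow_id in sparse + dense:
            return flow_id, self.queues[flow_id].popleft()
        return None  # nothing queued
```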

In practice this sets up a huge quality arbitrage, which will inevitably be exploited. By giving the best quality to sparse flows, AQM creates an incentive to divide bulkier flows up into lots of sparse ones. If you want your web browser to whizz along with the quality of real-time traffic, keep sharding every request. Faster streaming? Lots of sparse flows! Better P2P file sharing? Lots of sparse flows!
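
Reusing the toy scheduler sketched above, the arbitrage is easy to see: a bulk transfer carried in a single flow queues up as “dense” and waits, while the same bytes spread across many flow identifiers all stay under the sparseness threshold and jump the queue. (Again, an invented illustration, not a recipe.)

```python
sched = ToyScheduler()

# One bulk transfer as a single flow: looks dense, goes to the back.
for i in range(100):
    sched.enqueue("bulk-flow", f"chunk-{i}")

# The same transfer sharded across 50 flows of 2 packets each:
# every shard stays under SPARSE_THRESHOLD and gets served first.
for i in range(100):
    sched.enqueue(f"shard-{i % 50}", f"chunk-{i}")

print(sched.dequeue())  # a sharded packet is dequeued ahead of the bulk flow
```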

Since there is a finite amount of ‘good’ quality, this will no longer be reliably allocated to those flows that truly need it. Then Bad Things® happen. Rather than making real-time services work fabulously well, you arm a systemic collapse hazard for the network as it goes into overload. When you drive it outside of its predictable region of operation, chaotic behaviour patterns will take over. To state the obvious, this is not good for the customer experience. Customers will value the service less, pay less, and churn more.

Another way to think of AQM is that you have some wallpaper you are pasting onto a wall, and there is a bubble under it (the ‘bufferbloat’). You can’t get rid of the bubble, but you can smooth it out and move it around. AQM’s aim is for the bubble to move wherever you push hard; where you push lightly, the wallpaper adheres better. Should it succeed in this stated goal, it spreads the bubble out very evenly. If the bubble is big enough, the wallpaper stops adhering and falls off the wall. This is much worse than the original problem that you were trying to solve.

Apart from mispricing quality to create a risk of network collapse, AQM has three other serious inherent flaws:

  • You cannot measure performance easily. When you put (sparse) test data streams through the network, they will tell you nothing about what non-sparse flows are experiencing. But if you put a non-sparse flow into the network, you are disturbing the very system you are attempting to observe.
  • You cannot model performance easily. The mathematics of FIFO queues has been studied for decades (a worked example follows this list). AQM algorithms have no such models, although they can be simulated. Those simulations are highly sensitive to the parameters of the network (e.g. link speeds, phasing between the load and its transport). Any small change to the network could cause a non-linear effect: graceless degradation under load.
  • You cannot manage performance easily. How do you tune AQM to deliver different kinds of outcome to different applications and users? What’s the relationship between the knobs and the resulting performance? Nobody can tell you, apart from “run more dodgy simulations”.
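
To make the modelling contrast concrete: the classic single-server FIFO model (M/M/1, with Poisson arrivals at rate $\lambda$, service rate $\mu$ and utilisation $\rho = \lambda/\mu$) has long-established closed-form answers, for example the mean time a packet spends in the system and the mean queue occupancy:

$$ W = \frac{1}{\mu - \lambda}, \qquad L = \frac{\rho}{1 - \rho} $$

No comparably tractable expressions exist for the AQM schemes on the table; their behaviour is explored mainly by simulation.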

Considering these together, they have another downside: deploying AQM means that there is no way for a network operator to construct assured services. These require an outcome-based contract to deliver a guaranteed quality of experience for an application. Yet there are no models that capture the performance hazards to help you manage the risks of breaching the SLA. Nor is there a measurement approach on offer that would capture the actual service delivered over the range of loads being applied. This cuts operators off from where the future industry growth will be.

It is clear that AQM is a dud idea. Yet it will look like a good one at first, because people will show you all the great success modes they can construct… “Look at the low latency!”. This assumes no behavioural change to exploit the arbitrage, no driving into overload, and provides no model of the resulting hazards. Don’t fall for it.

If you are an investor in telecoms, you might want to ask some pointed questions on your next investor relations call. Why are operators planning to misprice their most valuable resource by giving it away for free to anyone who turns up with the right timing? Why are they configuring their networks to make them prone to sudden collapse under load? Why are they cutting themselves off from a key growth opportunity?

It’s easy to throw stones, and in the case of AQM there is a bucketful of them to lob. What should happen instead?

We have to let go of the fantasy that we can deliver success to everyone all of the time. It’s not possible in this universe. It certainly isn’t possible to create a single scheduling mechanism that satisfies all possible and diverse intentions simultaneously. That means managing failure in saturation, and having applications fail in the ‘right’ (or ‘good enough’) order.

The best (scientific) approach is as follows:

  • Start by understanding demand. What flows need what quantity of quality? When you can’t deliver it, what gives?
  • Next, understand the resource trading space of supply. There are two degrees of freedom, and given a fixed load we need to trade the loss and delay to where each can be best tolerated (see the sketch after this list).
  • Then price the options for delivery of ‘quantities of quality’ rationally. Let the market clear demand and supply for a finite resource.
  • Finally, make the trades that operationally match supply to demand, making good on the options that were sold.
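
A minimal sketch of the “trade loss and delay” step above (the classes, budgets and tolerances are invented for illustration, not a production scheduler): under a fixed load the impairment has to land somewhere, so it is deliberately steered to the flows that can best absorb it.

```python
from dataclasses import dataclass

@dataclass
class FlowClass:
    name: str
    delay_budget_ms: float   # how much queueing delay this class tolerates
    loss_tolerant: bool      # whether shedding its packets is acceptable

# Invented example classes: the point is the trade-off, not the numbers.
CLASSES = {
    "voice":    FlowClass("voice",    delay_budget_ms=20.0,  loss_tolerant=True),
    "transfer": FlowClass("transfer", delay_budget_ms=500.0, loss_tolerant=False),
}

def admit(packet_class: str, queued_delay_ms: float) -> str:
    """Decide what to do with an arriving packet given the current queueing delay.

    The trade: once the queue implies more delay than a class can stand, either
    the packet is shed (impairment taken as loss) or it is queued anyway
    (impairment taken as delay), whichever that class tolerates better.
    """
    cls = CLASSES[packet_class]
    if queued_delay_ms <= cls.delay_budget_ms:
        return "queue"                                    # within budget: delay is tolerable
    return "drop" if cls.loss_tolerant else "queue"       # over budget: choose the lesser harm

# With 50 ms already queued, voice sheds a packet rather than delivering it late,
# while a file transfer simply waits:
print(admit("voice", 50.0))     # -> "drop"
print(admit("transfer", 50.0))  # -> "queue"
```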

This is all a solved problem. The mathematics is done. The technology exists. If you’re interested in learning more, please get in touch.

For the latest fresh thinking on telecommunications, please sign up for the free Geddes newsletter.