The limitations of bandwidth

I have previously shared the idea that there is a new category of network architecture, the Network of Probabilities. This differs from classical circuits (Network of Promises) or best-effort packet data (Network of Possibilities). I personally believe it’s the next revolution in telecoms. What’s new is that it provides a trading space for allocating contention between flows, and does this with some novel applied mathematics.

A bit like the progression from 2G to 3G to 4G wireless data encoding, better mathematics can squeeze a lot more out of fixed networks. Indeed, we could say there is an equivalent generational progression in multiplexed fixed networks, from TDM to ATM to IP. In this newsletter, I’d like to lead you a little further along my own journey of enlightenment to the fourth generation of fixed networking, called Contention Management. It’s feeling lonely out here right now.

The hard thing to do is to let go of your intuitive beliefs about ‘bandwidth’ in networks. Packet networks do not have ‘bandwidth’, just like the sun is not made of ‘shine’. We mustn’t mistake metaphors for reality. Indeed, in order to get more out of networks, we must transcend the approximation of bandwidth-based thinking to network reality, and adopt a more robust model that is inclusive of both quantity and quality effects.

When better bandwidth is bad

Here are three examples of what happens in real networks when you apply naïve bandwidth thinking to packet networks like the Internet.

Example 1: Your network is working fine, has lots of bandwidth available, but the users keep reporting short outages and poor bandwidth. What’s going on?

Example 2: Imagine you have a standard 20 Mbit/sec DSL line from a central exchange to your home. One day, your telco comes along and ‘upgrades’ you. Now you have a 1 Gbit/sec fibre to a street cabinet, and then say 50 Mbit/sec copper onwards to your home. The fibre is fast, and your copper loop is shorter, so bandwidth goes up. But customers are complaining, and you notice that your online gaming has worse performance than before. What’s going on?

Example 3: To speed up application performance to your holiday cottage, you bond together two links, say 3G and a satellite link. Bandwidth goes up. When you test what happens to the applications, you find terrible performance problems. What’s going on?

Buffers badly batter bandwidth

Let’s see what really happens in networks.

The first is a well-known phenomenon called bufferbloat. When a network saturates, it disrupts the control loops that TCP uses to say ‘faster!’ and ‘slower!’ to the end points of the flows. This can lead to all the queues filling up, multiple packets being lost in a row, and sudden collapses in transmission speed that users experience as transient outages. The network recovers, but only slowly. As fast memory has become cheaper, the queues in routers have grown longer, driven by the mistaken belief that it is always better to delay a packet than to drop it. That just makes the collapse bigger, and the recovery slower. And more bandwidth makes the collapses more sudden.
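
To make the mechanism concrete, here is a deliberately crude sketch in Python. It is not a model of TCP; it is just a single bottleneck queue fed 20% faster than it drains, with the buffer size as the only variable. Every number is illustrative.

```python
# A toy bottleneck queue (fluid model, 100 ms time steps).
# Illustrative numbers only: 120 packets/s offered into a 100 packets/s link.

def run(buffer_pkts, arrival_rate=120, service_rate=100, seconds=60):
    queue = 0.0                                   # packets waiting at the bottleneck
    dropped = 0.0
    worst_delay = 0.0
    for _ in range(seconds * 10):                 # 100 ms steps
        queue += arrival_rate * 0.1               # offered load this step
        queue -= min(queue, service_rate * 0.1)   # what the link can drain
        if queue > buffer_pkts:                   # tail-drop once the buffer is full
            dropped += queue - buffer_pkts
            queue = buffer_pkts
        worst_delay = max(worst_delay, queue / service_rate)
    return dropped, worst_delay

for buf in (50, 500, 5000):                       # a modest buffer vs increasingly 'bloated' ones
    loss, delay = run(buf)
    print(f"buffer={buf:5d} pkts  dropped={loss:6.0f} pkts  worst queuing delay={delay:4.1f} s")
```

Running it shows the trade directly: the small buffer drops packets early and keeps delay bounded, while the ‘bloated’ one drops nothing and lets the standing queuing delay climb to many seconds before anything collapses.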

When the telco upgraded from a single copper loop to fibre plus copper, it inserted an extra queue. This added new delay effects that undid all the benefits of additional bandwidth for delay-sensitive applications. Furthermore, it allowed ‘greedy’ applications to stuff that queue with pulsed traffic, which raised loss and delay for better-behaved applications. Hence customer experience got worse, despite more ‘bandwidth’.
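
Again a toy sketch, with invented numbers: a latency-sensitive gaming flow sharing the new cabinet queue with a bulk flow that dumps a burst of packets once a second. The only thing the burst size changes is how long the game’s packets sit behind it.

```python
# A latency-sensitive gaming flow sharing one FIFO queue at the (hypothetical)
# street cabinet with a bursty bulk flow. Fluid model, 100 ms steps;
# every number here is illustrative, not a measurement.

def shared_fifo(burst_pkts, link_rate=50, seconds=10):
    queue = 0.0
    worst_wait_ms = 0.0
    for step in range(seconds * 10):
        queue += 1                                # gaming flow: 10 small packets/s
        if step % 10 == 0:
            queue += burst_pkts                   # bulk flow dumps a burst each second
        # wait faced by a game packet arriving now, behind everything already queued
        worst_wait_ms = max(worst_wait_ms, 1000 * queue / link_rate)
        queue -= min(queue, link_rate * 0.1)      # cabinet link drains 5 packets per step
    return worst_wait_ms

for burst in (0, 20, 100):
    print(f"bulk burst {burst:3d} pkts/s -> worst wait for a game packet "
          f"≈ {shared_fifo(burst):7.0f} ms")
```

The game’s own traffic never changes; only its neighbours in the queue do, and its worst-case wait goes from tens of milliseconds to seconds.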

If you take two network links and bond them, you can run into trouble in multiple ways. For a start, you have done nothing to improve the delay characteristics of the new ‘synthetic’ combined link. If you fire packets randomly down one link or the other, you get order-reversal, which TCP treats as loss, so it slows down. If you send all the packets of a flow down the same link, they still self-contend, but the application may face unexpectedly different characteristics for each of its flows. So the different loss and delay for the audio and video of a Skype call may seriously confuse the encoding algorithm. Furthermore, any outage or transient saturation effect, even a momentary one, may cause odd oscillations in the traffic that create a poor user experience.
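
The order-reversal effect is easy to see on paper. The sketch below sprays one flow’s packets round-robin across two links with made-up one-way delays (60 ms for the 3G path, 550 ms for the satellite path) and looks at the order in which they arrive.

```python
# Round-robin 'bonding' of one flow across two links with very different
# one-way delays. The delays are invented for illustration.

link_delay_ms = {"3G": 60.0, "satellite": 550.0}

arrivals = []
for seq in range(20):                              # 20 packets sent 10 ms apart
    link = "3G" if seq % 2 == 0 else "satellite"   # naive alternating spray
    send_time_ms = seq * 10.0
    arrivals.append((send_time_ms + link_delay_ms[link], seq))

arrivals.sort()                                    # order seen by the receiver
late = 0
highest_seen = -1
for _, seq in arrivals:
    if seq < highest_seen:                         # a higher sequence number already arrived
        late += 1
    highest_seen = max(highest_seen, seq)

print("arrival order:", [seq for _, seq in arrivals])
print(f"{late} of 20 packets arrive behind a higher-numbered packet")
```

Classic TCP receivers signal such gaps with duplicate acknowledgements, and deep enough reordering triggers the sender’s loss-recovery behaviour even though nothing was actually lost.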

As you can see, the explanations require looking at the properties of the queues over short time periods; none of the problems was the result of a lack of bandwidth. These aren’t isolated edge cases; the problems are endemic.

Networks all have failure modes.

The questions are: how big are they, how do we manage them, and at what cost?

Consider contention before bandwidth

We get failure modes as a result of a lack or misallocation of resources to perform the functions the user desires. The fundamental resource of a network is not bandwidth, but rather contention. Contention is what happens when a packet sits around waiting for other packets, or has nowhere to sit and is lost.

We name this composite of loss and delay ‘quality attenuation’. It’s analogous to noise in a transmission link, but defined so that it is meaningful for multiplexed networks. There is an algebra of how the loss and delay of packet flows can be decomposed, and this is not the place to describe it. Just accept that there’s a nice formula to define and describe quality attenuation.
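
Purely as a sketch of the kind of object involved (and not necessarily the exact algebra the original work uses), you can picture quality attenuation as an ‘improper’ probability distribution of packet delay: the mass that never arrives is the loss, and composition along a path can only add to it.

```latex
% Sketch only: one possible way to formalise quality attenuation (\Delta Q).
% Delay and loss live in a single object: an improper distribution whose
% missing probability mass is the loss rate.
\[
  \Delta Q(t) = \Pr[\text{packet delivered with delay} \le t],
  \qquad
  \lim_{t \to \infty} \Delta Q(t) = 1 - p_{\text{loss}} \le 1 .
\]
% Along a path the per-hop attenuations compose (delays convolve, the surviving
% probability mass multiplies), so attenuation only ever accumulates:
\[
  \Delta Q_{\text{path}} = \Delta Q_{1} \ast \Delta Q_{2} \ast \cdots \ast \Delta Q_{n}.
\]
```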

Now we’re in a position to move packet networking from alchemy to chemistry by laying down some basic principles.

The three laws of networking

Rather like the laws of thermodynamics, there are three fundamental laws of networks.

    • Quality attenuation exists: Statistically-multiplexed networks are systems with three parameters (load, loss and delay) and two degrees of freedom (typically loss and delay, with load being exogenous). A network is a bit like a piston with pressure, volume and temperature. Set any two, and the third is set for you.
    • Quality attenuation is conserved: Loss and delay are conserved. In other words, we can’t un-delay a queued packet, or un-lose a dropped one. This conservation works in two ways: attenuation is conserved along any one path, and also at any piece of equipment. (There’s a small sketch of path composition just after this list.)
    • Quality attenuation is tradable: There is a trading space for loss and delay. At every edge node and transmission link they can be exchanged without increasing overall quality attenuation. However, any other form of trading – between different places, or at different times – inevitably does increase it.
These are not opinions, but are provable matters of mathematical fact.
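
To see the conservation law in numbers, here is a minimal Python sketch with invented per-hop figures: two hops in series, each modelled (an assumption of the sketch, nothing more) as an independent pair of delay range and loss probability. Whatever the second hop does, it cannot remove the delay or the loss the first hop has already imposed; it can only add its own.

```python
# Two hops in series, each modelled as an independent (delay range, loss
# probability) pair. All figures are invented for illustration.
import random

def hop(delay_ms_range, p_loss):
    """Send one packet across a hop: returns (delivered?, delay in ms)."""
    if random.random() < p_loss:
        return False, None                    # lost: no finite delay exists
    return True, random.uniform(*delay_ms_range)

def path(packets=100_000):
    delivered, total_delay = 0, 0.0
    for _ in range(packets):
        ok1, d1 = hop((1, 5), 0.001)          # access hop: 1-5 ms, 0.1% loss
        ok2, d2 = hop((10, 40), 0.01)         # congested hop: 10-40 ms, 1% loss
        if ok1 and ok2:
            delivered += 1
            total_delay += d1 + d2            # path delay is the sum of hop delays
    print(f"path loss ≈ {1 - delivered / packets:.2%}, "
          f"mean path delay ≈ {total_delay / delivered:.1f} ms")

random.seed(1)
path()
```

The per-hop losses compound and the per-hop delays add; there is no operation available downstream that makes either number smaller.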

The trouble with telecom

The telecoms industry is in trouble, because it is fighting all three fundamental laws. Needless to say, in a fight between management and mathematics, the latter always wins.

    • Telcos have no idea where quality attenuation is happening. They don’t look for it, don’t model it right when they do, and don’t know what to do when they see quality problems. All too often, the prescription is ‘more bandwidth’. It’s the equivalent of medieval leeches, but attached to your capex budget.
    • Telcos try to put quality back into the system when it’s already lost. Quality is a bit like darkness or quiet. You can’t go out and get a box of dark to make it un-light, or a spray can of quiet for a noisy place. Likewise, you can’t put quality back in once you’ve lost it. Techniques to hide poor quality, like anti-jitter buffers, just raise quality attenuation. Network compression boxes add more queues and quality attenuation. Content delivery networks have unexpected quality side-effects. Application-layer cleverness to adapt to quality issues just sets off oscillations that push networks into overdrive failure modes.
    • Telcos trade quality badly. They can’t sell quality of service, because they don’t know how to make it. Priority is about giving ‘more of the bandwidth’ to an application, as opposed to trading loss and delay between flows. The problem is, when you use priority, you typically give the prioritised flows far more quality than they need. Given the law of conservation, you’ve got less to give to other flows. What typically happens is that the non-priority flows enter failure modes and collapse easily. So telco QoS doesn’t work at any sensible cost. (There’s a small sketch of this right after the list.)
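
Here is that effect as a toy strict-priority scheduler in Python, with all rates invented. The best-effort class never changes its behaviour; as the priority class is given more of the link, the best-effort queue is the one that collapses.

```python
# A toy strict-priority link: the priority class is always served first,
# best effort gets whatever is left. Fluid model, 100 ms steps, invented rates.

def strict_priority(prio_rate, be_rate=30, link_rate=100, seconds=60):
    prio_q, be_q, be_worst = 0.0, 0.0, 0.0
    for _ in range(seconds * 10):
        prio_q += prio_rate * 0.1
        be_q += be_rate * 0.1
        capacity = link_rate * 0.1
        served = min(prio_q, capacity)            # priority drained first
        prio_q -= served
        be_q -= min(be_q, capacity - served)      # best effort gets the leftovers
        be_worst = max(be_worst, be_q / link_rate)
    return be_worst

for prio_load in (30, 60, 90):                    # priority share of a 100-unit link
    print(f"priority load {prio_load:2d}: worst best-effort queuing delay "
          f"≈ {strict_priority(prio_load):4.1f} s")
```

Nothing here says priority is useless; it says that once the prioritised class takes more than it needs, the conservation law guarantees someone else pays, and they pay in failure modes rather than gracefully.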
This is not the network you are looking for

It’s worse than you think.

Telcos all over the world are splurging capex on unnecessary network upgrades to paper over what are often quality issues. So the first thing they do is build out fast, fat pipes, and sell them.

And sell them. And sell them. They are then over-selling the capacity, in the mistaken belief that what they are selling is bandwidth. But you run out of quality a long time before you run out of bandwidth. Applications collapse and customers complain – and the network doesn’t yet appear to be ‘full’. That was never in the business case. Are you still sure this is a safe utility stock?

And when you do try to explicitly package and sell quality, to mitigate the collapse effects, you get an effect called ‘quality inversion’. It’s cheaper for customers to buy a fatter, faster pipe with lower packet service times than to buy the ‘quality-assured’ one. That’s a by-product of seeing quality through a bandwidth lens, and mispricing it as a result.

Bye-bye to bandwidth

The bandwidth approach has no means for modelling or managing the failure modes of multiplexed networks. Indeed, it takes infinite bandwidth at infinite cost to have no failure modes. The contention model lets you manage the failures, at a finite cost. Sounds like a good alternative, no?

In a world where capex is constrained, and demand is not, we’re going to see an inevitable shift towards getting more out of what we have. The financial and network maths tells us we must manage the true fundamental resources of the network, not fantasy ones.

At the end of the day, there’s no contention. Bandwidth is bust.

To keep up to date with the latest fresh thinking on telecommunication, please sign up for the Geddes newsletter