All great truths begin as blasphemies. – George Bernard Shaw
I first met Dr Neil Davies and his crew from Predictable Network Solutions Ltd back in 2008. In the intervening years, my deepening understanding makes me believe that their ideas and technology are the single biggest paradigm shift I have witnessed in my entire technology career – one which now spans three full decades. I’d like to help you make that same journey of understanding.
It’s a journey worth making, because it shows us how the telecoms industry is misallocating tens of billions of dollars of capital, over-spending on network operational cost, and under-serving users through poor experiences.
The challenge I have is providing a path for others to also see how and why bandwidth-based thinking fails us, and contention-based thinking is a superior alternative. There are so many parts to the puzzle, so many things to un-learn, and so many counter-intuitive new principles to adopt. So if you’ll forgive me, I’ll pick a few simple highlights.
The essential problem is that the ‘pipes’ model you have in your head of packet data networks fails to match the fundamental reality of what goes on. That is because networks are not pipes, along which packets flow. Indeed, no packet has ever ‘flowed’ outside of the mind of a human. Networks don’t even have ‘bandwidth’ – that’s at best a property of individual transmission links. Instead, networks are large distributed supercomputers that take waiting packets and copy them. It sounds the same, but the combination of queues and copying makes for mind-warping results.
Applications need both quantity and quality
We want fat pipes, yes? Indeed, fatter pipes are better pipes, aren’t they?
Wrong! Here’s why. The most basic assumption of the bandwidth model in your head is wrong.
When you increase the speed of a network link, you are increasing the quantity of packets for delivery in a way that can degrade the quality that user applications experience. Indeed, more bandwidth can paradoxically make networks unusable some of the time. How come?
Well, imagine you have many users, with many devices, running many applications. Some of these applications will be sensitive to the quantity of packets, say a large file download. Others will be sensitive to the quality, and performance will drop when they experience bursts of jitter, loss and delay. This is typical of voice, video, interactive web applications and online gaming.
All these applications in turn start lots of connections. Some applications by their nature pulse traffic, and those pulses become correlated in time. For example, when you open a web page it typically initiates several simultaneous connections. These slam packets into the queues in the network, which fill up.
What happens next is that control loops like TCP detect packet loss, and try to slow down the rate of sending. And here’s the problem: those control loops can end up setting up a kind of “resonance” in the network, forcing ever more queues to fill up, especially as the ‘slow down!’ signals get lost too.
In other words, ordinary everyday traffic generates statistical patterns of flow that resemble low-bandwidth denial-of-service attacks.
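To make the mechanism concrete, here is a toy simulation of several bursty senders sharing one bottleneck queue, each halving its rate when it sees loss and creeping back up otherwise. This is my own illustrative sketch with made-up parameters, not anyone's production model:

```python
# Toy discrete-time model of a shared bottleneck queue with AIMD-style senders.
# All parameters are illustrative assumptions, not measurements.
import random

SERVICE_RATE = 100   # packets the link can forward per tick
QUEUE_LIMIT = 200    # buffer size in packets
N_FLOWS = 8          # number of competing bursty flows
TICKS = 300

rates = [20.0] * N_FLOWS   # current sending rate of each flow (packets per tick)
queue = 0

for t in range(TICKS):
    # Each flow offers a pulse of traffic; the pulses arrive together (correlated).
    offered = sum(int(r * random.uniform(0.5, 1.5)) for r in rates)
    queue += offered
    dropped = max(0, queue - QUEUE_LIMIT)   # overflow is lost
    queue = min(queue, QUEUE_LIMIT)
    queue = max(0, queue - SERVICE_RATE)    # the link drains the queue

    for i in range(N_FLOWS):
        if dropped > 0:
            rates[i] = max(1.0, rates[i] / 2)   # multiplicative decrease on loss
        else:
            rates[i] += 2.0                     # additive increase otherwise

    if t % 25 == 0:
        print(f"t={t:3d}  queue={queue:3d}  drops={dropped:4d}  total_rate={sum(rates):6.1f}")
```

Because every sender reacts to the same loss event, they all back off together and then all ramp up together. The queue swings between overflowing and sitting idle instead of settling at a steady level: that is the 'resonance'.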
Bandwidth is bad
The faster you make the network, the worse this phenomenon is, because the easier it becomes for ‘badly behaved’ applications with pulse-like traffic to crowd out all other traffic. You can get into trouble faster, but can’t recover faster. So your network collapses, and round-trip times can become anything from 500ms to 30 seconds. Yes, you read that right.
It reminds me of the old joke about slow postal services: When they charge 46p for a stamp, it’s only 10p for delivery, and 36p for unwanted storage on the way.
The underlying reality of a network is that it is like a microphone on a stage. When you increase the volume past a certain point you get nasty feedback effects. That takes you from a predictable region of operation into a chaotic one. This destroys the performance of your applications. It’s OK for your phone to buzz, but it’s not good news when your whole network rings.
This phenomenon is so counter-intuitive, it feels hard to believe. So why don’t we hear more about it? One reason is that nobody bothers to look. But in the real world, it happens all the time – especially in places like households with children or shared student accommodation, which tend to mix a greater volume and variety of traffic together. The bufferbloat phenomenon is just a special case of a problem endemic to all packet networks today.
It also doesn’t get noticed because, at the time you upgrade bandwidth, traffic loads temporarily stay the same, so you tend to keep away from the unpredictable region of operation. But over time, the load rises back up as the extra bandwidth (quantity) attracts more users, devices and applications. This heterogeneity in turn generates more of that pulse-feedback effect, and you can end up with worse application performance than before you started.
So how to fix it? You need to think very differently about networks.
Think of trading, not transmission
Networks can be thought of as systems that trade space for time. By that, we mean they provide the illusion of collapsing the world to a single point, but at the cost of smearing the traffic with delay, and in extremis with loss. Stop thinking of networks as systems for transmitting data. Networks are systems for trading load, loss and delay. This worldview is not optional: that’s the fundamental, mathematical reality.
There are three basic reasons for networks behaving badly when we saturate queues with traffic:
- Trade-offs are done at different or inappropriate layers (including between sub-classes of traffic within layers).
- Trade-offs are dispersed physically across the network.
- Trade-offs are time-lagged.
The diagram below captures these three effects.
For instance, when TCP detects a packet loss, it assumes the network is over-loaded, and backs off its rate of sending. This happens in isolation for one flow, and as a control loop it works a thousand times slower than the phenomenon it is trying to manage (a single contended link).
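A rough back-of-the-envelope calculation shows where that timescale mismatch comes from. The numbers below are my own illustrative assumptions (link speed, buffer size and round-trip time vary enormously in practice):

```python
# Rough timescale comparison: how quickly a buffer can fill versus how
# quickly TCP's loss-driven control loop can react. Illustrative numbers only.
LINK_RATE_BPS = 10_000_000_000    # 10 Gbit/s bottleneck link
BUFFER_BYTES = 128 * 1024         # 128 KB of buffering at that link
RTT_SECONDS = 0.1                 # 100 ms round-trip time for the flow

buffer_fill_time = BUFFER_BYTES * 8 / LINK_RATE_BPS   # seconds to fill the buffer
print(f"Buffer can fill in ~{buffer_fill_time * 1000:.2f} ms")
print(f"TCP reacts roughly one round-trip later: ~{RTT_SECONDS * 1000:.0f} ms")
print(f"Control loop is ~{RTT_SECONDS / buffer_fill_time:.0f}x slower than the queue it manages")
```

With these assumed figures the queue can fill and overflow around a thousand times faster than the sender can find out about it; with slower links or bigger buffers the ratio shrinks, but the control loop is always chasing events that have already happened.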
No ‘get out of jail free’ card
Applications at the edge can’t adapt to momentary, transient effects at far-distant places, or ones buried at layers of the stack they can’t access. The information about the effect cannot travel fast enough. This is a fundamental limit of the design idea that brought us the Internet in the first place: the ‘end-to-end principle’.
No amount of clever software at the edge can get you out of the problem of your network tipping from predictable into chaotic behaviour. That clever software can even induce new resonance effects and chaotic failure modes.
No amount of extra bandwidth can save you either. Indeed, that approach is going to drive the industry to bankruptcy. Bandwidth is not free: it takes energy, and we can’t afford the electricity bills for unlimited bandwidth.
Better than ‘best effort’
To escape from this problem, you need to make a simple but significant change: ideally, do all the loss and delay trading at a single layer, place and time. In practice, on real networks with multiple attachments to backbones, plus intermediate content delivery systems, you need to do it at three or four places along the end-to-end path. With some clever mathematics, you can make this process compositional, and control the end-to-end loss and delay. Then you don’t have the nasty delay-feedback loops, chaotic behaviour, and the cost of upgrades driven by the failed bandwidth metaphor.
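The essence of that ‘clever mathematics’ is that per-segment loss and delay compose. Here is a deliberately simplified sketch of the idea (my own illustration, not Predictable Network Solutions’ actual formulation), in which each segment is described by a delay distribution plus a loss probability, and the end-to-end behaviour is obtained by convolving them:

```python
# Sketch: composing per-segment delay distributions and loss probabilities
# into an end-to-end prediction. A simplified illustration only; the real
# framework is considerably richer than this.
import numpy as np

def compose(seg_a, seg_b):
    """Convolve two segments' delay distributions; survival probabilities multiply."""
    delay_pmf = np.convolve(seg_a["delay_pmf"], seg_b["delay_pmf"])
    survive = (1 - seg_a["loss"]) * (1 - seg_b["loss"])
    return {"delay_pmf": delay_pmf, "loss": 1 - survive}

# Delay expressed as a probability mass function over 1 ms bins (assumed numbers).
access  = {"delay_pmf": np.array([0.0, 0.6, 0.3, 0.1]), "loss": 0.001}
core    = {"delay_pmf": np.array([0.8, 0.2]),           "loss": 0.0001}
peering = {"delay_pmf": np.array([0.0, 0.5, 0.4, 0.1]), "loss": 0.002}

path = compose(compose(access, core), peering)

cdf = np.cumsum(path["delay_pmf"])
print("End-to-end loss probability:", round(path["loss"], 5))
print("P(end-to-end delay <= 5 ms):", round(float(cdf[5]), 3))
```

Because the composition is associative, you can reason segment by segment about whether a given end-to-end loss and delay budget is achievable, and give each of those three or four trading points its share of the budget, before any traffic is sent.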
What we get is a fundamentally new category of network architecture.
We can think of old-fashioned circuit networks as the Network of Promises. You get a fixed loss and delay profile, and a guaranteed load limit. The downside is that there is no graceful degradation as the system saturates, just a sudden cliff as new traffic is rejected. Furthermore, all traffic must pay for premier first-class delivery, whether it needs it or not, and any idle capacity cannot be resold.
The Internet can be thought of as the Network of Possibilities. Nothing is guaranteed, bar the chaos at saturation, and counter-intuitive results from adding bandwidth in the wrong way. Failure to properly understand how loss and delay accrue leads to effects like those in the example at the start of this essay.
The Network of Probabilities
We now have a third option that gives us the best of both worlds: the generativity of the Internet, plus the determinism of circuit networks.
Together, we’re going to build a better kind of Internet. The current prototype has done its job.
To keep up to date with the latest fresh thinking on telecommunication, please sign up for the Geddes newsletter