Networking and the Internet’s ‘global warming’ problem

In this article I am proposing that there is a hidden ‘climate catastrophe’ built into the resource-guzzling model the Internet embodies. This ‘network warming’ is a form of statistical noise or ‘background hiss’ that progressively degrades TCP/IP’s ability to function, and causes increasing application failure over time.

I expect to get attacked for the heresy of suggesting that the Internet isn’t quite as good an idea as it’s cracked up to be. That’s OK. I have a good supply line of chocolate to keep my mood up!

A brief history of the networking universe

Back in the early 1970s, there was a lot of dispute over the ‘right’ way to build networks. Many competing ideas emerged, which today are broadly characterised by two schools of thought. The first is the ‘bell-head’ philosophy of ‘smart networks’, engineered to deliver deterministic behaviour and meet specific performance targets. This contrasts with the ‘net-head’ belief in ‘stupid networks’, the emergent discovery of uses, and dealing with performance problems on an ad-hoc basis. Each was staunchly defended by its respective proponents (and both have their serious problems).

This debate continued through the 1980s and 1990s with the OSI telecoms standards, and wars over connection-oriented vs connectionless packet networks. The telecoms industry attempted to unify the whole domain with technologies like ATM. As we now know, that largely failed, and the net-head view has been in the ascendancy for some time. TCP/IP became the more virulent networking idea, and simply spread faster than its rivals.

The notable exception remains voice services, which either stay firmly attached to dedicated TDM networks, or run over single-service IP networks that do their best to behave like TDM ones (but nobody wants to admit it).

The Internet is a cheap trick, albeit quite a good one

TCP/IP removes the need for flow control in the network, and substitutes congestion control at the network edge. (For brevity I’m focusing on TCP and ignoring UDP and μTP – the underlying issues are the same.) What that means is that the network is not aware of the relationship between subsequent packets, nor does it have any concept of being ‘full’. Instead it moves each packet one at a time in isolation, and discards packets when too ‘busy’ without telling anyone. This is the essence of the ‘stupid network’ idea and ideal.
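To make the ‘stupid network’ concrete, here is a minimal sketch in Python of the only resource behaviour such a network exhibits: a drop-tail queue that forwards packets one at a time and silently discards them when full. The class and names are illustrative, not drawn from any real router implementation.

```python
# A minimal sketch of the 'stupid network': no notion of flows, no flow
# control, no signalling. It forwards packets one at a time and silently
# drops them when its buffer is full. Names and sizes are illustrative.
from collections import deque

class DropTailQueue:
    def __init__(self, capacity):
        self.capacity = capacity        # 'too long', an arbitrary constant
        self.buffer = deque()

    def enqueue(self, packet):
        if len(self.buffer) >= self.capacity:
            return False                # silent discard: nobody is told
        self.buffer.append(packet)      # no inspection of flow or ordering
        return True

    def dequeue(self):
        return self.buffer.popleft() if self.buffer else None
```

Note what is absent: there is no ‘full’ signal to senders, and no memory of which flow a dropped packet belonged to.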

TCP’s congestion control approach relies on endpoint protocols being sufficiently co-operative to share the network resource, without the network resource itself taking an active part in that process (beyond dropping packets when queues get too long, for some arbitrary and unspecified value of ‘too long’).

TCP then uses the detection of packet loss within one flow as a proxy for saturation of the network, and as a signal for that flow to back off its rate of transmission. Significantly, this means putting a control loop at the edge of the network, between the sender and receiver. This control loop responds to conditions that earlier packets in the same flow (and no other flow) experienced in transiting the network. Note that this intrinsically involves a time delay: “slow down now, because the network resource was saturated earlier”. The assumption was that flows would always somehow ‘find their level’ as they competed for resources.
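This is the familiar additive-increase/multiplicative-decrease (AIMD) behaviour of Reno-style TCP. A simplified sketch, with illustrative constants:

```python
# A simplified sketch of the edge control loop: Reno-style AIMD. The flow
# probes upward each round trip, and halves its window when it infers,
# from a lost packet, that the network *was* saturated earlier.
def aimd_step(cwnd, loss_detected):
    if loss_detected:
        return max(1.0, cwnd / 2.0)  # 'slow down now, it was saturated earlier'
    return cwnd + 1.0                # probe for more capacity, one segment per RTT

cwnd = 1.0
for loss in [False] * 10 + [True] + [False] * 5:
    cwnd = aimd_step(cwnd, loss)     # ends around 10.5 segments
```

The only input to the loop is stale, flow-local loss information; the network itself contributes nothing but silence.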

The benefit of this approach was a minimal coupling between applications and networks, and a maximal opportunity for TCP/IP to use any and all connectivity to spread further. The absence of admission control, and associated gatekeepers and charging mechanisms, took TCP/IP outside the paradigm of the settlement system of the established telecoms players. It didn’t need to play by their rule book. By the time the Internet was established as a commercial phenomenon, it was too late for telcos to stop it.

However, as is the case with all cheap tricks, even the good ones, there is a price to pay. Thus far the price has been hidden by rising speeds, and smarter applications. However, it can’t be deferred forever, and it’s a form of pollution that is going to be difficult and expensive to fix.

The problem of edge congestion control

Those control loops broke the most basic principle of control theory: you shouldn’t attempt to use a control loop to manage a phenomenon that happens faster than the control loop can operate. That’s because bad things can happen when you do. How bad? Well, let’s see…
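Here is a toy illustration of why that rule matters, in Python with illustrative constants: a controller tries to hold a queue at a target level, but only sees the queue’s state after a feedback delay. With prompt feedback it settles; once the delay exceeds the timescale of the dynamics, it swings between extremes.

```python
# A toy demonstration of delayed-feedback instability: the controller
# acts on stale observations of the queue it is trying to regulate.
def simulate(delay, steps=60, gain=0.8, target=10.0):
    q, history = 0.0, []
    for t in range(steps):
        # the controller only sees the state from 'delay' steps ago
        observed = history[t - 1 - delay] if t - 1 - delay >= 0 else 0.0
        q = max(0.0, q + gain * (target - observed))
        history.append(q)
    return history

settled = simulate(delay=0)   # converges smoothly towards the target
chaotic = simulate(delay=5)   # overshoots, then oscillates between extremes
```

The code changes in only one place (the delay), yet the behaviour flips from stable to wildly oscillating. TCP’s control loop has exactly this structure, with the round-trip time as the delay.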

Today’s Internet is very different from the ARPANET of the early 1970s. We no longer just have Unix remote terminals and file transfer. Now we have everything from two-way real-time video to P2P file sharing to offline backups of hard drives. The diversity of quality and cost needs is very great, and society increasingly depends on the Internet working, and staying working.

That growth in application types means every queue in the Internet is seeing a greater number of concurrent flows. Furthermore, those flows do not conform to simple and smooth statistical arrival rate models. Indeed, some have strong pulsed phasing, like Netflix or iPlayer waking up every second to buffer more video.

So to recap: more flows, more variation in need, and more variation in waveforms of arrival.

However – and this is the really critical bit – we are also seeing a decreased isolation between flows, both intra-user and inter-user. Once you might have had hundreds (or even thousands) of dial-up modems attached to one multiplexing point. Now you have a cabinet in the street for ten homes, and those cabinets are then aggregated together in small numbers. Our ability to noticeably out-compete our neighbours keeps growing.

The ability for one user or device to send traffic that disturbs the flow of others by causing temporary saturation is rising.

This is exacerbated by how TCP/IP interacts with queues at high loads. These queues make packets condense and clump together in a ‘hailstone effect’ as the network saturates. (The details of how this happens deserve an essay of their own.) Furthermore, the TCP control packets themselves tend to get lost as the network saturates, so the ‘slow down!’ messages go missing, and the trains of packets get even clumpier.
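The clumping mechanism itself is easy to see in miniature. In this sketch (with illustrative numbers), packets that arrive with comfortable gaps leave a busy bottleneck back-to-back at the service rate, so the original smooth spacing is destroyed downstream:

```python
# How a busy queue 'clumps' traffic: whenever a backlog forms, departures
# are spaced by the service time, not by the original arrival spacing.
def departures(arrivals, service_time):
    out, free_at = [], 0.0
    for a in arrivals:
        start = max(a, free_at)        # wait if the queue is still busy
        free_at = start + service_time
        out.append(free_at)
    return out

arrivals = [0.0, 1.0, 2.0, 2.1, 2.2, 2.3, 4.0]   # a burst arrives at t=2
deps = departures(arrivals, service_time=0.5)
gaps = [round(b - a, 2) for a, b in zip(deps, deps[1:])]
# gaps -> [1.0, 1.0, 0.5, 0.5, 0.5, 0.5]: spacing collapses to the service time
```

Each hop a burst traverses can only preserve or worsen this compression, never undo it, which is why the clumping accumulates rather than dissipates.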

So to recap again: more flows, but less isolation between flows; and TCP/IP needs to get its control information back quickly and reliably in order to work, yet that information is sabotaged by being transmitted via the same medium as the very contention it is trying to signal.

The result: flow collapse

The basic premise of the TCP/IP control loop is that the ‘network weather’ in future (‘how humid/saturated will it be’) corresponds to the information it is receiving from the past (‘we lost a packet!’) and that those signals from the past will arrive promptly and reliably.

This wish is sabotaged by the combination of the above:

  • Dropping isolation between users, resulting in greater inter-user flow contention.
  • An increased number of flows per device, and devices per access point, meaning rising intra-user contention.
  • Increased correlation between flows, so you get a ‘Millennium Bridge’ phasing effect.
  • An increased number of flows with ‘aggressive’ pulsed traffic patterns that disrupt smoother flows.
  • A design flaw in (nearly) all queues that causes phasing and correlation to increase rather than dissipate.

When all these happen together, the network collapses in a ‘flash crash’ and becomes overwhelmed by the failure load of delayed and retried packets.

By putting the control loop as wide as possible, the Internet is architected to have the least possible level of stability under load.

It’s a bit like deciding whether to take an umbrella with you based on what the weather was like earlier. It only works if you wait a short while, and the weather doesn’t change very fast. Break either assumption, and you get saturated, soaked and sick. It’s even worse if you have messenger servants who go out to check the weather for you at the destination you want to go to, and have a habit of stopping off for a pint or five of beer on the way back. You also find that everyone goes out right after it was sunny, and when it comes to rain you can’t get indoors quick enough due to the crowds.
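That ‘everyone goes out right after it was sunny’ crowd effect is easy to reproduce in miniature. In this sketch (illustrative constants), several AIMD flows share one bottleneck, all see loss in the same round, and all halve together, so the aggregate swings between empty and full instead of finding a steady level:

```python
# A toy model of synchronized back-off: flows sharing one drop-tail
# bottleneck all receive the loss signal in the same round and halve
# in unison, producing a sawtooth in the aggregate load.
def simulate(n_flows=10, capacity=100.0, rounds=40):
    cwnds = [1.0 + i * 0.1 for i in range(n_flows)]   # slightly staggered start
    totals = []
    for _ in range(rounds):
        congested = sum(cwnds) > capacity             # one shared loss signal
        cwnds = [w / 2 if congested else w + 1 for w in cwnds]
        totals.append(sum(cwnds))
    return totals   # a synchronized sawtooth, not a smooth 'found level'
```

Note that the initial stagger between flows is erased rather than preserved: halving shrinks the differences every cycle, so the flows fall further into lock-step over time.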

I’m pickin’ up bad vibrations

TCP only works when loss and delay are relatively stationary. Lose that stationarity, and the cheap trick stops working. The Internet’s Achilles’ heel is a rising variability in loss and delay.

This is more complex than just ‘jitter’: it is the rate at which loss and delay vary, and how they vary in tandem (not their absolute values), that disrupts the control loops. When many TCP control loops oscillate in sync you get flow collapse and chaotic network behaviour. You can upgrade to a faster network, but even if the loss and delay drop in absolute terms, should their rate of change increase, your flow throughput will fall. Fat pipes are not enough; contention effects can overwhelm bandwidth effects.
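One well-known result makes the dependency concrete: the Mathis et al. approximation for steady-state TCP Reno throughput, rate ≈ (MSS/RTT)·√(3/2)/√p. Crucially, it is only valid when the loss probability p is stationary – which is precisely the property being eroded. A sketch with illustrative figures:

```python
# The Mathis et al. approximation for steady-state TCP Reno throughput.
# It assumes a *stationary* loss probability p; under bursty, non-stationary
# loss the formula only sees the mean, and so over-promises.
from math import sqrt

def mathis_throughput(mss_bytes, rtt_s, loss_prob):
    return (mss_bytes * 8 / rtt_s) * sqrt(1.5) / sqrt(loss_prob)  # bits/s

# 1460-byte segments, 50 ms RTT, 0.1% average loss:
print(mathis_throughput(1460, 0.05, 0.001) / 1e6)  # ~9.0 Mbit/s on paper
```

Two networks with the same average loss rate can deliver very different throughput if one delivers its losses in correlated bursts, because the control loop responds to the timing of the signal, not its long-run mean.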

Over time, the Internet is likely to work less well for loss- and delay-sensitive applications. The cause is a rising statistical noise that is akin to global warming. Its source is the ‘pollution’ produced by everything you do on the Internet: for lack of isolation, every flow contends with other traffic and disturbs the control loops of all the other flows.

The Missouri question: “show me”

All the car number plates in Missouri (where my daughter #1 was born) carry the sceptical motto of the local citizenry: the ‘Show-Me State’. The natural question to ask here is “where’s the data to support this contentious argument?”.

Well, it’s a complex mix. There exists real data from real telcos where these effects occur (all of which is under NDA); experience outside of telcos of how multiplexing and phasing effects don’t fit people’s intuition in the standard ‘bandwidth’ model; a ton of reasoning and maths; and some anecdotal supporting stories.

Ultimately, I have confidence in it because TCP/IP does break the fundamentals of control theory in three critical ways:

  • It distributes control over loss and delay across space, rather than exercising it at a single point.
  • It distributes control over loss and delay across time, rather than exercising it at a single moment.
  • It treats networks as uni-dimensional (send/not send) rather than as two-dimensional trading spaces between loss and delay.

So it’s a question of when, not if, the Internet begins to show some unpleasant consequences of the design short-cuts taken decades ago.

However, at the moment nobody knows when it goes from “getting better all the time” to “what went wrong?”. Nobody is collecting the relevant longitudinal data on a mass scale and analysing it in the right way. It’s a complex interaction between usage patterns, network architecture and improved transmission technology deployment.

No get out of jail free card

There are four ways forward:

  • Apply a lot of temporary fixes to hide the problem. These will just defer and increase the size of future failure modes.
  • Wait, and live with the failure. Just let the Internet become ‘reliably unreliable’, but with no knowledge of when and how it will fail. Will anybody in London be able to get useful work done during the Olympics when working from home? Who knows! (Cynics will ask: who cares!)
  • Build a single-service network for every application you really care about. This will become expensive, quickly. But network equipment vendors will love you.
  • Put some kind of flow control back into the network. This is what is being retro-fitted using policy boxes across the Internet, but in ways that don’t always solve the problem, sometimes make it worse, and are very costly when done in a piecewise fashion.

Note that the collapse effect cannot be solved just by adding more bandwidth, since you can’t engineer a network that goes from ‘fat’ to ‘fatter’ in every direction, and you can’t afford infinite capex and opex to contain the effect. (Beware – that won’t stop Cisco from selling it to you.) Furthermore, long round-trip times force you into the TCP maximum window size, and that then caps throughput, as the arithmetic sketch below illustrates. Your streaming video still stutters, even if you have fibre, if rival applications inject packets in just the wrong way.
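The window-size ceiling is simple arithmetic: TCP can never have more than one window of data in flight per round trip, so throughput is capped at window/RTT regardless of link capacity. The classic 64 KiB window (without window scaling) makes the point:

```python
# Throughput <= window / RTT, no matter how fat the pipe is.
# Illustrative figures for the classic 64 KiB window without window scaling.
window_bits = 64 * 1024 * 8      # 64 KiB receive window
for rtt_s in (0.01, 0.1, 0.3):
    print(f"RTT {rtt_s * 1000:>5.0f} ms -> max {window_bits / rtt_s / 1e6:.1f} Mbit/s")
# RTT    10 ms -> max 52.4 Mbit/s
# RTT   100 ms -> max 5.2 Mbit/s
# RTT   300 ms -> max 1.7 Mbit/s
```

So a fibre connection to a distant or delay-inflated server can be slower, per flow, than a modest connection to a nearby one, and queueing delay caused by rival traffic pushes the RTT (and hence the ceiling) in the wrong direction.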

The traditional telco approach embodied in IMS and similar technologies is hopelessly wrong, but that’s another essay entirely, and one I have written too many times already.

Or pick a ‘chance’ card

There is an alternative, which is to use some radical new applied maths to manage your network completely differently, and fix all these problems. If you’d like to be the first to give it a try, please do get in touch. Enough said.

Warning! The network may be hot!

The Internet’s ‘global warming’ problem is something that has long-term consequences for technology and policy.

For technology, the need is to recognise that you can’t solve the problem by building better control loops. Indeed, the very issue is the existence of control loops whose round-trip time is longer than the timescale of the phenomenon being managed. I’m sorry, but your cheap trick has been exposed.

The problems of bufferbloat in networks are secondary to the basic issue of broken control theory. You can’t solve the problem at the same level as the symptoms. You have to go up a logical level.

We also need to progress from focusing exclusively on bandwidth to also incorporating ‘stationarity’, so that where control loops exist, they have sufficient predictability to avoid chaotic flow breakdown. That implies different regulatory measurements, and different policy prescriptions, since the fastest networks aren’t necessarily the most fit-for-purpose ones. Regulations on network neutrality could have some extremely undesirable unintended consequences by preventing the reasonable management needed to avoid flow breakdown and network collapse.

Conclusion

The long-term prognosis for the Internet is a kind of ‘thermodynamic death’, as it ignores the fundamental laws of networking. Nobody knows how quickly that will happen. However, being ‘capriciously unreliable’ is too costly for running society-critical applications. Over time we will have to rediscover the virtues of managing flows inside of networks, for the simple reason that the physics of networking insists on it.

To keep up to date with the latest fresh thinking on telecommunication, please sign up for the Geddes newsletter