Capacity is not king: schedulability is supreme
As mobile network markets mature there is pressure to reduce costs. A natural step is to share infrastructure between mobile network operators (MNOs). However, that sharing brings with it new hazards, part of a family of hazards that has been emerging as 3GPP deployments have evolved. This article, co-written by me and Dr Neil Davies, highlights the kinds of problems that are emerging.
A brief history of mobile supercomputing
Running a successful mobile network is about managing the operation of a large distributed supercomputer, one with multiple interacting control loops between the computation processes. Such networks deliver tens of millions of application outcomes per hour, spread across millions of concurrent users, and each outcome must deliver an acceptable quality of experience (QoE).
All of those successful outcomes are reliant on the timely movement of information (be it data or voice) between two end points. By looking back at the history of these systems and their emergence we can see how we are putting that success at increasing risk.
The GSM standards and initial network implementations were all designed and deployed for wired backhaul networks built using time-division multiplexed (TDM) circuits. Such circuits have well-defined properties and, given their telephonic history, were designed to deliver practically no variation in the transit time for the movement of the data between the structural network elements. This data was not yet packetized, but instead was transmitted as individual octets.
As volume requirements grew, the nature of the backhaul had to evolve. Asynchronous Transfer Mode (ATM) provided for these growing capacity requirements. As the name implies, it loosens the timing requirements compared to the synchrony of TDM. Because of its historical roots, ATM provided mechanisms for minimising the variation in the transit time – now for full protocol packets – between the structural elements.
The emergence of new hazards
It was at this point that the nature of the hazard space changed. The particular hazard is that of schedulability. Not only do we have to care whether there is enough capacity for the data; we must also sequence the transmissions so that everything arrives in sufficient time. Schedulability is critical to successful outcomes, as there are timing constraints on the interactions between the computational processes within this distributed supercomputer. Breaking those constraints starts to affect delivered QoE, and breaking the wrong constraints affects service stability. Break the wrong constraints in the wrong way, and your network collapses.
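A minimal sketch (not from the article, with invented numbers) of why capacity alone is not enough: a link that is only 50% loaded on average can still miss timing constraints when arrivals are bursty.

```python
# Toy illustration: a link with ample average capacity can still break
# timing constraints when packets arrive in bursts. All figures invented.

def transmit(arrivals, service_time):
    """FIFO link: returns the departure time of each packet."""
    departures = []
    free_at = 0.0  # time at which the link next becomes idle
    for t in arrivals:
        start = max(t, free_at)       # wait if the link is busy
        free_at = start + service_time
        departures.append(free_at)
    return departures

service = 1.0   # 1 ms to serialise one packet
deadline = 2.0  # each packet must arrive within 2 ms of being sent

# Smooth arrivals: one packet every 2 ms, so the link is 50% loaded.
smooth = [i * 2.0 for i in range(8)]
# The same 8 packets, same 50% average load, arriving as a single burst.
burst = [0.0] * 8

for name, arrivals in [("smooth", smooth), ("burst", burst)]:
    delays = [d - t for t, d in zip(arrivals, transmit(arrivals, service))]
    late = sum(1 for d in delays if d > deadline)
    print(f"{name}: worst delay {max(delays):.1f} ms, {late} of 8 packets late")
```

Both traces offer the same average load, yet the bursty one misses most of its deadlines: the capacity was sufficient, but the schedule was not.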
In the TDM world all the schedulability risk firmly rested with the mobile network equipment manufacturer. The only difference between the test lab and the real world was propagation delay, and this delay changed simply with distance. If there was any hazard related to this delay it became immediately apparent at installation. The nature of TDM meant that, once mitigated, the hazard did not reappear.
In the ATM world a new schedulability risk arose. The protocol packets intermingled with other packets, both for other customers, as well as for other services. There was a resultant variation in their timing which did not exist with TDM. ATM’s designers knew that timing was important: after all, they wanted to run voice calls, and emulate TDM’s outcomes for their voice systems. They designed their networks to give assured levels of delay variation to different classes of service. To do this they built appropriate mechanisms into their switches, invested in control systems, and expended effort on how to configure their network.
These network engineers had a history of applying advanced mathematics to their problems. Indeed, they had even caused fundamental advances in that mathematics to occur as a result of their efforts. They lived in a world where if you broke the telephone service, then you lost your job, so the incentives to deliver good user outcomes were clear.
Got a problem? Don’t blame the network supplier!
The mobile network equipment manufacturers exploited those capabilities, and mapped different types of interactions within their distributed supercomputer onto different traffic classes. They also insisted that you must allocate the ATM network capacity to (at least) the peak of your interaction requirements. They specified these capacity planning rules because they didn’t want to take on any of the associated schedulability risk. This risk arises when arriving packets exceed (even for the briefest time) the system’s ability to forward them.
As such, network equipment providers avoided having to deal with the ensuing delay and loss, and its impact on application outcomes and QoE. Their computational interactions were designed and built in the context of the ideal properties of the TDM network. The industry (or at least the non-technical management) had no apparent issue with this. As far as they were concerned the connectivity properties had not changed between TDM and ATM. After all, they even had the same product name: “circuits”. It is just that the ATM ones were now (somehow) “virtual”. In other words, they were not circuits, but we had yet to face up to that fact.
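The vendors’ peak-allocation rule can be sketched numerically; the demand figures below are invented for illustration.

```python
# Toy sketch of the vendors' sizing rule: provision backhaul for the PEAK
# of the interaction requirements, not the mean. Demand figures (units of
# traffic per interval) are invented for illustration.

demand = [12, 15, 11, 48, 14, 13, 50, 12]  # bursty interaction traffic

mean_rate = sum(demand) / len(demand)
peak_rate = max(demand)

# Sizing to the mean looks cheap but guarantees transient overload in any
# interval where a burst exceeds the allocation; sizing to the peak is
# what shifts the schedulability risk away from the equipment vendor.
overloaded = sum(1 for d in demand if d > mean_rate)
print(f"mean-sized link: {mean_rate:.1f} units "
      f"(overloaded in {overloaded} of {len(demand)} intervals)")
print(f"peak-sized link: {peak_rate} units (never overloaded)")
```

The gap between the two figures is exactly the extra capacity the operator was told to buy in order to keep the vendor off the hook.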
The magic pixie dust of Internet Protocol
Still the capacity requirements rose, and twice over: once for the extra application demands, and once again for the extra internal network overheads.
On top of this, a new world started to arise, founded first on Ethernet, and then on TCP/IP. These technologies have their true heritage in local area networking, and in the honourable art of tinkering around until something seems to work. In this environment there were only packets flying about, and neither assurances of delivery nor sequencing of arrival. The network forgot it was involved in delivering flows, and everything became about delivering individual packets, each without reference to overall outcomes or schedulability requirements.
They named this approach, with a wicked sense of irony, “best effort”.
These technology tinkerers were young, they were dynamic, they were well-funded, and they made it “just about work”. So they got rewarded, and moved onwards and upwards. These people lived in a world where you got promoted by delivering more bandwidth, not application outcomes, and they delivered bandwidth by the bucket load.
Local networks go global
This virulent 1970s lab experiment then started to spread outside local area networks, and it brought its heritage along with it. IP is an over-stretched network protocol creating a global LAN, not an internetwork protocol. The Internet does not have an inter-network layer, since it lacks gateways to isolate and abstract one network from the next. As a result, the Internet is more like a concatenated “network of networks”: a cancerous overgrowth of local area networking.
With Internet Protocol, bandwidth is king. Just as in stories of old, that king has become endowed with mystical and magical properties. You can cure any and all ills through invoking that king’s magical powers.
Yet there was still a whiff of acknowledgement that these issues of scheduling had not yet disappeared from the kingdom of unlimited bandwidth. The black magic of QoS still existed, but it was very much a bit of face-paint you added to a warty packet to beautify its ugly outcome. This remedy was skin-deep, since it did not have behind it the control system or sizing rules on how to configure the network to achieve the desired outcomes.
Even speaking of application outcomes became heresy, and the concept of there being a need for any assurance was near treason. Bandwidth was to become the one true eternal king and rule forever. “Throw more bandwidth at everything” became the common mantra.
When bandwidth goes bad
Two important lessons of history started to be forgotten: everything worldly is finite, and science and mathematics are your best friends when you want to make changes in a predictable, low-risk way. Networking ceased to be a scientific process, and became a craftsman tinkerer industry. Rewards and awards came from ingenious new forms of network alchemy. Endless clever hacks were needed to fix emergent and unforeseen problems. (These problems, incidentally, are considered ‘fixed’ when you can tinker enough to get a network simulator to print the word ‘success’ at least once.)
In this kingdom of bandwidth, signs of remaining problems can always be assigned to malign external forces: stupid network operators misconfiguring equipment to cause bufferbloat; evil telco shareholders refusing to inject endless capital to increase bandwidth everywhere to solve all scheduling problems; and wicked regulators failing to ensure pure neutrality, just because these pesky scheduling issues won’t go away.
Meanwhile mobile network manufacturers were still on the hook to deliver something vaguely fit-for-some-purposes. So they re-jigged the way that they mapped their distributed super-computer interactions, and liberally applied the face paint of Internet Protocol packets. They started supporting systems where the wide area network was now founded on these young dynamic technologies. Who wouldn’t want to sprinkle their stock price with Internet Protocol’s magic pixie dust?
Equipment costs were low, and bandwidth was cheap – or at least it was, when it was all inside the same building on a local area network. What could possibly go wrong as we scaled this model up?
Not all fairy tales have happy endings
Attempts to banish the need for outcomes from the kingdom of bandwidth failed. Users will only pay for fit-for-purpose application experiences, and the bandwidth king needs subjects generating cash crops to feed his court and armies. That means the constraints of schedulability could not be ignored forever: the schedulability hazards had increased as we went from TDM to ATM to IP, as the inherent time-management capabilities of the underlying overall network decreased.
What could mobile equipment vendors and operators do in this situation? They applied the king’s mantra! “Throw more bandwidth at the problem”, they said, over and over.
So capacity requirements rose, now thrice over: once for the extra demands, once for the extra overheads, and once again for the schedulability.
We now can move on to the present day.
Return of the rule of schedulability
As the industry has gotten older, harder times have arrived. The world really is finite, and the land and serfs cannot sustain unlimited cash production. Consequently, capacity costs now need to be addressed. Infinite free bandwidth proved to be just a fairy tale, after all: even when it is “cheap”, a large amount of anything still costs a lot.
Issues of schedulability, that is, the appropriate allocation of loss and delay, can no longer be ignored: their effects influence both the interactions within the distributed supercomputer and the end-user applications running over it. These issues now have to be faced.
However, another character now enters the story. The finite capital supply means that mobile network operators want to share their resources, and in particular they want to share their backhaul, which is now perceived as expensive. Furthermore they want to live in a nirvana where if operator A is not using “its share” of bandwidth, operator B can “make use of it”, and hence reduce costs. Yes, that is the sound of tinkering that you can hear in the background.
As you may have already guessed, there is a catch, since sharing doesn’t actually create any new capacity. What it does create is a new class of schedulability hazard. Let’s look at how the revenge of schedulability plays out.
It’s my network; no, it’s mine!
Imagine two MNOs, A and B, have a 50:50 share in a link. Operator B is not particularly busy, and operator A is using (at a particular instant) 65% of that link. Along come a few more customers of operator B (and if it is 4G backhaul, just one is enough). These customers start using more capacity, pushing B up to its full 50% of the link.
Operator A, who was using 65% of the nominal link capacity, now suddenly sees its usable share effectively halved. Now operator A is offering 130% of its assigned capacity. And that means only one thing: packet delay and sustained loss, since the 30% excess over its share simply cannot be carried.
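The arithmetic of this example can be made explicit. This sketch uses the figures above; the enforcement model (B’s in-contract traffic clamping A back to its share) is an assumption for illustration.

```python
# Worked version of the example above: a link shared 50:50 between two
# operators, with operator A bursting above its share while B is quiet.
# The clamping model is an illustrative assumption.

link = 1.00       # total link capacity (normalised)
share_a = 0.50    # operator A's contractual 50% share
offered_a = 0.65  # A's actual usage while B was quiet
offered_b = 0.50  # B's customers arrive and take B's full share

# B's traffic is within its contract, so A is clamped to what remains.
carried_a = min(offered_a, link - offered_b)         # = 0.50
overload = offered_a / carried_a                     # = 1.30, i.e. "130%"
loss_fraction = (offered_a - carried_a) / offered_a  # ~23% of A's packets

print(f"A offers {offered_a:.0%} of the link but can carry {carried_a:.0%}")
print(f"A is running at {overload:.0%} of its effective capacity")
print(f"so about {loss_fraction:.0%} of A's offered packets queue or drop")
```

Note the two ways of stating the damage: the excess is 30% of A’s share, but as a fraction of what A actually offers, roughly 23% of its packets have nowhere to go.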
How does that affect those distributed control loops between all the network components? What does it do to the delivered user QoE? Does it go beyond affecting a few users’ QoE, and start degrading the system’s stability? And if it does, where does the responsibility for fixing the issue lie?
Such concerns are not isolated to this one example issue. Timing interaction problems between network elements are pervasive, and network virtualisation only exacerbates them.
Everyone loses in the blame game
The network equipment manufacturer is going to say “it’s not our fault; you didn’t assign enough bandwidth”. They have covered themselves, since they are still saying that you have to assign at least peak capacity. The backhaul supplier says “it’s not our fault; we delivered what we said we would”. They have covered themselves, since they did exactly what they were asked to do.
That means the mobile network operator finds itself in the uncomfortable position of not being able to devolve the issue to its suppliers. It was a hazard that they had responsibility for all the way along, and now they find themselves in the position of having to face up to discharging that responsibility. The tragedy is that many, if not most, mobile network operators have lost the corporate skill and collective will to safely innovate in the operation of mobile networks. Too many have become hollowed-out marketing and vendor management organisations.
Remember that there were two things slowly being forgotten, back in the kingdom of bandwidth, filled with network tinkerers. The first was a sense of a finite world and limits to resources. Network operators increasingly grasp that reality. What they now desperately need is the second item: the science and the mathematics of scheduling.
Desperately seeking schedulability skills
However, such scheduling scientists are now very rare, as it had long been decreed that there was no need for them. The old ones have retired, and spend their afternoons down at the bingo hall, in the company of stale statisticians and past-it philosophers. Nobody saw the need for new ones, so none were trained up in the old-time techniques.
So if you see a network scheduling scientist in your vicinity, you would be wise to stop him and get him to share the wisdom that is fading – should he be offering to share the secrets of his trade.
If you would like help in combating the hazards associated with a shared mobile network infrastructure, please get in touch. We can help you identify and manage the performance risks.
To keep up to date with the latest fresh thinking on telecommunications, please sign up for the Geddes newsletter.