The struggle to make software-defined quality of service (SD-QoS)

Nearly a decade ago, I was busy using the Telco 2.0 platform to preach visions (mixed in with occasional hallucinations) of strategic telecoms futures. I explained [PDF] how the industry needed to “slice and dice” its network resources up into different quantities and qualities.

The purpose behind this chopping and cutting is to assign the right “quantity of quality” to different uses under software control. This allows transmission resources to be used more intensively, whilst offering the appropriate performance to users. Together these raise carrier profitability.

This vision has partly come to pass with technologies like software-defined networking (SDN) and SD-WAN. The limitation of these is that they are, if you will allow me to exaggerate only a little, entirely tied to a “quantity” resource model of networks, namely “bandwidth”. “Quality” is notable by its absence.

Being always wary of emerging hype (especially mine), my industry colleague Dean Bubley at the time reported a quip from an associate of his. “The telecoms industry has only two problems with a business model for quality: telcos don’t know how to sell it, and customers don’t know how to buy it.” Regrettably, that was (and still is) only too true.

Yet it turns out that the situation is even worse than that, although it took me another half a decade to realise how bad it is. Not only does the telecom industry not have an established means to buy and sell quality, it doesn’t really even know how to make it!

(For the visually-minded reader, now is a good moment to picture a face with an “OMG! What a horror!” look on it.)

And even if it could make it, it doesn’t yet have a robust scientific standard to measure it. And even if it could measure it, it remains troublesome how to price it. In the world of telecoms, if you can’t find a billable event for something, then it doesn’t exist.

This would all be great and untroublesome if quality was unimportant, and quantity was all that mattered. Sadly, quality is the one thing the telecoms industry makes that nobody else can: timely information delivery.

You can get a huge quantity of information to anywhere on the planet at low cost with a hard drive in a FedEx box. The drawback is that the “package delay” is measured in hours and days. Telcos only exist to outpace the post. That light-speed timeliness is synonymous with quality.
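
To put rough numbers on the comparison, here is a back-of-the-envelope sketch. The figures (a 10 TB drive, next-day delivery, a 100 ms network path) are my own illustrative assumptions, not measurements.

```python
# Back-of-the-envelope comparison of "quantity" versus "quality" (illustrative
# numbers only: the drive size, shipping time and network delay are assumptions).

DRIVE_BYTES = 10e12            # assume a 10 TB drive in the FedEx box
SHIPPING_SECONDS = 24 * 3600   # assume next-day delivery (~24 hours)
NETWORK_DELAY_SECONDS = 0.1    # assume ~100 ms for a long-haul network path

throughput_mbps = DRIVE_BYTES * 8 / SHIPPING_SECONDS / 1e6
print(f"Effective throughput of the box: {throughput_mbps:.0f} Mbit/s")
print(f"'Package delay' of the box:      {SHIPPING_SECONDS:,} s")
print(f"Delay of the network path:       {NETWORK_DELAY_SECONDS} s")
# ~926 Mbit/s of raw quantity, but the first bit arrives a day late; the
# network's advantage is roughly a million-fold improvement in timeliness.
```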

So let’s go on a meandering odyssey together into the world of quality of service (QoS) to understand why ‘software-defined quality of service’ (SD-QoS) is going to be a hot topic in the very near future.

In the Cambrian period of the datacoms industry (which takes us from around the 1960s to the early 1990s) we had what you might call “hardware-defined quality”. Lots of “big iron” equipment was designed by clever PhD engineers working to rigorous models developed by bespectacled postdoctoral mathematicians. (You can sometimes tell this type of over-serious person, as their idea of a drinking game is daring to have a coffee after 3pm.)

Industry standards bodies and equipment vendors pushed out the blueprints for TDM and ATM switches. These were flogged to nationalised telcos who still had hardcore engineering departments. (Apparently there was a time before telcos only did vendor management and marketing. I missed that party due to being underage.) These network engineers had the tools and techniques to get quality under good control.

The defining characteristic of these older networks was (and still is) the ability to create a circuit. In the archetypical case of a phone call, it might take a few seconds to align all the resource ‘time slots’ along some path. Once that is done, they are dedicated only to you.

To give those resources to someone else would take a few more seconds of signalling and reconfiguration. You can think of this as being “low-frequency trading” for network resources, since the frequency of change was at very human timescales.

Then in the early 1970s someone had the whizzy idea of statistically sharing the network resources using packet data. You could pack a lot more data into the same transmission resource by turning it into datagrams.

The Jurassic datacoms technology of TCP/IP had come to life, and became mainstream from the early 1990s. This brought a slew of new issues to the industry, since quality was the result of a lot of random things interacting in rather hard-to-model ways. (You can always blame the French if American technological pride is being hurt.)

With packet data we moved to what you might call “software-undefined quality”. Software algorithms in routers could jiggle about the order of the packets, and schedule them in any way you could think of. This is “high-frequency trading” for network resources.
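
To make that packet-by-packet “jiggling” concrete, here is a minimal toy sketch of a strict-priority scheduler. It is my own illustration of the general idea, not any real router’s code: the allocation decision is remade at every single transmission opportunity.

```python
import heapq
from collections import namedtuple

# Toy strict-priority scheduler: at every transmission opportunity the router
# picks whichever queued packet currently has the highest priority. The
# resource allocation decision is remade packet by packet, at sub-millisecond
# timescales -- the "high-frequency trading" described above.

Packet = namedtuple("Packet", ["priority", "seq", "payload"])

class PriorityScheduler:
    def __init__(self):
        self._heap = []
        self._seq = 0   # tie-breaker so equal-priority packets stay in order

    def enqueue(self, priority, payload):
        self._seq += 1
        heapq.heappush(self._heap, Packet(priority, self._seq, payload))

    def dequeue(self):
        return heapq.heappop(self._heap).payload if self._heap else None

sched = PriorityScheduler()
sched.enqueue(2, "bulk download chunk")
sched.enqueue(0, "voice frame")       # lower number = more urgent
sched.enqueue(1, "video frame")
print([sched.dequeue() for _ in range(3)])
# -> ['voice frame', 'video frame', 'bulk download chunk']
```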

However, there was no clear and obvious way to guarantee the resulting quality. You generally set things up, turned the network on, and fiddled until it kind of did the right thing. If in doubt, run the network idle to avoid contention. Indeed, skip straight past the doubt stage, and overprovision everything like crazy all the time right from the get-go.

By coincidence, the mathematicians of the era grew more beards and drank more beer. Make of this fabricated fact what you will.

The difference between “low-frequency” and “high-frequency” trades is what my colleague Dr Neil Davies describes as timescales that are “elastic” vs “ballistic”. At “ballistic” timescales you have to make resource allocation choices without recourse to anything external to the network (or even network element). There just isn’t time to go outside and ask!

At “elastic” timescales, the opposite is true. For example, a switch for a phone call can go and do a variety of external database lookups to decide where to route the call. This is normal when, say, forwarding calls to handle number portability or roaming.

In the distant past, the way TDM and ATM were engineered gave you predictable end-to-end quality. TDM effectively avoided managing “high-frequency trades” entirely, and ATM kept them under tight control. In both cases loss was, by default, a fault.

Internet Protocol, on the other hand, as typically configured and deployed (i.e. “best effort”) gives unpredictable quality. Loss is intrinsic to its operation, and is a feature. This is important, as we shall soon see.

Now we’ve laid the groundwork we can have a meaningful discussion about quality of service. So why is ‘quality’ so darned hard to get control of in packet networks? Why can’t we yet make software-defined quality?

The answer is basically that we’ve botched the science, maths, engineering and technology. Other than that, it’s all OK, with the possible exceptions of the economics, regulation, and marketing of quality. Anyhow, I digress. Let’s take these subjects in turn (and this is only a small selection of the issues):

  • Science: We’ve not captured the basic things that are always true about the world (i.e. the invariants of quality). Our quality models don’t reflect reality very well, as they have too much “junk and infidelity”. Don’t even ask about the metrics we’ve picked.
  • Maths: We don’t have an algebra of quality with the right compositional properties, or that also takes appropriate account of loss. We’ve failed to describe the resource “trading space” correctly, especially for the high-frequency trades and where packet loss happens.
  • Engineering: We don’t have a requirements modelling language, or a means of refining a requirement into concrete execution. We’ve become dependent on an anti-engineering ethos of “purpose for fitness”: build stuff and then see what it’s useful for.
  • Technology: We inappropriately chose work-conserving queues (big mistake; a toy sketch of the difference follows this list). We co-mingle degrees of freedom (i.e. can’t trade between loss and delay properly). Oh, and pretty much everything in Internet Protocol is a pessimal design [PDF] when it comes to controlling quality. Have a nice day.
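
To unpack the “work-conserving” jargon: a work-conserving queue never lets the link sit idle while a packet is waiting, whereas a non-work-conserving (e.g. paced) queue may deliberately hold back, which is what lets you shape delay. The sketch below is my own toy model under simplified assumptions (fixed service time, a single link), not anyone’s production scheduler.

```python
# Minimal contrast between a work-conserving queue and a paced one.
# Times are in seconds; each packet takes 'service_time' to transmit.

def work_conserving(arrivals, service_time):
    """Serve each packet as soon as the link is free. arrivals: sorted times."""
    departures, link_free = [], 0.0
    for t in arrivals:
        start = max(t, link_free)           # never idle if a packet is waiting
        link_free = start + service_time
        departures.append(link_free)
    return departures

def paced(arrivals, service_time, interval):
    """Release at most one packet per 'interval', even if the link is idle."""
    departures, next_slot = [], 0.0
    for t in arrivals:
        start = max(t, next_slot)           # may idle the link on purpose
        next_slot = start + interval
        departures.append(start + service_time)
    return departures

arrivals = [0.0, 0.1, 0.2, 5.0]             # a small burst, then a gap
print(work_conserving(arrivals, 1.0))       # [1.0, 2.0, 3.0, 6.0]
print(paced(arrivals, 1.0, 2.0))            # [1.0, 3.0, 5.0, 7.0]
```

Note how the paced queue turns the arrival burst into evenly spaced departures, at the price of deliberately idling the link: that willingness to “waste” capacity is exactly the degree of freedom a work-conserving design gives up.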

To put it another way, when it comes to quality on packet data, we’ve not yet got a reliable general means adopted by industry to:

  • capture the customer’s application or service performance requirement, and
  • describe that to the network as a “quantity of quality” requirement for every element (sketched after this list), and then
  • execute that requirement to known tolerance for failure.
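
As a purely hypothetical illustration of what such a “quantity of quality” requirement might look like in software, here is a sketch of a delay-percentile-plus-loss bound checked against measured samples. The field names, thresholds and sample values are all my own invention, not an industry schema.

```python
from dataclasses import dataclass

# Hypothetical "quantity of quality" requirement: a delay bound at a given
# percentile plus a loss bound, checked against measured samples.

@dataclass
class QualityRequirement:
    percentile: float      # e.g. 0.99 -> 99% of delivered packets...
    max_delay_ms: float    # ...must arrive within this many milliseconds
    max_loss: float        # tolerated fraction of offered packets lost

    def is_met(self, delays_ms, sent):
        """delays_ms: delays of delivered packets; sent: packets offered."""
        loss = 1 - len(delays_ms) / sent
        if loss > self.max_loss:
            return False
        within = sum(d <= self.max_delay_ms for d in delays_ms) / len(delays_ms)
        return within >= self.percentile

voice = QualityRequirement(percentile=0.99, max_delay_ms=50, max_loss=0.001)
samples = [12, 18, 9, 31, 44, 22, 15, 27, 38, 11]   # made-up measurements (ms)
print(voice.is_met(samples, sent=10))                # True for this toy sample
```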

This is, I think you’ll agree, somewhat disappointing. So when it comes to delivering predictable quality on broadband, it’s all a bit rotten and rubbish today, I’m afraid to say.

This wasn’t such a big deal until recently. Over-allocation of resources absolutely everywhere was cheap, due to the one-time capacity leap to fibre-optic cables and DWDM. You could also set things up, and once everything was working, put a lock on the cabinet door and insist nobody touches the dials. Demand just didn’t change that quickly, so supply didn’t need to adjust itself continually.

As we’ve moved to packet data, we’ve done OK in the network core. There is huge statistical aggregation and smoothing of the “high-frequency trades”. We can reuse many of the concepts and techniques from the circuit past, and the quality of the network is generally stable and consistent. SDN and SD-WAN are really old technological wines in new virtualised flasks that can cope with this kind of environment.

That’s all gone a bit wrong in the Holocene datacoms era with the move to software-defined everything, which now also extends to the network edge. It is at the edge where we have highly disaggregated flows, lots of contention, and buckets of packet loss. And a lot of cost that telcos want to reduce.

We now want to stick centralised resource controllers in networks, and put all of the resource trades at all timescales under some kind of orchestration. The local exchange “cabinet door” has been opened, albeit virtually. These controllers need to coordinate the “ballistic” with the “elastic” network properties, so as to deliver very specific customer experience outcomes.

That means we’re facing the really tough challenge we’ve so far avoided: to get the core and edge to work together to deliver an end-to-end outcome. This means fixing the science, maths, engineering and technology. Plus the economics, regulation and marketing. This isn’t going to be quick.

Whilst there are a few of us pioneering new ways of doing things, the mainstream is going to be left floundering for a long time. There is already an unpleasant gap between the performance aspirations of customers and our collective ability to deliver for things like streaming video, VR and unified comms. This is likely to grow as we move to an “Internet of everything” and all businesses “go digital”.

The only way of raising our collective game is to engage in a somewhat difficult transformation. We all need to move from “hardware-defined quality” and “software-undefined quality” to a new engineering model.

My past bet was on “slice and dice”. That became SDN, although it only “slices” by quantity, doing half the job. My new bet for the next decade is on “software-defined quality”, which will be sold as SD-QoS. That’s the missing “dice” part.

For the latest fresh thinking on telecommunications, please sign up for the free Geddes newsletter.