Over-provisioning bandwidth doesn’t solve QoE problems

The chart below is, I believe, one of the more important ones ever produced in the history of packet networking. It was generated by my colleagues at Predictable Network Solutions Ltd, based on measurements taken from Kent Public Service Network (KPSN), and reproduced with their kind permission. It is the first publishable direct comparison of standard network management utilisation data against a strong user quality of experience (QoE) proxy. This data is important because it contradicts a common belief about networking.

The widespread assumption is that as long as you have enough bandwidth, any remaining quality issues can be readily addressed by allowing some ‘slack’ excess capacity. This is called ‘over-provisioning’. Its presumed effectiveness is the bedrock of investment plans in 4G and fibre. The problem is that over-provisioning is both ineffective and inefficient.

This is invisible to network operators, because their metrics fail to offer the necessary QoE information. This in turn costs the telecoms industry an unknown (but very large) sum in wasted capital expenditure and churn. Our engagements with other communications service providers (CSPs) have shown this to be a common phenomenon. This one chart calls into question many of the basic assumptions on user value and return on capital of network planners and investors.

KPSN is one of the best-run networks in the UK, and delivers broadband services to schools, hospitals, council offices, etc.; it has around 250,000 individual users in Kent spread over 1,100 sites. KPSN costs (in our estimation) about 60%-80% less to run than an equivalent managed service from a major systems integrator. (How they achieve this feat will be the subject of a future article.)

As part of their expansion and growth plan, they asked my colleagues to predict the QoE and cost effect of adding further users and uses (e.g. voice) to the network. A key step was to establish a baseline for the current QoE on offer, and measure what performance hazards were armed, or had already matured into application failures.

The chart below compares two sets of measures for one 100 Mbps fibre link, in one of the collector arcs of the network, taken over a five-day period.

Measure of performance hazards


The green line is the standard 5-minute average link utilisation data that every packet switch offers by default. The red crosses are a robust synthetic metric that represents the risk of failure of a national phone call, based on a ‘performance budget’ sub-divided and attributed to this link. (The measurement techniques required are described here.)

As you can see, there is a correlation between when the network is busy, and when QoE is at risk. However, there are two key things of interest in this data, highlighted below.

Realities of over provisioning


The first is that there are big spikes of usage (in the evenings), but the QoE of the user application is not at risk. That means any network planning rules that use that data would force inappropriate upgrades that just waste money for no user gain.

The other is that there are potential QoE failures at even extremely low average loads (under 0.01%, i.e. over-provisioned by a factor of 10,000). Hence over-provisioning is not an appropriate way of solving QoE issues that are due to poor scheduling choices. Operators are either throwing capacity and capital at these problems without resolving them; or users are having QoE failures that are not visible to operators, and are churning.
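To see how a link can be nearly idle on average and still put QoE at risk, here is a minimal arithmetic sketch. It is not the measurement method used for the chart. The 100 Mbps link speed and the 5-minute window come from this article; the burst size and the simple first-in-first-out queue are assumptions chosen purely for illustration.

```python
# Illustrative numbers only. The link speed and averaging window come from the
# article; the burst size and the FIFO queue model are assumptions.
LINK_BPS = 100_000_000   # 100 Mbps link
WINDOW_S = 300           # 5-minute averaging window
BURST_BYTES = 250_000    # hypothetical burst arriving back-to-back


def average_utilisation(bytes_sent: int, link_bps: int, window_s: int) -> float:
    """Fraction of link capacity used, averaged over the whole window."""
    return (bytes_sent * 8) / (link_bps * window_s)


def drain_time_s(queued_bytes: int, link_bps: int) -> float:
    """Time for the link to transmit everything already queued ahead (FIFO)."""
    return (queued_bytes * 8) / link_bps


# Assume the burst is the only traffic in the window.
util = average_utilisation(BURST_BYTES, LINK_BPS, WINDOW_S)
# A packet arriving just behind the burst waits for the whole burst to drain.
wait = drain_time_s(BURST_BYTES, LINK_BPS)

print(f"average utilisation over window: {util:.4%}")
print(f"over-provisioning factor: {1 / util:,.0f}x")
print(f"queueing delay behind the burst: {wait * 1000:.0f} ms")
```

Under these assumptions the 5-minute average comes out below 0.01%, yet a voice packet unlucky enough to arrive just behind the burst waits about 20 ms on this one link. Whether that matters depends on how much of the end-to-end performance budget this link has been allocated, which the average utilisation figure cannot tell you.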

The telecoms industry is like the proverbial drunk searching for its lost customer experience keys under the street lamp. It’s easiest to look at the single-point average bandwidth data the network gives you for free, rather than the strong QoE proxy data that requires multi-point measures.
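As a rough illustration of why multi-point data is needed, the sketch below matches packets seen at two observation points and reports the share that were lost or arrived later than a delay budget. This is a simplified stand-in, not the technique my colleagues use. The function names, the packet ids, the timestamps and the 5 ms budget are all hypothetical, and the code assumes the clocks at both points are synchronised.

```python
from typing import Dict, List, Optional


def one_way_delays(sent: Dict[int, float],
                   received: Dict[int, float]) -> List[Optional[float]]:
    """Per-packet delay between two observation points, in seconds.

    `sent` and `received` map a packet id to its timestamp at each point.
    A packet never seen at the second point is recorded as None (lost).
    """
    return [
        received[pid] - t_sent if pid in received else None
        for pid, t_sent in sorted(sent.items())
    ]


def fraction_over_budget(delays: List[Optional[float]], budget_s: float) -> float:
    """Share of packets that were lost or exceeded the delay budget."""
    if not delays:
        return 0.0
    bad = sum(1 for d in delays if d is None or d > budget_s)
    return bad / len(delays)


# Hypothetical observations: packet 2 is delayed, packet 3 is lost.
sent = {1: 0.000, 2: 0.020, 3: 0.040, 4: 0.060}
received = {1: 0.001, 2: 0.045, 4: 0.061}

delays = one_way_delays(sent, received)
risk = fraction_over_budget(delays, budget_s=0.005)
print(f"packets outside budget: {risk:.0%}")
```

None of this information is available from a byte counter on a single interface: it only knows how much was sent, not how long each packet took or whether it arrived.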

Network performance measures


The bottom line is that average bandwidth utilisation is a poor proxy for user application QoE. The industry is using network-centric metrics that fail to reflect the customer experience, and is building and operating networks based on assumptions that are only weakly founded. This wastes capital, and drives unnecessary churn.

“If you do not change direction, you may end up where you are heading.” ―Lao Tzu


Comments

  1. Martin – nice post – though I’m not sure I agree with your statement “users are having QoE failures that are not visible to operators, and are churning”, much as I would like to.

    People have grown surprisingly tolerant of dropped/bad mobile calls, Facetime or Skype video calls that are, frankly, often just a bit crap; web pages that need to be reloaded to get them to work, etc. Customers don’t really think that changing network provider will make a lot of difference anyway… unless their ISP has *seriously* underinvested for some time. QoE is a bit of a vague and fuzzy thing, right… I couldn’t give you “marks out of 10” for my ISP (is it them? is it the server? is it my Wi-Fi interference or signal strength?) and I couldn’t tell you what “marks out of 10” for experience I would find myself giving one of their competitors.

  2. Martin Geddes says:

    The tolerance of users to failure varies a lot, but the more valuable the application (and the revenue potential to an operator), the more that dependability becomes critical. MNOs have spent a fortune tracking call drops and remedying the black spots. For TV-like experiences, you can’t afford many buffering events before users give up. 2-way video only works when you have cognitive absorption, which is very sensitive to network performance. And so on.

    There are many things that contribute to QoE, but when it comes to network performance the rest can be treated as exogenous and constant. Where bad experiences are due to inadequate network performance, you can’t fix it by offering a bigger screen and long-lasting battery, for instance.

    There’s plenty of data out there on Web site page loads, voice and video showing how users quickly abandon an application when performance is poor.

  3. I wonder how much of this is due to the kind of AQM in place and/or buffer bloat. In those cases, flows with low bitrates that are latency-sensitive can be affected even at moderate usage levels.

