The growing telco complexity crisis

The telecoms industry is facing a systemic problem of high operational complexity and excessive cost. We take a look at the root causes, and how to tackle them.

Every telco in the world wants both to improve the quality of its customer experience and to save money by lowering opex and deferring capex. A pervasive barrier to achieving this is complexity, which exists at many levels:

  • Managing the relationship between QoE, network performance and cost
  • Complexity between marketing, design and operations
  • Complexity between technology layers
  • Complexity of middleboxes and mechanisms
  • Complexity of metrics & processes

Growth in complexity impedes the ability of managers to make rational business decisions, and limits our technical progress.

Abstraction hides complexity

The way to tackle complexity in any domain is through abstraction. This captures whatever is relevant, and hides irrelevant variation. Yet abstraction only reduces complexity if it is appropriate.

In the context of telecoms, “appropriate” is about ensuring that when you “compose” the (abstracted) properties of all the sub-components, it delivers the necessary end-to-end service outcome.

For example, when we size telecoms circuits to deliver phone calls, we have an abstraction model called the Erlang to relate demand to supply. Whether the data is being delivered over copper, aluminium or fibre optic cable is irrelevant detail.
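As a rough illustration of what such an abstraction buys, the Erlang B formula estimates the probability that a call is blocked from just two numbers: the offered load (in erlangs) and the number of circuits. The sketch below is a minimal Python version under the standard Erlang B assumptions (random call arrivals, blocked calls are cleared rather than queued). The function names and the 1% blocking target are illustrative choices, not taken from any particular operator's practice.

```python
def erlang_b(offered_load_erlangs: float, circuits: int) -> float:
    """Blocking probability for a given offered load and number of circuits.

    Uses the standard recurrence B(E, 0) = 1,
    B(E, m) = E * B(E, m-1) / (m + E * B(E, m-1)),
    which avoids the large factorials of the closed-form expression.
    """
    if offered_load_erlangs < 0 or circuits < 0:
        raise ValueError("load and circuit count must be non-negative")
    blocking = 1.0
    for m in range(1, circuits + 1):
        blocking = (offered_load_erlangs * blocking) / (
            m + offered_load_erlangs * blocking
        )
    return blocking


def circuits_needed(offered_load_erlangs: float, target_blocking: float) -> int:
    """Smallest circuit count whose blocking probability meets the target."""
    if not 0 < target_blocking < 1:
        raise ValueError("target blocking must be strictly between 0 and 1")
    circuits = 0
    while erlang_b(offered_load_erlangs, circuits) > target_blocking:
        circuits += 1
    return circuits


if __name__ == "__main__":
    # Hypothetical sizing question: 10 erlangs of demand, 1% blocking target.
    load = 10.0
    n = circuits_needed(load, 0.01)
    print(f"{n} circuits give blocking of {erlang_b(load, n):.4f}")
```

Note that nothing in the calculation refers to the transmission medium. Demand and supply are related purely through load and circuit count, which is what makes the abstraction useful.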

Getting the right abstraction

The telecoms industry is struggling to find the right abstraction for packet data. As a result, the industry as a whole has great difficulty predicting cause and effect; network science is an emerging discipline.

This results in organisations making decisions that fail to create satisfactory QoE for customers, and appropriate costs for themselves. Indeed, many of their actions are counter to their own long-term interests.

In the absence of appropriate abstractions, complexity affects the ability of telcos to execute their core task, which is to construct supply that is fit-for-purpose for the expected demand.

Without sufficient characterisation of demand, there is no way of knowing whether supply is fit-for-purpose. And without a satisfactory means of predicting the interaction between supply and demand, in particular the underlying network performance being delivered, it is not possible to create fit-for-purpose supply.

Tactical fixes create a long-term strategic problem

Telcos are drowning in complexity because they lack an accurate picture of delivered QoE, its relationship to network performance, and the elements that contribute to it. There are endless tactical fixes to deal with the discrepancies between the management model and reality. These multiply complexity.

For instance, a common network metric is bandwidth, which is a poor proxy for QoE. Consumer “speed test” comparisons drive inappropriate marketing aspirations for more bandwidth, which in turn drive counter-productive technical behaviours in terms of QoE.

It’s not news that QoE depends on much more than “bandwidth”. Applications like online gaming require consistent low delay for a portion of their traffic. However, configuring all the elements to cooperatively achieve this goal drives both costs and ongoing management complexity.

Ever more new services are now needed to drive new revenues and spread infrastructure risk: voice, SaaS, enterprise comms, etc. Yet this is becoming more difficult over time, not less. OSS and BSS integration both remain expensive and failure-prone (creating labour-intensive operational exceptions). Every new system increases the cost multiplier for existing systems (i.e. super-linear cost scaling).

We are heading towards unmanaged complexity

If we continue on our current path, we face diseconomies of scale that overwhelm the management processes and network budget, and create operational fragility.

We will not be able to predict the value to the business of spending on capacity and upgrades; ongoing capital investment becomes more difficult to justify, both internally and externally.

There will be a deterioration in our ability to adapt to a changing future, as “fixes” from the past become constraints and liabilities. More and larger emergent failures will increase reputational risk and incentivise adoption of yet more risky mitigation approaches.

My colleagues have encountered many examples of failure due to complexity. For instance, a large fixed operator increased capacity in its network core, which increased delay and jitter in the access network, which caused gaming (which used to work) to stop working.

Another converged operator rolled out a series of managed services over a commodity broadband access network. They failed to model the hazards of real-world deployment, so they wrongly assumed the working lab prototype could be directly transplanted outside, and would be economically feasible to manage and maintain.

What we want instead: managed complexity

What we want are economies of scale combined with operational robustness. In our ideal world, the arming of performance hazards would always be visible. The interrelationships of QoE, cost and network performance would be quantified, predictable and managed.

We would have appropriate abstractions to collapse complexity. The network as a system would be assembled from predictable network objects that can be predictably composed.

In our dreams, network performance management would be automated with clear identification of root cause and resulting effect on QoE and/or cost. All network monitoring would deliver actionable information, not merely large collections of performance data points.

Our goal would be “double-loop learning”, with data driving improvement in the network management system itself. In this nirvana, human staff are used to perform value-added tasks in an environment with lower operational stress. Overtime and unsocial hours costs are reduced.

When we meet our commitments to our customers, and deliver higher value to the business, this frees network management to focus on higher-order business innovation issues.

So how to go from increasing to decreasing complexity?

It is possible to turn this tide of complexity. It requires the following:

  • A robust, quantitative and predictive model of cause and effect in the relationship of supply and demand.
  • A proper understanding of the resource ‘trading space’ you are managing, so you can relate your costs to the experience being delivered, and allocate resources rationally.
  • Knowledge of how to configure that ‘trading space’ to match supply and demand, in terms of the diversity of services sold and the accompanying levels of service assurance.

For the latest fresh thinking on telecommunications, please sign up for the free Geddes newsletter.