Why network virtualisation is hard

I co-wrote a paper for a client last week on network virtualisation, together with my colleagues at Predictable Network Solutions Ltd. You will no doubt be familiar with data centre virtualisation – i.e. the cloud. The purpose of these technologies is to allow greater sharing of an underlying resource, as well as greater flexibility in capacity and performance.

The next stage in the process is network virtualisation. This is an edited version of the paper, which explains why this is hard. In summary, networks have always been large distributed computing environments. All such systems share an underlying issue: what is the optimal location for each function, for a given set of resources and loading? You might think that this basic problem would have been solved decades ago, and would be covered in all the textbooks on distributed computing. It wasn’t and isn’t. That is both a problem and an opportunity for equipment vendors and mobile network operators (MNOs).

Networks are supercomputers

Mobile networks are becoming increasingly like large-scale parallel supercomputers. For example, we have worked on the systems used for data capture and analysis in the Higgs boson search at CERN, which face the same fundamental issues. The relative costs of the component computation and communications technologies continually change. Furthermore, the interconnection between these functions can no longer be assumed to be carried over dedicated circuits, as all traffic now runs over a common, statistically shared transmission medium. The cost structure and performance of that transmission can vary from one territory to the next.

As a result, the optimal location of each function in the distributed architecture can also change. Performance is specific to each network configuration, rather than being a generic property of protocol behaviour. As such, it is something that will not be solved by standards bodies. Indeed, many explicitly rule it out of scope as an activity.

This dynamic has created a new discipline: the performance engineering of complete distributed architectures. Critically, this is distinct from the engineering of any of the sub-components.

Finding the optimal trade-offs

This optimal location is a function of both the desired customer experience and the total cost of ownership. The customer experience depends on the quality of experience (QoE) hazards; the total cost of ownership depends on the cost of mitigating or addressing those hazards, and on the level of financial predictability that results.

This plays out differently for each part of the mobile ecosystem:

  • For MNOs: where to place caches, radio controllers, or internet breakout.
  • For content distributors: where to place delivery systems, when or whether to use multicast, and where to place transcoders (anywhere from a central site down to every set-top box).
  • For cloud-service providers: where to place the application functionality – how much local and how much remote, given that functional splitting increases implementation complexity.

In all these cases there are engineering trade-offs between cost, application performance, and the resulting QoE and financial hazards.
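
A minimal sketch makes the nature of these trades concrete. All of the figures below (costs, latencies, the 40 ms budget) are illustrative assumptions, not data from the paper; the point is only that each placement option shifts the balance between cost and delivered QoE.

```python
# Illustrative sketch (assumed figures) of a placement trade-off: choosing where
# to host a content cache by weighing deployment cost against user latency.

# (placement option, annual cost in arbitrary units, round-trip latency in ms)
options = [
    ("central data centre", 10, 80),
    ("regional PoP",        40, 30),
    ("edge / base station", 150, 10),
]

LATENCY_BUDGET_MS = 40  # assumed QoE requirement for the application

# Keep only placements that satisfy the QoE requirement, then pick the cheapest.
feasible = [o for o in options if o[2] <= LATENCY_BUDGET_MS]
cheapest = min(feasible, key=lambda o: o[1])
print(f"Cheapest placement meeting {LATENCY_BUDGET_MS} ms: "
      f"{cheapest[0]} (cost {cheapest[1]})")
```

Tighten the latency budget and the only feasible answer becomes the expensive edge site; relax it and the central data centre wins. The real engineering problem is this calculation repeated across every function, hazard and territory.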

Virtualisation’s resource trading and testing issue

The current virtualisation trend has magnified the issue of how best to allocate resources. Once a function can be located in many places, the total number of combinations becomes too high to test and validate empirically before deployment. It becomes cheaper to build the complete system than to do the testing, and you still can’t know whether it will work when deployed, or how it will fail!
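
A back-of-the-envelope calculation shows the scale of the problem. The numbers below (twelve relocatable functions, six candidate sites each, half an hour of validation per configuration) are purely illustrative assumptions, but the conclusion is robust to any plausible choice.

```python
# Back-of-the-envelope illustration of the combinatorial explosion; all numbers
# are assumptions chosen only to show the scale of the problem.

functions = 12              # assumed number of relocatable network functions
locations = 6               # assumed candidate placement sites per function

configurations = locations ** functions      # every possible placement combination
print(f"Placement combinations: {configurations:,}")        # ~2.2 billion

test_hours_per_config = 0.5                  # assumed validation effort per combination
machine_years = configurations * test_hours_per_config / (24 * 365)
print(f"Exhaustive test effort: {machine_years:,.0f} machine-years")
```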

As currently built, networks behave badly as resources saturate under load. Network engineering has therefore always been, implicitly, about knowing the predictable region of operation and keeping the system well within it. That discipline is under threat from cost pressure to intensify resource use, and comes under further strain from the complexity and freedom of action that virtualisation brings. The hazards can push the system outside of that predictable region, causing both localised and widespread failure. This is similar to the way that a power grid fault can cause localised, regional and cascading outages.
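
A standard textbook result illustrates why behaviour degrades so sharply near saturation. In a simple M/M/1 queue (an illustrative model, not a claim about any particular network), the mean time in system is W = 1/(μ − λ), which grows without bound as utilisation approaches 100%:

```python
# Illustration only: mean time in system for an M/M/1 queue, W = 1 / (mu - lambda),
# showing how delay explodes as a resource approaches saturation.

service_rate = 1000.0       # assumed service capacity, packets per second

for utilisation in (0.5, 0.8, 0.9, 0.95, 0.99):
    arrival_rate = utilisation * service_rate
    mean_delay_ms = 1000.0 / (service_rate - arrival_rate)   # seconds converted to ms
    print(f"load {utilisation:4.0%} -> mean delay {mean_delay_ms:7.2f} ms")
```

Doubling the load from 50% to 99% does not double the delay: it multiplies it fifty-fold. The "predictable region" is the flat part of that curve.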

As such, new analytic techniques are needed. This is creating demand for distributed architecture skills that were once highly sought after (a quarter of a century ago) in the narrow domain of parallel processing and supercomputing. The problems are not new, but the scale is, and the consequences of not dealing with them are unprecedented.

Learning from extreme environments

This need for predictive analytic techniques is exactly what we found when working on the US Department of Defense’s Future Combat Systems project. In a military deployment, a network might only be used for a few days. It therefore had to be capable of being planned, from scratch, within 24 hours. Once in operation it was used for several safety-of-life functions where unplanned failure was unacceptable. Such extreme demands could not be satisfied with traditional telecoms approaches. Indeed, it turned out that the whole project was infeasible given the technology available at the time.

In another instance during the 1990s, one of our team was a domain expert in the safety-critical systems research centre at the University of Bristol. This endeavour was supported by the nuclear power, national railway and air traffic control services. A new national air traffic control system was being deployed, and in preparation for the switch-over a component of the old one was being upgraded.

One fateful Monday morning, this minor system upgrade went live. It was applied to a proven system that had been tested extensively, with redundant sites in Manchester and London. That system happened to be operating over a shared link.

Simultaneously, another legacy system failed. This system had been running since the 1960s. The failure was caused by an unexpected interaction between the traffic flows: the traffic had exceeded some implicit schedulability constraint and activated a latent bug. Nobody could even contemplate fixing it, because the source code had been lost. (In a mobile network, imagine this is another vendor’s box, over which you have no direct control.)

The end result was that the whole air traffic control system went down for half a day and fell back to manual operation, with the consequent passenger delays and cancelled flights. It should be noted that the system that failed had worked perfectly before. The engineers had checked that there was sufficient bandwidth and that they were not at the capacity limit. It was only the minor change in behaviour that created this massive outcome.

Virtualisation causes a large rise in complexity

In the past, network engineering was a simpler discipline. You had dedicated allocation of resources to your function. You were independent of what else was going on, and could safely assume complete performance isolation. Any performance problems were under your control.

These military and air traffic engineers used best practice, and didn’t do anything wrong by the book – and their systems still failed. Why? They didn’t know how to reason ex ante about these systems. As a result, new analytic frameworks were developed to allow the predictable region of operation and the failure modes to be modelled.

One of the key insights from these projects was to understand the difference between schedulability and capacity. Both the communication and computation resources are finite, and are constrained by both these factors.

We can no longer assume a “dedicated lambda”, allocate for peak bandwidth, or give every process exclusive use of a single CPU. Instead, a more complex range of interactions needs to be modelled against these two constraints. It’s all about multiplexing: can my demand be scheduled so that it gets the resources it wants in the timescales it requires?
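
A toy example makes the distinction concrete. All the numbers below are assumptions chosen for illustration: a flow whose average rate fits comfortably within the link’s capacity can still miss its timing requirement when its bursts coincide with competing traffic.

```python
# Toy illustration (assumed numbers) of capacity versus schedulability: average
# load is well under the link capacity, yet a burst still breaks the deadline.

LINK_CAPACITY = 100         # units of work the link can serve per millisecond
DEADLINE_MS = 5             # assumed timeliness requirement for the bursty flow

# Assumed offered load per millisecond over a 50 ms window:
steady = [40] * 50                      # a constant background flow (40% load)
bursty = [0] * 45 + [180] * 5           # averages 18% load, but arrives in a burst

backlog = 0
worst_delay_ms = 0.0
for s, b in zip(steady, bursty):
    # Queue builds whenever offered work exceeds what the link can serve.
    backlog = max(0, backlog + s + b - LINK_CAPACITY)
    worst_delay_ms = max(worst_delay_ms, backlog / LINK_CAPACITY)

average_load = (sum(steady) + sum(bursty)) / (len(steady) * LINK_CAPACITY)
print(f"Average load: {average_load:.0%} of capacity")          # 58%: capacity looks fine
print(f"Worst queueing delay: {worst_delay_ms:.1f} ms "
      f"(deadline {DEADLINE_MS} ms)")                           # 6.0 ms: deadline missed
```

A capacity check says the link is barely half full; a schedulability check says the application’s timing requirement is broken. Both constraints have to hold.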

As a result, the architectural design space has ceased to be small, regular in structure, and constructed from a single technology vendor. Instead, it is now large, irregular, and involves interactions of sub-systems from many vendors. The component vendors will blame the systems integrator for problems, and vice versa.

This technical and organisational complexity is a general challenge that the telecoms world is collectively moving towards.

The distributed systems skills gap

Regrettably, the telecoms industry has yet to grasp this issue, either conceptually or practically. In much of our recent work we have seen network system designs where the fault-resolution process has been to add capacity in order to address what are really issues of schedulability. Increasingly, that approach has run its course: costs become prohibitive, and in extremis it can even exacerbate the failure modes.

We observe endemic effects of the industry’s failure to engage with the schedulability constraint. This is causing problems for fixed-line telcos, mobile operators and large corporate networks. Although our speciality is ex ante network performance engineering, we are often called in post mortem: we are asked to help clients understand why things are failing in an (apparently) unpredictable manner. Modern network deployments increasingly resemble that 1990s air traffic control failure. 4G networks, which introduce a slew of new interactions, are particularly prone to these issues.

Often we find that many of the ‘fixes’ that have been deployed have just moved the network’s apparent failure from one location to another. As the ‘issue’ jumps from one organisational silo to another, it creates substantial tensions between different groups. A destructive blame game ensues.

Isolate issues and allocate responsibility

What we have found is that the application of the appropriate mathematical and performance engineering understanding can resolve the technical issues, which in turn helps the social ones. The crucial step is to be able to decompose the system, and know which specific interactions of sub-systems are causing the performance issues. Additionally, we can understand the trade-offs being made, such as between additional computation and communication, and optimise for specific cost or user experience outcomes.

At this point, using the ex ante techniques, it becomes possible not just to understand cause and effect, but to quantify their relationship. Hence we can make recommendations for changes and interventions whose benefits can be quantified in direct business terms. Alternatively, we can show definitively why the expectation cannot be met.

Conclusions

People like to talk about networks getting ‘smarter’. What is really happening is that increased functionality is turning networks into ever larger distributed computing environments. Performance engineering for networks comes down to the ability to create, and operate, these complex systems with a high degree of predictability. That has to be expressed both in terms of delivered quality of experience and total cost of ownership.

In the past, the portion sold as ‘the network’ kept its control plane data separate and isolated from its user data (e.g. PSTN, TDM, 2G GSM). As the boundary of ‘the network’ grows, it incorporates ever more control functions into a single multiplexed mix. Mobile networks lead this trend, and the increasing use of virtualisation in service delivery creates multiple new layers, which interact with one another and generate new emergent outcomes.

As a result, the region of predictable and stable performance becomes difficult to determine. Techniques for network design and planning used prior to virtualisation can lead to unstable and unsuitable architectures when applied to virtualised systems. The industry as a whole has a deficit of the skills necessary to deliver reliable virtualised networks with a predictable cost of ownership.

The techniques we have developed in extreme military and industrial environments, and derived from parallel processing and supercomputing, can solve these issues. However, the number of people on the planet who understand the principles and mathematics to do so is very small. That number must grow in order for the next stage in mobile networking to be a success for operators and users.

We can help you deliver reliable virtualised networks with a predictable cost of ownership. Please get in touch to find out how.

To keep up to date with the latest fresh thinking on telecommunications, please sign up for the Geddes newsletter.