How to X-ray a telecoms network

When a human feels a persistent pain, they go to the doctor, who will attempt to diagnose its cause. One of the great advances in medicine was the ability to cheaply and non-invasively see inside the body using X-rays, resulting in rapid and accurate diagnosis.

When a telecoms network is sick due to poor performance, most existing diagnostic techniques amount to prodding the patient and asking whether it hurts much. The ability to quickly isolate issues with confidence is generally lacking. What we want is to be able to get the equivalent of an X-ray for a network to make a good diagnosis.

I recently published a presentation on the need for such high-fidelity measurement of network performance. To bring that to life, I’d like to share with you how my colleagues at Predictable Network Solutions Ltd practically go about obtaining a ‘network X-ray’. This process gathers uniquely detailed and powerful insight into network performance.

Why measure at all?

As background, it’s worth remembering why we measure network performance at all. We care about two things: customer quality of experience (QoE), and cost. QoE depends on many things, all of which have to come together, and any of which can sabotage the customer’s happiness. The network has a single, necessary, contribution to QoE: it enables the performance of applications.

Since we want acceptable performance for the minimal cost, we measure networks for two fundamental reasons:

  • Are we delivering the performance needed for an acceptable QoE (and if not, why not)?
  • How much resource ‘slack’ do we have, so we can manage costs?
Measuring QoE directly is costly and invasive. It’s bad enough having Skype ask “how was it for you, honey?” after every call. Imagine if that were the case after every network interaction! Therefore, when it comes to capacity planning, you are only as good as the QoE proxy data you measure.

If the measure is a strong proxy, then you can ensure you only add capacity when it will make a real difference. If your measure is a weak proxy, then you may either be delivering a poor QoE without knowing it, or adding capacity for no benefit.

What to measure?

The only network measure that is also a strong universal QoE proxy is ‘quality attenuation’ (abbreviated to ΔQ); I’ll be posting a primer on it in the near future. ΔQ represents the distribution of packet loss and delay along a path. Capturing that information requires multi-point measurement, i.e. observing the same packet as it passes multiple observation points.
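
To make that concrete, here is a minimal Python sketch of the joining step, using invented field names and toy values (the real pipeline’s formats differ): because every probe sees the same marked packet, timestamps can be joined on a packet identifier and differenced to give a delay sample, plus a loss count, for each segment of the path.

```python
from collections import defaultdict

# Toy multi-point observations: (probe_id, packet_id, timestamp in seconds).
# Values and field names are invented for illustration.
observations = [
    ("A", 1, 0.000), ("B", 1, 0.012), ("C", 1, 0.031),
    ("A", 2, 1.000), ("B", 2, 1.015), ("C", 2, 1.029),
    ("A", 3, 2.000), ("B", 3, 2.011),            # packet 3 lost after B
]

# Join the observations of each packet across probes.
by_packet = defaultdict(dict)
for probe, pkt, ts in observations:
    by_packet[pkt][probe] = ts

# Difference timestamps per segment; a packet seen entering a segment but
# never leaving it counts as a loss. Loss and delay together form ΔQ.
segments = [("A", "B"), ("B", "C")]
delays = {seg: [] for seg in segments}
losses = {seg: 0 for seg in segments}
for seen in by_packet.values():
    for src, dst in segments:
        if src in seen and dst in seen:
            delays[(src, dst)].append(seen[dst] - seen[src])
        elif src in seen:
            losses[(src, dst)] += 1

for seg in segments:
    print(seg, "delays:", delays[seg], "losses:", losses[seg])
```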

This contrasts strongly with standard network measurement and analytics approaches, which take single-point measures. These may be taken at multiple locations, but they observe different packets, and the captured data is then (typically) averaged. In the process you lose critical fidelity and introduce new measurement artefacts. The end result is only a weak QoE proxy – a mistake that costs the telecoms industry a fortune in churn and capex.
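
A toy example shows why. The two links below have identical average delay but wildly different tails, and an interactive application experiences the distribution, not the mean. (Synthetic numbers, purely illustrative.)

```python
import statistics

# Two links with the same mean delay but very different distributions.
steady = [20.0] * 100                  # 20 ms, every time
bursty = [11.0] * 90 + [101.0] * 10    # usually quick, occasionally awful

for name, samples in [("steady", steady), ("bursty", bursty)]:
    ordered = sorted(samples)
    mean = statistics.mean(ordered)
    p99 = ordered[int(0.99 * len(ordered))]
    print(f"{name}: mean = {mean:.1f} ms, 99th percentile = {p99:.1f} ms")

# Both report a 20 ms mean, yet the bursty link's 99th percentile is 101 ms:
# averaging has erased exactly the information that determines QoE.
```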

There are plenty of vendors out there today doing network performance optimisation, often sold under the name of “big data analytics”. The best way to think about their technology is that they are well-paid capex liposuction surgeons, but with shaky hands and bad eyesight. You are taking on a lot more risk in this process than you imagine. Sure, they can make your capex budget thinner, but what will the effect be on the health of the customer experience as they jab their surgical instrument into some vital organ?

How to do multi-point measurement

Here’s how we go about doing multi-point measurement to enable safe diagnosis and effective optimisation. There are three basic ingredients to the process:

  • A low-bitrate test data stream.
  • Probes capturing timing data of that data stream along the path.
  • Processing to turn this into a ‘network X-ray’ for analysis.

The test data stream injects special ‘golden packets’ at the network edge, using a laptop or a cheap little box like one of these:

Network test data generator

The data stream is created with a special set of statistical properties to ensure the end results are valid. (There’s some secret sauce here…) These packets are marked in a way that makes them easy to distinguish, so we never have to observe any customer data. However, they experience the same quality attenuation as the customer flows, a bit like how a smoke particle is jiggled about by air molecules in the Brownian motion you saw in your school physics lessons.
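
For flavour, here is a minimal Python sketch of a test stream generator. The marker value, packet layout and addresses are all invented for illustration, and the real stream’s statistical design is exactly the secret sauce mentioned above, so treat this only as the general shape of the thing:

```python
import os
import random
import socket
import struct
import time

MARKER = 0xDE1A0001          # hypothetical tag that lets probes spot our packets
DEST = ("192.0.2.1", 9000)   # documentation address; stands in for a real sink

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
seq = 0
try:
    while True:
        # Marker + sequence number + send timestamp, plus a little padding.
        payload = struct.pack("!IQd", MARKER, seq, time.time()) + os.urandom(32)
        sock.sendto(payload, DEST)
        seq += 1
        # Randomised spacing (about 10 packets/s, roughly 4 kbit/s) so the
        # stream samples the network rather than beating in step with it.
        time.sleep(random.expovariate(10.0))
except KeyboardInterrupt:
    sock.close()
```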

Packets whizz past multiple observation points, at each of which the timing of the packet’s passing is recorded. This timing data is captured using a variety of technical methods. A typical way of extracting test data timing from fast optical links is a small form-factor pluggable (SFP) transceiver like this one:

JDSU PacketPortal probe
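
In software form, a probe’s job looks roughly like the sketch below: watch for the marked packets at an observation point and log when each one passes. This is only an illustrative stand-in (a production probe such as the SFP above timestamps in hardware, in-line, with far better accuracy), and the field layout simply mirrors the hypothetical generator sketch earlier.

```python
import socket
import struct
import time

MARKER = 0xDE1A0001      # must match the generator's (hypothetical) tag
PROBE_ID = "probe-B"     # invented identifier for this observation point

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", 9000))

with open("timings.csv", "a") as log:
    while True:
        data, _addr = sock.recvfrom(2048)
        if len(data) < 20:
            continue
        marker, seq, _sent = struct.unpack("!IQd", data[:20])
        if marker != MARKER:
            continue     # not one of ours: customer data is never inspected
        # Record (probe, sequence, local arrival time) for later correlation.
        log.write(f"{PROBE_ID},{seq},{time.time():.6f}\n")
        log.flush()
```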

This data is then uploaded to a cloud-based processing system, where we do Clever Mathematics™ (more secret sauce) that compensates for clock skew between observation points and extracts the relevant static and dynamic network performance data.
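
The genuine mathematics is proprietary, but one textbook ingredient can be sketched: if a probe’s clock runs slightly fast relative to another’s, the apparent one-way delay between them drifts linearly with time, and fitting and removing that trend recovers the true delay variation. A minimal version:

```python
# Remove relative clock skew between two probes by least-squares fitting the
# apparent one-way delay (recv - send) against send time. The fitted slope is
# the skew; subtracting the trend leaves skew-free delay variation. (The
# absolute offset between unsynchronised clocks needs a separate reference.)
def deskew(send_times, recv_times):
    n = len(send_times)
    mean_x = sum(send_times) / n
    apparent = [r - s for s, r in zip(send_times, recv_times)]
    mean_y = sum(apparent) / n
    skew = (sum((x - mean_x) * (y - mean_y)
                for x, y in zip(send_times, apparent))
            / sum((x - mean_x) ** 2 for x in send_times))
    return [y - skew * (x - mean_x) for x, y in zip(send_times, apparent)]

# Example: 50 ppm of drift on top of a constant 10 ms delay is removed;
# the output is flat (about 12.5 ms: the mean drift remains as an offset).
sends = [float(i) for i in range(100)]
recvs = [s + 0.010 + 50e-6 * s for s in sends]
print(deskew(sends, recvs)[:3])
```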

The data that you get out

The typical kind of data the “network radiologist” gets out is a set of charts that look like this:

Diagnostic result for round trip

This is round trip data, which can be broken down into the upstream and downstream directions separately. Current network diagnostic approaches merely note an impairment to walking, without knowing which leg is broken.

Diagnostic result for one way traffic

There are other ways of processing the data, for instance mapping it to a ‘breach metric’ for the QoE of a particular application of interest. We’ll save that one for another day.

You can read more about some of the ways in which this data is analysed and used in our previous presentation, Advanced Network Performance Measurement Techniques. (The full webinar is on YouTube.)

Benefits of this approach

There are three key benefits to this approach:

The data stream does not disturb the system under observation. Some measurement services apply high test loads that stress the network: offering a full load, at the busiest period, on the highest-speed links, without a break. This affects the ongoing customer experience, incurs a significant transport cost, and can trigger network upgrade planning rules. Indeed, everywhere you put such a test probe appears to get great performance, because the probe’s own behaviour forces endless network upgrades! You need to avoid this “Heisenberg effect”.

The amount of data you need to collect is small, and therefore cheap. Standard approaches capture vast amounts of (the wrong) data, which itself has huge transport and storage costs. In the vain hope that correlation is causation, all kinds of regression techniques are then applied, which costs a fortune in processing. We only extract what actually matters.

The data you get enables a scientific approach to performance management. We measure the one and only metric that matters: ΔQ. This uniquely (a) is a strong QoE proxy, (b) captures both the static structural and the dynamic network behaviour, and (c) decomposes spatially (along the path) as well as temporally (causes of delay). As a result, we can both isolate issues with total confidence, and accurately predict what the appropriate resolution will be.
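
To illustrate the ‘causes of delay’ decomposition: in the ΔQ framework as published elsewhere, delay separates into a fixed, distance-related part (G), a packet-size-related serialisation part (S), and a variable part caused by contention (V). The sketch below, on synthetic samples, estimates G and S from the minimum delay seen at each packet size and attributes the remainder of each sample to V:

```python
# Synthetic (size, one-way delay) samples, invented for illustration.
samples = [
    (100, 0.0051), (100, 0.0072), (500, 0.0083), (500, 0.0110),
    (1000, 0.0123), (1000, 0.0155), (1500, 0.0163), (1500, 0.0190),
]

# The minimum delay observed at each size approximates the G + S*size floor.
floor = {}
for size, d in samples:
    floor[size] = min(d, floor.get(size, d))

# Least-squares line through the per-size minima: intercept ~ G, slope ~ S.
xs, ys = zip(*sorted(floor.items()))
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
S = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
G = my - S * mx

print(f"G ~ {G*1e3:.2f} ms, S ~ {S*1e9:.0f} ns/byte")
for size, d in samples:
    V = d - (G + S * size)   # contention-induced portion of this sample
    print(f"{size:5d} B: delay {d*1e3:6.2f} ms, of which V = {V*1e3:5.2f} ms")
```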

Practical uses of this technique

These techniques have been successfully used in a variety of environments, such as:

  • Optimising the data extraction from the ATLAS experiment at CERN’s Large Hadron Collider
  • Performance optimisation for DSL and 3G networks
  • Fault isolation for small cell deployment

These basic ideas are generic: you could use the same approach and mathematics to analyse business processes just as easily as packet data.

We are currently working on upgrading the system for a client. Instead of one-off ‘network X-rays’, or even repeated ones to create longitudinal data, we are creating a continuous monitoring system. This is more like a functional MRI scan than an X-ray, in that it captures the complete dynamic performance of the network over time, not just at a single instant. This allows for immediate detection of any deviation from the expected network behaviour, and proactive fault management.
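
As a toy illustration of the principle (the production system’s detection is considerably more sophisticated, and the threshold here is invented), a continuous monitor can keep a sliding window of recent delay samples and raise an alarm when a chosen percentile drifts past its baseline bound:

```python
from collections import deque

BASELINE_P95 = 0.030   # hypothetical 95th-percentile bound from a healthy period
WINDOW = 200           # number of recent samples to judge against

window = deque(maxlen=WINDOW)

def observe(delay_s):
    """Feed one delay sample; return True if behaviour has deviated."""
    window.append(delay_s)
    if len(window) < WINDOW:
        return False                   # not enough evidence yet
    p95 = sorted(window)[int(0.95 * WINDOW)]
    return p95 > BASELINE_P95

# Wired to a live probe feed, any True return becomes a proactive alert,
# raised before customers start complaining rather than after.
```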

If you would like to discuss how our approach can be used to help you increase QoE and decrease network cost, please get in touch. For more information on the NetHealthCheck™ service from Predictable Network Solutions, consult this sales presentation.