Kent Public Service Network: a template for telecoms success

An interview with Jon Aldington

Kent Public Service Network is an exceptionally efficient network operator, in terms of the value it delivers from the resources it uses. Despite lacking the economies of scale of large national operators, it provides equivalent services at around 1/3 lower cost. If replicated widely, this efficiency gain would be enough to transform the economics of the telecoms industry.

I interviewed Jon Aldington, who has had an instrumental role in the network’s management and operation, to find out how they achieve this feat.

MG: What is Kent Public Service Network (KPSN)?

JA: KPSN is a network connecting public sector sites in the County of Kent, which is in South East England. Kent covers around 2% of the land area of the UK, and contains a similar proportion of its population. It comprises a few moderately sized towns, plus large rural areas, despite being close to London.

The network links up 1,300 sites, nearly all within the County of Kent plus the unitary authority of Medway. It has an estimated peak user population of 370,000 during school term time.

KPSN itself is not a legal entity, but rather is a consortium of public sector organisations. They work collectively to obtain better value for money and service outcomes than if they separately bought telecoms services directly from the open market. Kent County Council procures connectivity and other services on their joint behalf.

What’s your role with respect to KPSN?

I was until very recently General Manager of one of the consortium partners, GOETEC. GOETEC has a role looking after the higher and further education community. I have been involved more broadly with KPSN, working with the team to set and shape the way KPSN sources and manages its services.

The KPSN consortium works to represent the interests of the demand-side users, and help them to contract the service delivery with other supply-side partners. They in turn source and manage the underlying resources. While I have recently moved on, I am still taking a very close interest in KPSN and in network performance.

A close partner we work with is a managed services supplier. One of their functions is to buy raw connectivity from the open market, and (by sharing it wisely) transform this into the services that end users consume. I was involved from 2010 to 2015 in a re-contracting process for this supplier, where I was the programme manager responsible for the procurement and subsequent migration.

How does KPSN create its services?

What makes KPSN special (and fairly unusual) is how we structure the supply chain so that its users keep control over the network, without having to worry about the details of its day-to-day operation.

KPSN’s users don’t go out and buy bandwidth or “black box” cloud services, where they are simply captive customers of a chosen supplier. Nor is KPSN a “buying club”, merely working together to beat down prices in exchange for more volume. Rather, we use the expertise within the KPSN consortium members to design and manage the services, in collaboration with our service provider partner, who then operates them for us to an agreed SLA.

That means we keep a handle on the service level being delivered, and retain control over the routers, firewalls, email filters, etc. We work closely with the provider to specify how they are deployed and configured in order to best deliver the performance being demanded. This is a very different model from going to a traditional service provider and asking for X bandwidth to be sent to Y postcode.

How does this save Kent’s taxpayers money?

To see how the model we have chosen works, it helps to start with an example. There are three district councils who have clubbed together to run their IT services jointly. That means they have a joint data centre service, and for resilience the data centre is spread across two locations, both with good connectivity to KPSN.

Part of the service design was the data replication between these two sites. Had they gone to the open market, they could have bought fixed point-to-point bandwidth with a technical SLA. If demand had then increased, the service provider would have demanded more money to build more bandwidth to stay within the SLA. That is the standard inflexible (and expensive) approach.

Instead, we are a partnership, buying routers and circuits. We had a detailed conversation about the options with the data centre manager, and we were able to explore different approaches. Some efficiency came from shifting the data transfer to overnight, when the rest of the network was less busy, and thus we could capture the reward for lowering peak demand.

We could also throttle the data, or use a different QoS class of service to move data in the daytime at a lower priority. We could also have conversations about routing those bulk transfers along a different path, to use our capacity more intelligently. Those kinds of controls would not normally be exposed to us.
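(MG: To make this concrete, here is a minimal sketch of what putting bulk replication in a lower class can look like: the bulk transfer only consumes whatever capacity the interactive traffic leaves spare, so the daytime peak is untouched. The link size, load profile and code are illustrative assumptions of mine, not KPSN’s actual configuration or tooling.)

```python
# A minimal sketch of bulk replication in a strictly lower class: the bulk
# transfer only uses whatever capacity the interactive traffic leaves spare,
# so the daytime peak is untouched. All figures are illustrative assumptions.

LINK_MBPS = 1000          # assumed core link capacity
BULK_GB = 900             # assumed size of the nightly replication job

# Assumed interactive load profile (Mbps) for each hour of the day.
interactive_mbps = [50, 40, 30, 30, 40, 80, 200, 400, 600, 700, 750, 760,
                    770, 760, 740, 700, 650, 500, 400, 300, 200, 150, 100, 60]

def drain_bulk(bulk_gb: float) -> None:
    """Simulate one day, hour by hour, with bulk data in the lower class."""
    remaining_mbit = bulk_gb * 8 * 1000        # convert GB to megabits
    for hour, load in enumerate(interactive_mbps):
        spare_mbps = max(LINK_MBPS - load, 0)  # bulk only gets the leftover
        sent_mbit = min(remaining_mbit, spare_mbps * 3600)
        remaining_mbit -= sent_mbit
        print(f"{hour:02d}:00  interactive {load:4d} Mbps  "
              f"spare {spare_mbps:4d} Mbps  bulk left {remaining_mbit/8/1000:7.1f} GB")
        if remaining_mbit <= 0:
            print(f"Replication finished during hour {hour:02d} "
                  f"with no impact on interactive traffic.")
            return
    print("Replication did not finish within the day; revisit sizing or routing.")

drain_bulk(BULK_GB)
```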

So instead of renting a fixed capacity, we have bought the underlying circuits and routers and retained control over them. The real win here is that you get much more flexibility to explore options that would be foreclosed to most telecoms users. As a result we extract the best possible use of those assets by working with customers in partnership. We’re frequently finding solutions that leave users happy, with their needs fully met and no increase in cost.

How does KPSN deliver a better experience as well as save money?

Again, this is best illustrated with an example. As you might imagine with a network this size, we have many public sector partners who want to run VoIP, which is sensitive to packet delay and loss. Classically, we would have an SLA with the service provider that says “we will never deliver a jitter level of more than X milliseconds on this network between any two points”.

This has two problems. Firstly, that SLA is over-stringent for non-voice traffic. Secondly, the delivered quality will nearly always exceed the SLA. That means you are paying a lot more just to make VoIP work. Furthermore, it is inflexible, in that when there is excessive load causing the SLA to be broken, you will experience degraded voice calls long before you trigger any upgrade or remedial action. This may have a significant provisioning time, during which there is user dissatisfaction with QoE.

Instead, we are able to measure and model the service quality ourselves. We then use that QoE visibility and control to optimise the network through QoS mechanisms (in the short term) and better planning and provisioning (in the longer term). We are not artificially constrained, so we are not forced to upgrade too early or too late.

How have you been advancing your efficiency over time?

We think one of the main success factors for us is how we make very detailed network measurements. Historically we had monitoring in place for network usage, averaged over a five-minute period, as is typical for our industry. This gives you some indication of whether you have a network with ample capacity.

However, it doesn’t tell you if a VoIP call has poor or perfect quality. Measurements taken over minutes miss the important short-term peaks. You could have a network that is 80% busy, but if packets are evenly spaced, voice packets fit in without any problem. Equally, you could have a network that is 1% busy, but if that comes in huge bursts, the voice packets can be badly impacted. A link might look empty, but voice quality is poor.
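(MG: The point about averages hiding bursts can be illustrated with a small simulation of a single link. The link speed, packet sizes and traffic patterns below are illustrative assumptions of mine, not KPSN measurements: an evenly spaced 80% load barely delays a voice packet, while a bursty 1% load can delay it by around ten milliseconds.)

```python
# A minimal sketch of why average utilisation says little about voice quality.
# We model one FIFO link and compare an evenly spaced 80%-load pattern with a
# bursty 1%-load pattern, then compute the queuing delay a voice packet could see.

LINK_BPS = 100e6                      # assumed 100 Mbps link
PKT_BITS = 1500 * 8                   # 1500-byte data packets
SERIALISE_S = PKT_BITS / LINK_BPS     # 0.12 ms to put one packet on the wire

def worst_voice_delay(arrivals_s):
    """Largest time from a data packet's arrival until the queue it joined has
    drained, i.e. roughly the wait for a voice packet landing just behind it."""
    backlog_clears_at = 0.0
    worst = 0.0
    for t in sorted(arrivals_s):
        start = max(t, backlog_clears_at)      # wait for packets already queued
        backlog_clears_at = start + SERIALISE_S
        worst = max(worst, backlog_clears_at - t)
    return worst

# Pattern A: 80% average load, packets evenly spaced across one second.
n_even = int(0.80 * LINK_BPS / PKT_BITS)
even = [i / n_even for i in range(n_even)]

# Pattern B: 1% average load, but the whole second's worth arrives in one burst.
n_burst = int(0.01 * LINK_BPS / PKT_BITS)
burst = [0.0] * n_burst

print(f"80% load, evenly spaced: worst extra delay {worst_voice_delay(even)*1e3:.2f} ms")
print(f" 1% load, single burst : worst extra delay {worst_voice_delay(burst)*1e3:.2f} ms")
```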

We realised we needed a different measurement. So we engaged with Predictable Network Solutions Ltd, who did “network tomography” for us. They looked at the busier links, and the passage of individual packets, deducing the likely quality of a voice call. The results surprised me a lot. As someone who has been in the network industry for a long time, I had assumed that 60% or 70% busy meant that there was no QoE problem.

What we found was that sometimes the link was 80% busy, but could happily carry voice traffic. At other times it could be under 10% (and sometimes well under 1%), yet there was significant potential degradation of a voice call due to bursty traffic. (MG: See the data for yourself here.)

The problem we found with QoS mechanisms is that if you deploy them to mix voice and bulk data then you are flying blind: you have no visibility of what QoE is truly being delivered. We are now working to instrument the network with high fidelity much more widely, so we can see these QoE degradation issues. We can then make the right scheduling choices, rather than use approximate rules of thumb.

As a consequence, we can run our network ‘hotter’, and defer capacity upgrades. Indeed, the idea of running a network at 100% busy would usually have its managers scared, and rushing to buy a bigger network. Yet there is no inherent technical reason why you shouldn’t run a network that hot. You just need to have measurements to tell you if you have a problem, and what you should be doing about it.

So in terms of advancing efficiency, KPSN is not fully there yet, but we can see the steps needed to do things that were not previously thought possible.

What room for further improvement do you see?

Another big challenge for KPSN is providing connectivity for many different needs: schools, local government, police, libraries, fire and rescue, and so on. From the emergency services we constantly hear that they need a resilient service, and confidence that a network element failure won’t impact that service. Meanwhile, for a primary school, a single non-resilient connection generally delivers more than enough reliability.

We have a large network, and have gathered enough data over time to know that any individual connection will, on average, suffer an outage every 3-5 years. It’s a distribution, so some will fail more often, others less so. When we talk to primary schools, they are willing to take a chance and pay roughly half as much for a connection that might, on average, be degraded or out for several hours once every few years.
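(MG: As a rough illustration of what that failure rate means in availability terms, here is a back-of-the-envelope calculation. The six-hour outage duration is my assumption, standing in for “several hours”; these are not KPSN’s actual statistics.)

```python
# A rough sketch of the availability implied by "an outage every 3-5 years,
# lasting several hours" for a single non-resilient connection.
# All figures are illustrative assumptions, not KPSN data.

HOURS_PER_YEAR = 8766          # average year length, including leap years

def availability(years_between_outages: float, outage_hours: float) -> float:
    expected_downtime_per_year = outage_hours / years_between_outages
    return 1 - expected_downtime_per_year / HOURS_PER_YEAR

for years in (3, 4, 5):
    print(f"One 6h outage every {years} years -> "
          f"{availability(years, 6) * 100:.3f}% availability")
# Even at the pessimistic end this is above 99.97%, which helps explain why a
# primary school might prefer to pay roughly half as much and accept the risk.
```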

The problem then is supporting both resilient and non-resilient requirements on the same network core. So we have been looking at this problem carefully, and believe that we can be a lot cleverer.

For a section of the core gigabit network, we would previously have needed a resilient gigabit failover link. However, only a small fraction of the traffic needs to be fully resilient, which means we can put in a 1Gbps primary link and back it up with a 100Mbps link. If the primary fails, the non-resilient users such as schools get de-prioritised. They don’t necessarily lose all service, but there is an impact.

The option to offer resilient and non-resilient services to any partner means you only pay for your actual availability needs. It’s the user’s choice: you can pay us more for a fully resilient service with no degradation in the event of a single failure, or less money but take on a quantified risk. You have the flexibility and can make a rational business choice.

As a result we are able to defer yet more expensive upgrades, and target capacity investments, keeping costs down and performance up.

How much is all this saving worth?

We estimate that if we had carried on along the traditional route, then we would have incurred a capacity upgrade capex of £1m, and recurring operational costs of £500,000 a year, within the next 3 years. As a result of the changes we have made, we can save most of this. We can delay upgrades significantly, and when we do upgrade, prices will have come down further, resulting in a double net saving. Meanwhile the service that users are getting is just as good.

To put that in perspective, our baseline recurring cost is about £1m. So we’re saving 1/3 of the “business as usual” cost to serve around 60 points of presence, some rural with only one connectivity supplier available. Whilst the absolute numbers may not be huge (due to our local scale), the saving is large on a proportional basis.
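(MG: One illustrative way to read those figures, treating them as approximate and not as an official KPSN cost model: avoiding the extra recurring spend alone accounts for roughly the one-third saving quoted, before counting the deferred £1m of capex.)

```python
# Back-of-the-envelope reading of the figures quoted above (illustrative only).

baseline_recurring = 1_000_000   # approximate current annual recurring cost
avoided_recurring  =   500_000   # extra annual opex the traditional route would add
deferred_capex     = 1_000_000   # upgrade capex avoided or pushed back

bau_annual = baseline_recurring + avoided_recurring   # "business as usual" cost
saving_fraction = avoided_recurring / bau_annual
print(f"Recurring saving vs business-as-usual: {saving_fraction:.0%}")  # ~33%
print(f"Plus capex deferred: £{deferred_capex:,}")
```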

Looking forward, both cost and QoE are important to us. The potential cost saving is difficult to estimate: supply costs are dropping, and demand is evolving, so it’s a moving target. No doubt there is a very significant further saving possible. We estimate in the long term that we will continue to provide around 1/3 overall savings, sometimes more.

Meanwhile, we can deliver better QoE, and we are able to measure it. When an organisation complains to us, say, that voice quality was terrible, then we can see exactly what was happening and how much (if at all) KPSN contributed to the problem. This contrasts with the “finger in the air” approach of traditional suppliers, and their meaningless terms such as “we won’t support voice unless QoS is on and load is under 90%”. By taking back control, we can tell you exactly what QoE was delivered at any time, using hard numbers and solid measurements.

What can other public service network operators learn from KPSN?

The most important message is that there is an alternative to the simple traditional approach of “measure average utilisation, and when it reaches 60% or 70%, upgrade”. Whilst that has traditionally proven mostly alright, there are significant edge cases where it fails, and it won’t give you the QoE or cost that you are looking for.

When we find links that are not fit to deliver high quality voice, we have other choices available to us: we can schedule the traffic differently, or route it another way. This kind of demand-driven thinking is fairly new, and not yet widely adopted across the industry. Operators are used to watching link utilisation and upgrading when it gets ‘hot’, and there is a strong commercial incentive for equipment vendors and telcos to keep it that way.

Because of the way we’ve procured services, we have a lot of freedom to optimise the supply to meet demand and, where it makes sense, to optimise the demand to meet supply.

The challenge is that as you start to implement multiple classes of service, or perform routing differently, then you have the potential to create complexity and new cost. We must balance that operational complexity against the cost of just paying for more bandwidth or another router, which might be expensive, but is very simple.

I should say at this point that it’s early days, and we’re only beginning to reap the benefits. However, the early indications are very good and I’ve no doubt that there’s a lot more potential.

We certainly believe that for our use case, with 60+ PoPs and 1,300 sites, the cost equation is very much in favour of intelligently applying such optimisations to give us better QoE and lower cost.

What are your observations on the telecoms industry as a whole and its level of sophistication?

Having managed a demand-led model with KPSN for several years, my personal view is that there are numerous examples of where the telecoms industry is failing its customers. What customers need is not bandwidth, but reliable application experiences.

For instance, I use VoIP calls a lot, and very frequently experience blips and glitches. We still can’t deliver consistently good quality audio. Indeed, some calls just fail completely, or the person you are talking to sounds like a Dalek, or becomes inaudible. Why is this happening in 2015? These failures should be extremely rare, yet they are all too common.

Another example of how we are failing our users is at home on broadband. I would like to be able to choose the order in which things degrade or fail as load increases. I do not want a teenager’s web surfing habits obliterating my business Skype call. Yet I can’t easily buy a service to guarantee any application outcome, like consistent iPlayer with no buffering. Why is this? We have the technology.

A final example: my wife works for a local company that does lunchtime training by video link. These sessions were regularly interrupted by people watching BBC News and browsing Amazon during their lunch hour. So the company banned Web access at lunchtime. The training session failures reduced slightly, but not significantly. So they bought more bandwidth, but this didn’t solve the problem either.

The eventual solution was a dedicated broadband line just for training videos. This did solve the problem but at very significant cost. Yes, applying QoS might have fixed the problem more cheaply, but it demonstrates that such services aren’t readily available on the market and the “just throw bandwidth at the problem” mentality reigns supreme.

Why is the industry struggling so much to make these basic services work? There should be no issue. Yet you cannot buy a reliable guaranteed service at any budget. As a result, customers have become accustomed to this failure. The whole industry needs a shake-up.

The lesson from KPSN is that there is an alternative (and better) approach. We buy the circuits and routers, and have a service provider to configure and manage them so that users do get the performance and reliability they need. The standard model of contracting to a given bandwidth and SLA wipes out whole swathes of options at a stroke.

For KPSN’s success to be replicated elsewhere, two things need to happen. Firstly, you need meaningful measurement in terms of the user application experience being delivered. That means instantaneous packet by packet measurements, not broad brush averages that at best are rules of thumb and at worst bear essentially no relation to QoE. Secondly, you need the right kind of contracts in place that deliver the application outcomes, whilst allowing the network provider the freedom to trade around resources to get the highest resource use efficiency.

To get in touch with Jon Aldington you can email him at jon.aldington@canterbury.ac.uk

For the latest fresh thinking on telecommunications, please sign up for the free Geddes newsletter.