The lean and antifragile data centre – part 2

In this second part of my interview with Pete Cladingbowl, we explore how much the data centre industry still has to learn about managing flows of value. The quality control techniques involved are well known in other industries, but have largely yet to be applied in this one.

The lean and antifragile data centre

Part 2 – The need for antifragile engineering

How good is the telecoms industry at managing information flows?

When I came to telecoms I was struck by the lack of established engineering principles. There was little idea of how to measure flow, or how to make a safety case that the flow would be sufficient to meet demand. The telecoms industry hasn’t yet had its lean or total quality management (TQM) revolutions. That means its services are often disconnected from customer outcomes and (billable) value.

The specific problem I encountered came with IP WANs and managed corporate networks (VPNs). I really struggled to understand how to get the right application outcomes. There were constant battles between the customer and sales, or between IT and the customer. Every time, the answer was more capacity, because it was the only tool you had: you couldn’t see or understand the flows.

There were various attempts to manage flow better, via MPLS, RSVP, and DiffServ, which helped a bit. Many customers wondered why they didn’t stay with old-fashioned (and expensive) ATM. Others just bought a fat (and expensive) Ethernet pipe plugged into a fat (and expensive) MPLS core. When it came to new applications like VoIP, they built a new (and expensive) overlay network.

So I was always puzzled, and when I started reading your work in the early 2000s, I realized I was not the only one wondering what was wrong. Why was this industry all about (expensive) “speed”, yet not making much money, despite its central economic role?

From other industries we knew that the underlying issue was identifying the constraint and using buffer management to regulate the flows around it. When you make capacity the constraint, you can only add more of it, not manage it. There were also plenty of clever people in telecoms, many of whom wanted better tools.
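To make the buffer-management idea concrete, here is a minimal Python sketch (mine, not something from the interview; the rates and buffer size are invented). It follows the familiar drum-buffer-rope pattern from the theory of constraints: work is only released at the pace the constraint can sustain, and what gets actively managed is the protective buffer in front of the constraint, rather than capacity everywhere else.

```python
# Minimal drum-buffer-rope sketch: admit work at the pace of the
# constraint and manage the protective buffer in front of it.
# All figures are invented for illustration.
import random

CONSTRAINT_RATE = 100   # units the constraint can process per tick (assumed)
TARGET_BUFFER = 300     # protective buffer held in front of the constraint
buffer_level = TARGET_BUFFER

for tick in range(10):
    # The constraint (the "drum") drains the buffer at its own pace.
    processed = min(buffer_level, CONSTRAINT_RATE)
    buffer_level -= processed

    # The "rope": only admit enough of the variable offered demand
    # to refill the buffer, deferring the rest.
    offered = random.randint(80, 140)
    admitted = min(TARGET_BUFFER - buffer_level, offered)
    buffer_level += admitted

    print(f"tick {tick}: processed={processed} admitted={admitted} "
          f"deferred={offered - admitted} buffer={buffer_level}")
```

The point of the sketch is that the only number being actively managed is the buffer in front of the constraint; nowhere does it reach for more capacity.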

The problem then was always the business case, as quality didn’t matter enough. The attitude was to just pay the customer some credits if they complained and sell them more capacity. Until now, there has been no real market for quality of service beyond “quantity of service”.

How did you make the leap to applying flow concepts to data centres?

I had the opportunity to build some data centres with Global Crossing and then Interxion, whose customers had high-end demands; the data centre was “mission critical” to them. There were capital projects building new infrastructure. I was working on how the data centre and connectivity related, and on the connections and cross-connects at Internet Exchanges (IXs).

Interxion’s customers understood the importance of high reliability and availability. Supplying these needs exposed the relative immaturity of telecoms data centres compared to other industries.

QoS in data centres largely comes down to “is it up or down?” – the focus being on the machines, not the flow of value they enable. Robustness is achieved through (expensive) redundant machines, and improvements in availability are achieved by adding yet more redundancy. This over-reliance on capex to solve flow problems should sound familiar by now.
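As a rough illustration of why redundancy is so tempting on paper (my arithmetic, not Pete’s, and it assumes the idealised case where units fail independently): each extra redundant unit multiplies away another factor of the single-unit downtime, so the availability figures climb quickly, while the flows between the units, and any common-mode hazards, are left untouched.

```python
# Availability of N redundant units in parallel, assuming each is
# available 99% of the time and failures are fully independent.
unit_availability = 0.99
for n in range(1, 4):
    parallel = 1 - (1 - unit_availability) ** n
    print(f"{n} unit(s): {parallel:.6f}")
# 1 unit(s): 0.990000
# 2 unit(s): 0.999900
# 3 unit(s): 0.999999
# Each extra (expensive) unit buys less, and common-mode failures
# (like a blocked exhaust duct) are not improved at all.
```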

My job at Interxion was to help improve customer service and reduce costs, just as in the other industries I had worked in. That meant fulfilling value-added demand, and not creating failure demand (like rework in a factory, or retransmits in a network). Being able to differentiate the two meant understanding the demands for information flow, and how the supply chain responded to those demands.

Both telecoms and data centres are relatively new industries, deploying complex machines at scale. These machines aim for robustness through redundancy. The standard definition of “success” is that the customer service is not interrupted if you lose an element in the mechanical infrastructure.

Robustness alone is not enough. We also need to be able to anticipate failure, so that we can prevent it, and to proactively monitor, respond and restore service. If we do these things, we will have good infrastructure that is not just robust, but also resilient.

Why do you say the traditional engineering approach to resilience is insufficient?

Many people will have heard of Nassim Taleb’s “black swan” events, which are outliers. Taleb is in the business of risk assessment, which involves judging the probability of something (bad) happening. Any risk assessment takes the likelihood of an event and its impact, and multiplies them.
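As a worked illustration of that multiplication (the figures here are invented, not from the interview), a simple likelihood-times-impact register can give a mundane, frequent failure the same score as a rare event that would halt the whole site, which is exactly how the high-impact outliers end up being set aside.

```python
# Toy risk-register arithmetic: score = likelihood x impact.
# Figures are invented for illustration only.
risks = [
    ("Single server fails",       0.20,   5),   # frequent, modest impact
    ("Generator exhaust blocked", 0.01, 100),   # "improbable", halts the site
]
for name, likelihood, impact in risks:
    print(f"{name}: score = {likelihood * impact:.2f}")
# Both risks score 1.00, so a naive register ranks them as equivalent,
# even though only one of them can take down the whole data centre.
```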

The standard data centre engineering approach is to take things that are high impact but low probability, put them to one side, and not worry about them. These “improbable” things that don’t get addressed then all happen at 4pm on a Friday, at 2am on a Sunday, or while you are on holiday and still contactable.

These failures can be prevented, and the way to do that is to take the perspective of flow, which is different from the perspective of the individual machines. Don’t get me wrong, we still care about the technology, people, and processes. These silos are good and necessary, but they define static forms of quality baked into machines, skills, and methodologies.

The purpose of the system as a whole is to enable flow, so we need to see how these factors interact with their context to satisfy a dynamic quality requirement. From a flow perspective, we want a system that automatically deteriorates in an acceptable way under overload. We know there is variability in demand, and our job is to be able to absorb it (up to a point).
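As a sketch of what “deteriorating in an acceptable way” could look like in practice (the flow names, priorities and numbers here are purely illustrative, not a real design): when offered load exceeds capacity, the lower-value flows are shed first, so the outcomes the customer cares about survive the overload.

```python
# Illustrative graceful-degradation sketch: under overload, shed the
# lowest-value demand first instead of letting everything fail equally.
from dataclasses import dataclass

@dataclass
class Flow:
    name: str
    priority: int    # 1 = most valuable to the customer
    load: float      # offered load, in arbitrary capacity units

def admit(flows, capacity):
    """Admit flows in priority order until capacity is exhausted."""
    admitted, remaining = [], capacity
    for f in sorted(flows, key=lambda f: f.priority):
        if f.load <= remaining:
            admitted.append(f.name)
            remaining -= f.load
    return admitted

flows = [Flow("voice", 1, 2.0),
         Flow("transactions", 2, 3.0),
         Flow("bulk backup", 3, 6.0)]

# Total offered load is 11 units but only 6 are available:
print(admit(flows, capacity=6.0))   # ['voice', 'transactions'] - backup is shed
```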

This ties into resilience, but is different in a critical way. Increasing resilience is a process of continual improvement at human timescales. In contrast, dynamic quality has to be managed at sub-second, machine-operation timescales. It is a result of how people, process, and technology all interact.

Taleb puts it nicely when he talks about “black swan” events that have low probability and high impact. He describes “antifragile” systems that are stressed by these outlier events and learn how to respond appropriately. How did a demand or supply change affect the flow? Did it make any difference, and what’s the right way to adapt?

My opinion is that we must adopt lean and antifragile engineering principles in data centres and networks if we are to meet future customer demands.

Can you give me an example to bring this to life?

These concepts of flow need to be seen holistically, as it’s not just about packet flows. For instance, if you run a data centre too hot, then the cooling equipment can’t cope. If you can’t flow the heat out, you have to turn machines off.

It’s a system with an environmental context: energy comes in as electricity; electronics compute and connect; and mechanical and heat energy flow out. There are energy flows that connect these machines together. It is a single system with dependent events and variability, just like many other engineering systems.

Let me give you an example of how one seemingly small thing can make a huge set of expensive infrastructure very fragile. Energy flow through a data centre often depends on generators when there is a problem with grid power. During one “flow audit” I did, I was shown the big shiny generators and fuel tanks, and told how much capacity they had.

But I focus on the flow, not the capacity. Generators need air flowing in and exhaust flowing out, and I followed these flows from the generator room to where they entered and left the building. There were several hazards that could very easily block these flows and bring the whole system to a halt. Those hazards were removed, or monitoring was put in place to detect if they became a problem, so the entire system was made less fragile.

Other people would have focused on the robustness of the machines and their level of redundancy. My “lean” eye was on the flow of the system as a whole; its constraint was those air and exhaust ducts, and attending to them prevented a black swan event from occurring.

Part 3: A new Internet architecture & politics

Part 4 to follow: A new intelligent network emerges

For the latest fresh thinking on telecommunications, please sign up for the free Geddes newsletter.