The lean and antifragile data centre – part 1

Cloud is a new technology domain, and data centre engineering is still a developing discipline. I have interviewed a top expert in cloud infrastructure, Pete Cladingbowl. He has a vision of the ‘lean’ data centre and a better kind of Internet for users to reach it. He also has a roadmap for how these can be practically realised. The key is to apply established theories of value flow from more mature industries.


Part 1 – The importance of managing flow

MG: How did you become an expert in systems of value flow?

PC: As a young student engineer I worked in a coal mine in South Africa. There were several of us doing projects, all of which had the goal of improving the productivity of the mine. My task was to produce a maintenance training plan for the conveyor belts that moved the coal from the rock face to the railway truck, which delivered coal to the customer. To justify the additional cost resulting from improved maintenance I studied the impact of the breakdown of that flow on the mine’s overall performance.

At the end of the course, other student engineers presented their findings on how to improve the cutting at the rock face: doing it faster by looking afresh at the geology, and designing new machines to cut coal. Naturally, the managers were all very interested in how much more coal they could cut by spending money on new machines and geological studies.

My team was last. I put up an overhead projector slide showing how much money the company was losing from breakdowns of the conveyor belts. I merely had to look out of the window, watch a breakdown, and see that no coal went into a truck. I then worked out how much throughput was being lost in terms of revenue. When the conveyor stopped, everything behind it did too, so the amount the machines could cut at the rock face was constrained by the rate at which the conveyor belts loaded coal onto the trucks.

Our recommendation was that an improvement to the maintenance regime would have a big reward for only a small cost. The Mine Captain stopped the meeting, called in his managers, and put this into action right away! The capacity of your machines doesn’t matter if you can’t pull the resulting value through the complete system.

This is not the only example. At another mine there was a bonus on blasting gold ore, but the processing plant couldn’t process it fast enough. As a result, the company was paying over the odds merely to grow a mountain of rock.

What I brought to the cloud industry, and what helped me to make a valued contribution, were such concepts of flow. In particular, I have learned how to balance flow through systems and to understand the relationship between supply and demand.

In what capacity have you applied this learning in cloud and telecoms?

In the ICT sector I have occupied a number of senior executive roles, including SVP Engineering & Operations at Interxion, and at Global Crossing (now Level3) for EMEA. I have been responsible for a breadth of functions: engineering, operations, customer service, and IT. More recently, as CEO and founder of Skonzo Ltd, I have consulted for IXcellerate in Russia on data centre design.

My route to these senior positions was through being a project leader and advisor for the design, construction and operation of multiple digital supply chains and cloud infrastructures. I have been responsible for the operation of infrastructure all the way from global subsea networks, through 1,000-site wide-area networks (WANs), to 100,000-seat hosted VoIP platforms.

This has exposed me to a wide range of problems at the design, build, and operate stages. These occur at every layer of the cloud stack, from pure colocation IaaS (like generators and cooling systems) to hosted software platforms.

In your experience, how have other industries adopted flow-based principles?

Primary industries, like oil and gas, have long known that flow is important, all the way through the system. In the secondary manufacturing sector we have also seen a transformation towards managing flow.

For instance, early in my career, while working in precision manufacturing, I found that the materials requisitioning software assumed the machines had infinite capacity to process the work it scheduled. This resulted in flow problems that demanded a different way of thinking. So we learned from the Toyota Production System (TPS) about flow and the importance of reducing work in progress (WIP).

I also learnt valuable lessons from studying Kaizen (continuous improvement). One Kaizen expert told me a story of when he was in Japan and took a tour of a factory.

He saw one of the machines and remarked to his host: “I remember the old X51 – they produce 50 widgets a minute. We have the new X500 and it does 100 a minute!” The Japanese host quietly mentioned that their “obsolete” X51 had been continually improved and now produced 120 a minute. This underlined to me the importance of continuous improvement. You must never stop improving and continually reducing waste!

Knowing what to improve is something that I learnt from the Theory of Constraints (TOC). Every system has a flow constraint, and improving that improves the entire system. Improving anything else is a waste of time and money.
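Editor's illustration: the Theory of Constraints point above can be sketched in a few lines of Python. This is a minimal sketch, not something from the interview. It assumes stages arranged in series, and the stage names and tonnes-per-hour figures are hypothetical, loosely modelled on the coal mine story.

```python
def system_throughput(stage_capacities):
    """Return (throughput, constraint) for stages arranged in series.

    The slowest stage sets the pace for the whole line.
    """
    constraint = min(stage_capacities, key=stage_capacities.get)
    return stage_capacities[constraint], constraint


# Hypothetical capacities in tonnes per hour (illustrative only).
mine = {"cutting": 500, "conveyor": 300, "loading": 450}

baseline, bottleneck = system_throughput(mine)

# Spending on a non-constraint (faster cutting) changes nothing.
faster_cutting, _ = system_throughput(dict(mine, cutting=800))

# Spending on the constraint (the conveyor) lifts the whole system.
better_conveyor, _ = system_throughput(dict(mine, conveyor=400))

print(f"constraint: {bottleneck}")
print(f"baseline: {baseline} t/h")
print(f"after faster cutting: {faster_cutting} t/h")
print(f"after better conveyor: {better_conveyor} t/h")
```

In this toy model, extra cutting capacity leaves output unchanged, while improving the conveyor raises it. Real systems also have variability and buffers, which this sketch ignores.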

Managing the entire system as a set of flows is the key to improving customer satisfaction, increasing throughput and reducing costs. Achieving all three at once is what makes managing flow (rather than capacity) so powerful. This is done by controlling the flows through scheduling and buffer management.

In these other industries product engineering was fairly mature, especially reliability engineering and how to make things safe. What was still developing was the application of these new flow management methods. They give a holistic understanding of systems and the flow through them, by understanding the relationship between supply and demand.

How did you begin to apply these ideas to telecoms?

In the oil industry I had been running wireline networks, both analogue and digital. When based on the oil rig we collected vast amounts of data from the oil well, processed it, and transmitted it to the customer. The first computer I used was a PDP-11 with 2Mbit of memory that was as big as a whole blade server today.

My first data transmission from the North Sea to Houston was at 9600 baud. The importance of computing power and quality network throughput to efficiently delivering product to the customer was ground into me during many long nights slowly transmitting and retransmitting information.

From the oil industry I moved to manufacturing and supply chain management. Here Manufacturing Resource Planning (MRP) computers were predominantly used to decide the production schedule, i.e. what task should be done by whom and when.

The flaw in MRP schedules was that they assumed infinite capacity, as if there were no constraint. We tore those plans up. For production lines making standard product, we gave the teams Kanban boards instead, to visualise workflow and limit WIP. In the job shops, where the product changed frequently, we applied TOC’s Drum-Buffer-Rope. Whilst the MRP system was good for a high-level plan of what to order, the software was only as good as the rules of its method, which didn’t manage the flow.

I then worked at Racal Telecom on IP networks and X.25. I was tasked with taking a new product to market that carried voice and data over one link (and so was ahead of its time). I worked my way through network operations, IT, operational support systems, workflow management and inventory systems. I found that what really mattered was network performance management, i.e. managing the flows.

Then Racal was bought by Global Crossing, and I joined the team managing their new global subsea networks. When Global Crossing entered Chapter 11 bankruptcy we had no capex, and the headcount went down from 14,000 to just a few thousand.

At the time the network had only just been completed and reliability was poor. I set a target of 99.999% availability on DWDM/SDH networks. People thought we were crazy, but we achieved it. It required a continuous improvement methodology: we measured, isolated and managed the faults.

That target and approach were then applied to the IP network, which is now (via the acquisition of Global Crossing by Level3) the biggest IP network in the world.
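Editor's illustration: to give a sense of how demanding the 99.999% availability target mentioned above is, the arithmetic can be sketched as follows. This is a minimal sketch assuming a 365-day year and availability measured over the whole year. The function name is hypothetical. Five nines leaves roughly 5.26 minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Downtime budget per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare a few common targets with the "five nines" goal.
for target in (99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {budget:.2f} minutes of downtime per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why hitting the target depended on measuring and isolating individual faults.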

Part 2: The need for antifragile engineering

Part 3: A new Internet architecture & politics

Part 4 to follow: A new intelligent network emerges
