Why do resilient networks fail?
Any network can fail, including resilient networks
There have been a couple of relatively spectacular network failures that have made it into the industry press in the past two weeks. Last week Tesco broadband lost a major part of their network, leaving many customers without internet access. This week there was a failure that took out London’s emergency services telecommunications network. I suspect both of these networks were designed with resilience in mind – especially the emergency services network – but they still failed.
About 40 years ago I remember watching a news item on TV about the Advanced Passenger Train (APT), which was being lauded as the future of high-speed rail. The APT was initially designed to run at high speed on the West Coast Main Line, which has many bends, and it was the first train to use a tilting mechanism to maintain speed through them. The APT suffered from many design issues, and one of my lasting memories is of one of the design engineers being interviewed on the platform next to the train, looking totally deflated. I was part way through my apprenticeship at the time, I was beginning to understand that look, and I really felt for him. Clearly this was an innovative design that on paper should have worked. But it certainly did not live up to expectations that day.
“Just a week later and another major failure, this time affecting London’s emergency services.”
I am sure any engineer involved in designing complex systems knows exactly how that engineer felt. It doesn’t matter how much time you spend on the design, build and testing stages; it is inevitable that some unknowns remain to be resolved post-build. Assuming your design sees it through to go-live, you can then look forward to the acquired issues – problems that arise during operation. Some are relatively easy to spot and resolve, while others can present a real challenge lasting days, weeks, months or even years. Having spent much of the past 30 years designing and troubleshooting resilient networks, I have investigated many serious problems, and some of these have put me in a similar place to that APT engineer. I hasten to add that on the majority of these occasions I was engaged to troubleshoot for third parties, where I had to unravel someone else’s design.
Tesco made the national news last week with their record £6.4 billion business losses. The day before that news broke, their broadband service failed, leaving 10,000 customers in different parts of the UK without service for more than six hours. The problem was said to be due to a “technical failure”, which they resolved with the help of their network services provider Vodafone. TalkTalk are set to buy Tesco broadband later this year.
“It was either good contingency measures or good fortune that prevented any serious outcome.”
Just a week later and another major failure, this time affecting London’s emergency services. The emergency services network is a dedicated secure system called Airwave. Airwave was originally established by Telefonica UK Ltd (O2) but was later bought by two Macquarie Group funds. An ambulance service spokesman said Airwave had been down for less than 40 minutes. The service is critical for public and officer safety and is used to summon emergency help for any officer or staff member in trouble on average once every six minutes. It was either good contingency measures or good fortune that prevented any serious outcome.
I would assume that both the Tesco broadband network and Airwave are resilient networks. I would take an educated guess and assume that Airwave is highly resilient. But clearly something caused major outages in both cases. So, why do resilient networks fail?
Design is always the first consideration. If the network is designed to be resilient, tolerates the failure of any single component and has no single points of failure, then the design can almost be ruled out. But I will come back to that later.
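As a rough illustration of what “no single points of failure” means in practice, one sanity check on a topology is to look for nodes whose removal would split the network – articulation points, in graph terms. The sketch below is a minimal example only; the topology and node names are invented for illustration and have nothing to do with the networks discussed above.

```python
# Minimal sketch: flag single points of failure in a network topology by
# finding articulation points (nodes whose removal disconnects the graph).
# The topology and node names are purely illustrative.

from collections import defaultdict

def articulation_points(adjacency):
    """Return nodes whose failure would disconnect the topology
    (recursive DFS – fine for small illustrative graphs)."""
    visited, disc, low, parent, result = set(), {}, {}, {}, set()
    timer = [0]

    def dfs(u):
        visited.add(u)
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adjacency[u]:
            if v not in visited:
                parent[v] = u
                children += 1
                dfs(v)
                low[u] = min(low[u], low[v])
                # A non-root node u is a single point of failure if a child
                # subtree cannot reach any of u's ancestors without going via u.
                if u in parent and low[v] >= disc[u]:
                    result.add(u)
            elif v != parent.get(u):
                low[u] = min(low[u], disc[v])
        # A root node is a single point of failure if it has 2+ DFS children.
        if u not in parent and children > 1:
            result.add(u)

    for node in adjacency:
        if node not in visited:
            dfs(node)
    return result

# Hypothetical topology: two core switches, most sites dual-homed,
# but "site-c" hangs off a single aggregation switch.
topology = defaultdict(set)
links = [("core-1", "core-2"), ("core-1", "site-a"), ("core-2", "site-a"),
         ("core-1", "site-b"), ("core-2", "site-b"),
         ("core-1", "agg-1"), ("agg-1", "site-c")]
for a, b in links:
    topology[a].add(b)
    topology[b].add(a)

print(sorted(articulation_points(topology)))  # ['agg-1', 'core-1'] – both need attention
```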
“…problems can arise when the power does not fail cleanly.”
Power is often blamed for many network failures. Resilient networks are designed to survive power failures, but problems can arise when the power does not fail cleanly. If there is a total loss of power it is easy to initiate automatic survival and route traffic around the problem. But when the problem is intermittent, or there is a partial power failure – for example one of the DC power lines fluctuating – the monitoring and rerouting mechanism may fail, routing live traffic into a malfunctioning part of the network. It was reported that the Met Police believed the Airwave outage was due to a power failure. Unless this was the result of a fundamental design shortfall, I would guess it was due to either a complex or partial power failure, or a coincidental failure of the primary and backup power supplies. On several occasions I have seen a series of power failures where the primary, secondary and generator backup have all failed. I have also seen generators start up and then trip the breakers, promptly run out of fuel and, on two occasions, blow up the power supplies in the equipment.
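To illustrate the partial-power problem, here is a minimal sketch of the detection gap: a monitor that only reacts to a clean, hard failure will leave traffic flowing through equipment whose DC feed is fluctuating. The voltage thresholds, sample values and function names are assumptions for illustration, not details from any of the networks discussed above.

```python
# Minimal sketch of the failure-detection gap: a monitor that only reroutes on
# a clean "down" keeps sending traffic through a node whose DC feed is
# fluctuating. Thresholds and readings are hypothetical.

from enum import Enum

NOMINAL_VOLTS = -48.0          # typical telecoms DC plant (assumed)
HARD_DOWN_VOLTS = -20.0        # beyond this the node is effectively dead (assumed)
DEGRADED_TOLERANCE = 0.10      # +/-10% drift treated as a partial power problem (assumed)

class PowerState(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"

def classify(volts: float) -> PowerState:
    """Classify a single DC voltage reading."""
    if abs(volts) < abs(HARD_DOWN_VOLTS):
        return PowerState.DOWN
    drift = abs(abs(volts) - abs(NOMINAL_VOLTS)) / abs(NOMINAL_VOLTS)
    return PowerState.UP if drift <= DEGRADED_TOLERANCE else PowerState.DEGRADED

def should_reroute(samples: list[float], naive: bool) -> bool:
    """Naive logic reroutes only on a hard failure; the safer logic also
    pulls traffic away when readings wander out of tolerance."""
    states = [classify(v) for v in samples]
    if naive:
        return all(s is PowerState.DOWN for s in states)
    return any(s is not PowerState.UP for s in states)

# A fluctuating feed: never fully dead, but clearly not healthy.
readings = [-48.1, -43.5, -47.9, -39.0, -48.0]
print(should_reroute(readings, naive=True))   # False – traffic stays on a sick node
print(should_reroute(readings, naive=False))  # True  – degraded feed treated as a failure
```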
Software bugs in networking equipment can be devastating and are capable of taking the most resilient networks down. They are typically intermittent, difficult to identify and resolve, and can waste a huge amount of resource in the process. Some years ago I investigated a devastating issue in a highly resilient network in the financial services sector. The issue affected all of their 1,000-plus sites, but not at the same time; when the problem arose it would take the site out completely. By the time I was engaged the problem sat between two vendors and the customer was about to take legal action. I traced the problem to a bug in the modules that had been added to the equipment on each site – ironically, a component required for network resilience. This particular problem had been blighting the customer for months.
“The options open to network staff and third party technicians to create havoc are endless.”
Software updates can also cause resilient networks to fail. Around 10 years ago one of my customers lost their entire MPLS network for over an hour following a software update within the public network. Although the MPLS network was itself resilient, it still went down; because we had built a layer of resilience completely autonomous from the MPLS network, their service survived. The customer was not even aware of the major outage, despite the MPLS network failing during peak traffic time.
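For what it’s worth, the principle behind that autonomous layer can be sketched very simply: probe the primary path and an independent backup path, and steer traffic to the most preferred path that is actually alive. The probe targets, addresses and the switch_route() placeholder below are hypothetical – a real implementation would hook into routing policy rather than print a message.

```python
# Minimal sketch of an autonomous resilience layer: health-check each path in
# order of preference and steer traffic to the first one that responds.
# Probe targets and addresses are assumptions for illustration only.

import socket

PATHS = [
    {"name": "mpls-primary", "probe": ("10.0.0.1", 179)},     # hypothetical peer reached over MPLS
    {"name": "internet-vpn", "probe": ("203.0.113.1", 443)},  # hypothetical independent overlay path
]

def path_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Crude reachability probe: can we open a TCP connection to the far end?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def switch_route(path_name: str) -> None:
    # Placeholder: in a real deployment this would adjust routing policy
    # (e.g. route preference) towards the chosen path.
    print(f"steering traffic via {path_name}")

def select_path() -> None:
    for path in PATHS:                      # ordered by preference
        if path_is_up(*path["probe"]):
            switch_route(path["name"])
            return
    print("no path available – raise an alarm")

select_path()
```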
Accidental human intervention – a risk in any network. This could be something as simple as someone plugging a network cable into the wrong socket or an IT technician inadvertently unleashing a virus onto an operational network. The options open to network staff and third party technicians to create havoc are endless.
“…I have seen computer room air conditioning units resembling Niagara Falls”
Malicious intent is a more sinister side of human intervention, but it does happen. I have seen major incidents where an employee has had a grudge against his boss and has taken revenge on the network; on one occasion the individual nearly electrocuted himself in the process. Several years ago there were also cases of theft from data centres and communications installations, where thieves broke into the facilities and stole modules from the switches and routers.
Unforeseen circumstances – any failure of a resilient network could be described as unforeseen, but here I am referring to anything not covered above. In my own experience I have seen computer room air conditioning units resembling Niagara Falls, sending an occasional deluge through the data centre. On another occasion, at a different location, the air conditioning had failed and the equipment reached temperatures that were literally hot enough to fry an egg. Network and data centre designers must be diligent in their design process to ensure that as many potential unforeseen circumstances as possible are designed around. This is where experience really comes into its own.
What can be done?
Although design plays a major part in resilience from the outset, change management – with well-rehearsed change and regression planning – plays a vitally important role in ensuring the resilience is not compromised during the operational life of the network. The rehearsal process will ideally include an offline facility to test any changes as well as the regression plan. A critical part of the regression plan is to determine the latest point at which to stop the change and action the regression if the change does not go to plan. Test processes must be as near to real life as possible to be of any real use, even if the scale cannot be reproduced. Test procedures must be documented, and as many potential mid-change failures as possible need to be taken into consideration and included in any change rehearsals. If as many hypothetical problems as possible have been considered, and the course of action or regression rehearsed, the team carrying out the change will be prepared if any of these problems occur – even if it is at 2am.
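That latest point at which to stop and regress is easy to state but easy to lose track of at 2am, so it is worth computing rather than guessing. The sketch below is one hedged way of doing that; the step names, durations and change window are purely illustrative.

```python
# Minimal sketch of the "latest point to regress" idea: before each step of a
# change, check whether the remaining window still leaves enough time to run
# the rehearsed regression plan. Steps, durations and times are illustrative.

from datetime import datetime, timedelta

WINDOW_END = datetime(2015, 6, 20, 4, 0)   # change window closes at 04:00 (assumed)
REGRESSION_MINUTES = 45                    # rehearsed rollback takes ~45 minutes (assumed)

STEPS = [
    ("back up configs",            10),
    ("upgrade standby supervisor", 35),
    ("fail over to standby",       15),
    ("upgrade former primary",     35),
    ("verify and hand back",       20),
]

def can_still_regress(now: datetime, next_step_minutes: int) -> bool:
    """True if, after running the next step, a full regression would still
    finish inside the change window."""
    finish = now + timedelta(minutes=next_step_minutes + REGRESSION_MINUTES)
    return finish <= WINDOW_END

def walk_change(start: datetime) -> None:
    now = start
    for name, minutes in STEPS:
        if not can_still_regress(now, minutes):
            print(f"{now:%H:%M} STOP before '{name}': past the latest regression point")
            return
        print(f"{now:%H:%M} proceed with '{name}' ({minutes} min)")
        now += timedelta(minutes=minutes)
    print(f"{now:%H:%M} change complete")

# Starting at 02:00, the plan halts before "upgrade former primary" because a
# rollback could no longer complete before the window closes.
walk_change(datetime(2015, 6, 20, 2, 0))
```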
The new ESN
While researching this blog I discovered that Tesco broadband are in the process of being sold to TalkTalk and the government are in the process of procuring a new Emergency Services Network (ESN). What I found most interesting about the ESN is that the new solution is going to use commercially available networks, with special Service Level Agreements that give the emergency services priority over public traffic. Airwave is a dedicated network using spectrum assigned to the emergency services, but this approach was deemed too expensive for the ESN. The contracts for the ESN will be issued later this year, with installation planned from 2016 through to 2020 as the existing Airwave contracts expire.
Useful links
Tesco broadband offline for thousands across UK
London’s emergency services experience telecoms failure