Frank Dederichs: I completely reject the claim that we have neglected our backup systems, let alone that we have done so systematically. We invest enormous sums not only in our infrastructure but also in the reliability of our systems. In many cases, these strategies work perfectly. That is precisely what goes unnoticed, and that's exactly as it should be.
Two strategies are used to ensure reliability: first, building redundancy, and second, making systems resilient. Redundancy has been a constant topic in the public debates of recent weeks. By “redundancy” we mean that the network is designed in such a way that important components or communication channels are duplicated. In an emergency, the duplicate takes over the task of the defective component.
However, the experience of recent weeks has shown us that redundancy alone is not a catch-all solution: in some special cases it doesn't work, and too many redundancies add complexity. Therefore, in addition to existing redundancies, we also want to further improve the resilience of our systems. Quality is our absolute top priority. The difference can be illustrated with a car tyre. If your tyre deflates as the result of a puncture, you just replace it with the spare. That's redundancy. But more redundancy is not always the better solution: there's little point in having five spare tyres. We are therefore focusing increasingly on resilience. We would consider run-flat tyres to be resilient because they keep on rolling as they lose air. They are a bit slower and they don't last forever, but they will get you to the next garage.
The periods between software updates have become shorter and the lifespan of hardware is decreasing. The network also has to be continuously extended to cope with the growing volume of data and many other requirements. As a result, there has been a huge increase in the speed at which we have to make changes to our systems. We now make more than 4,000 changes per week to our systems – so errors can never be completely ruled out.
With each adjustment we have to ask: how complicated is it, and what is the potential for damage? The answers determine the classification that governs how we handle the change. This assessment can lead to errors of judgement, as was the case with the fault on 11 February 2020, when a software update was installed simultaneously on several critical network components in one night because the risk was incorrectly assessed as low. As a precaution, the changeover should have taken place over two nights, with only half of the components affected each night. This is why we are now reviewing our processes to further reduce the potential for errors during maintenance work.
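The idea of classifying a change and then staging it can be sketched in a few lines of Python. This is only an illustration and not Swisscom's actual process: the 1-to-3 scales, the score thresholds and the names `Change`, `risk_class` and `plan_rollout` are all assumptions made for the example. A change assessed as low risk goes out in one night; anything else is split into two nights with half of the components each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    name: str
    complexity: int        # hypothetical scale: 1 (simple) to 3 (complicated)
    damage_potential: int  # hypothetical scale: 1 (minor) to 3 (severe)


def risk_class(change: Change) -> str:
    """Combine both questions into one class (thresholds are assumptions)."""
    score = change.complexity * change.damage_potential
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


def plan_rollout(change: Change, components: list[str]) -> list[list[str]]:
    """Return one batch of components per maintenance night."""
    if risk_class(change) == "low":
        return [list(components)]
    # Anything above low risk is split across two nights, half each.
    half = (len(components) + 1) // 2
    batches = [components[:half], components[half:]]
    return [batch for batch in batches if batch]


update = Change("software update", complexity=2, damage_potential=3)
for night, batch in enumerate(plan_rollout(update, ["r1", "r2", "r3", "r4"]), 1):
    print(f"night {night}: {batch}")
```

The sketch also shows where the weak spot lies: if the two input scores are judged too low, the same update falls into the single-night branch.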
New components are usually tested in the laboratory before going live. In the lab network, for example, it is possible to check how well five components will work with the traffic simulated by a load generator. However, functioning well under laboratory conditions does not guarantee that the components will work in the real network with thousands of components.
It is not always possible to conduct a test. For example, if a serious security vulnerability is discovered, it must be fixed quickly. So to a certain degree we have to accept the risk of malfunction because the usual test procedure cannot be used.
Operational stability is developing positively in the long term. This is the case, for example, in the residential customer segment, where the amount of downtime has been reduced by 40% in the last three years. Swisscom is Switzerland's main provider, so a fault automatically affects more customers than it would with our competitors. We want to improve our performance for business customers. This is where even the smallest outage has a major impact – digitisation means that our infrastructures, systems and software are more closely integrated with our customers' business processes than ever before. So tolerance thresholds are much lower in the event of faults.
No telecommunications company in the world can ignore IP, the global standard on which digitisation is based. Services are ever more closely connected now. Photos taken on our smartphones are transferred to the cloud. Our TVs can be operated with our smartphones. Networks have to be connected so that the platforms can exchange data. It is the IP protocol that makes this networking possible. It is not the protocol itself that's tricky, but the fact that these interconnected networks exist. And just like any other technology, IP is not completely fault-free. But it is just as reliable as analogue telephony, and by switching off analogue technology and switching to IP we have reduced the complexity of the network and thus its susceptibility to disruption.
Swisscom reacted swiftly to the events of January and February 2020 and implemented various short-term actions. The process for implementing changes was tightened up immediately. The aim is to be more consistent in preventing risk and to provide closer support for critical changes. Alongside this, we are already working with the emergency services to increase the reliability and redundancy of the systems as quickly as possible.
We have also initiated two projects that are expected to have an impact in the medium to long term. First, we are systematically examining our network and our systems for “single points of failure”. These are individual weak points that could, in the worst case, put an entire system out of action. If we find this kind of configuration, we make it a high priority to eliminate the vulnerability as soon as possible. Second, we have set up a broad-based audit to examine all of our systems, networks, processes, culture and other issues. Our aim is to make long-term improvements.
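To make the term concrete: if a network is modelled as a graph, a single point of failure is a node whose loss cuts the remaining nodes off from one another. The following Python sketch is a minimal illustration under that assumption, not a description of Swisscom's tooling. The topology and the function names are hypothetical, and the brute-force approach (remove each node in turn and test connectivity) is chosen for readability rather than speed.

```python
from collections import deque


def is_connected(adjacency: dict[str, set[str]], skip: str) -> bool:
    """Check whether all nodes except `skip` can still reach each other."""
    nodes = [node for node in adjacency if node != skip]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour != skip and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)


def single_points_of_failure(adjacency: dict[str, set[str]]) -> list[str]:
    """Return every node whose removal disconnects the remaining network."""
    return sorted(n for n in adjacency if not is_connected(adjacency, n))


# Hypothetical topology: a, b and c form a ring; d hangs off c alone.
network = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(single_points_of_failure(network))  # ['c']
```

In this toy topology the ring between a, b and c is redundant, but d depends entirely on c. Adding a second link from d to a or b would remove the weak point.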
Frank Dederichs is a member of Swisscom's division management for IT, Network & Infrastructure and is responsible for Cloud Engineering & Operations.