Voice recognition for dialects

AI learning Swiss German

Voice recognition is one of the most important future technologies – provided it is also able to understand dialect. Swisscom’s AI team is rising to the challenge by developing voice recognition systems for speakers of Swiss German.

Text: Christoph Widmer, Image: Keystone, 09 April 2018

Ueli Gerber has finally found the time to clear out his cellar. However, just as he picks up his valuable vase – a family heirloom – the apartment block in Bern suffers a power cut. As his hands are full, Gerber decides to activate the torch function on his smartphone using a voice command: “Hey Siri,” he grunts, “stöu d taschepfunzle ah!” [turn the torch on]. However, the command is not executed. Even after further attempts, the device does not react to Gerber’s instruction.

Under such pressure, it has slipped Ueli Gerber’s mind that Apple’s voice recognition does not work for Swiss German. Other voice-controlled systems, such as OK Google, Alexa and Cortana, are likewise unable to respond to commands issued in dialect. Voice recognition systems are primarily developed for the world’s most widely spoken languages, such as English or Mandarin, both of which have over a billion speakers. By contrast, there are only around 4.9 million speakers of Swiss German.

Voice recognition: the future is now

As a result, the technology is booming in the English-speaking world in particular: according to estimates, 40 million voice-controlled gadgets, such as the Amazon Echo or Google Home, have already been sold in the USA. These smart speakers are already ensuring that we no longer need to use our smartphones or computers in the home: on the basis of voice commands, they provide information about weather and traffic, add appointments to your calendar, play your favourite playlist or, in smart homes, switch on lights and coffee machines. Forecasts predict that, by 2020, around 240 million American households will be home to one of these “intelligent personal assistants”. As user interfaces are coming ever closer to replicating the nature and manner in which people interact with each other and the environment, voice recognition and voice control are viewed by experts as groundbreaking technologies. Futurist Amy Webb even speaks of the decline of the smartphone: as voice recognition is set to be omnipresent in the future, it is expected that there will no longer be any need to hold your mobile phone – and this does not just apply within your own four walls.

To ensure that voice recognition is able to realise its full potential, access to the technology must be as natural as possible. A fully developed “voice user interface”, or VUI for short, should ideally also be able to cope with complex sentences and background noise. In addition, users must be able to activate the artificial intelligence behind every voice recognition system using a command – or they must at least be able to determine precisely when it is listening. The voice system must also recognise accents and dialects: “The time for voice-controlled systems is now,” declares Philipp Egolf, who is responsible for the voice recognition project at Swisscom. “That is why it is increasingly important that people also be able to communicate with such systems using their natural language.”


Voice recognition systems that understand the natural language of their users are already available today. Take the transport sector, for example: in the new SBB app, you can already search for routes using a dialect voice command; laborious keystrokes are a thing of the past. Car manufacturers have also recognised that voice recognition solutions within the DACH area must understand more than just standard German. In response to this, Mercedes-Benz recently unveiled its new A-Class. It includes “MBUX”, a multimedia system that enables you to input destinations, make calls, and compose or play messages using voice commands. As the driver no longer needs to take their hands off the wheel or their eyes off the road, MBUX is making a huge contribution to improving safety while driving. In addition, the voice control is constantly learning and should, in time, also react to commands issued in dialect.

Optimised customer service

At present, Swisscom’s AI team is therefore working extremely hard on dialect voice systems and solutions using voice biometrics. Swisscom is placing its main focus on Interactive Voice Response, or IVR for short – the voice dialogue system that is primarily used for hotlines. Here, the customer’s first contact with the company generally takes place via the telephone keypad. A laborious undertaking: only those who manage to navigate their way through the cumbersome “please press the xy key” labyrinth get to speak to an expert who can provide information regarding the problem. If they get through to the correct call handler, that is. If not, frustration quickly builds, both for the customer, who is eagerly awaiting help, and for the employee, who can do nothing but forward the call.

Thanks to voice recognition, customers are now able to ask their questions directly using their voice. In an ideal world, callers would immediately receive a response from the AI. If it does not know the answer, the AI is still able to recognise key words within the description of the problem – and at the same time forward the caller to the correct call handler, who will provide the customer with expert advice. Thanks to automated back-translations, it would even be possible to provide cross-language information without any problems. In this way, speech recognition guarantees straightforward and, above all, efficient customer service.
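The forwarding step described above can be illustrated with a minimal sketch. It assumes the caller's speech has already been transcribed to text, and it simply looks for known key words to pick a call-handler queue. The key words, queue names and the function `route_call` are hypothetical and chosen for illustration only; the article does not describe how Swisscom's system actually routes calls.

```python
# Hypothetical key-word table: the real categories are not given in the article.
ROUTES = {
    "invoice": "billing",
    "bill": "billing",
    "router": "internet_support",
    "wifi": "internet_support",
    "sim": "mobile_support",
}

# Fallback queue when no key word is recognised (also an assumption).
DEFAULT_QUEUE = "general"


def route_call(transcript: str) -> str:
    """Return the queue for the first known key word found in the transcript."""
    for word in transcript.lower().split():
        # Strip common punctuation so "invoice," still matches "invoice".
        cleaned = word.strip(".,!?")
        if cleaned in ROUTES:
            return ROUTES[cleaned]
    return DEFAULT_QUEUE


if __name__ == "__main__":
    print(route_call("My router keeps dropping the connection."))
```

A production system would work with far richer language understanding than a fixed word list, but the principle is the same: the description of the problem decides which expert receives the call.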

Around 3,000 hours of speech required

However, the route to perfect voice recognition is certainly not a simple one. This is because the learning process for the AI software is very time consuming: Swisscom’s speech recognition solution identifies the respective standard German counterparts for Swiss German expressions. The developer then checks the translations produced and provides the system with feedback as to whether or not they were correct. The algorithm is constantly learning from this feedback. Over time, the system learns to also handle the various different dialects. Swisscom is starting with the dialects spoken in Zurich and Bern, which are the most common. Taking this as a basis, Swisscom is also working on the less widely spoken dialects.
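The feedback loop described above can be sketched in a strongly simplified form. The toy class below stores candidate standard German counterparts for a Swiss German expression and raises or lowers a score each time a developer marks a proposal as correct or incorrect. The class `DialectLexicon` and its methods are hypothetical; the actual system is trained on speech data rather than on a word table, so this is an illustration of the principle and not of Swisscom's algorithm.

```python
from collections import defaultdict


class DialectLexicon:
    """Toy store of standard German candidates for dialect expressions."""

    def __init__(self):
        # dialect expression -> {standard German candidate: feedback score}
        self.scores = defaultdict(lambda: defaultdict(int))

    def feedback(self, dialect_expr, candidate, correct):
        """Record developer feedback: +1 if the candidate was correct, -1 if not."""
        self.scores[dialect_expr][candidate] += 1 if correct else -1

    def propose(self, dialect_expr):
        """Return the best-scoring candidate, or None if the expression is unknown."""
        candidates = self.scores.get(dialect_expr)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)


if __name__ == "__main__":
    lexicon = DialectLexicon()
    lexicon.feedback("Taschepfunzle", "Taschenlampe", correct=True)
    lexicon.feedback("Taschepfunzle", "Tasche", correct=False)
    print(lexicon.propose("Taschepfunzle"))
```

Each round of feedback shifts the scores, so the proposal for a given expression improves as more checked examples arrive. The same idea extends to further dialects: new expressions are added alongside the existing ones.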

For the development phase, Swisscom has teamed up with researchers from IDIAP – the independent research institute for artificial intelligence in Martigny, Switzerland. This institute specialises in the development of voice processing systems and is involved in the technical implementation of Swiss German voice recognition. However, expertise alone is not enough for this to work; it also requires data. And a vast amount of data at that: “For our open-domain model – a system that, like Siri or Alexa, can understand entire sentences – around 3,000 hours of speech need to be transcribed and processed,” explains Egolf.

For that reason, the AI is also being trained using data from within our own company: Swisscom employees are providing voice samples, which can be used to improve the system. Initial tests, such as the “Heidi and Peter” challenge, during which Swisscom employees were able to provide their voice samples, have gone well. However, this is just the beginning: “We are perfecting the system by testing the initial prototypes using even more data,” explains Egolf. This means that the solution will soon have mastered the many dialects of Swiss German – regardless of whether you call a torch a “Taschepfunzle”, a “Taschelampe” or a “Saggladäärne”.

One-Stop-Shop for AI

The Swisscom competence centre for applied artificial intelligence offers companies everything they need for the quick, successful implementation of projects in all areas of artificial intelligence, from consulting to the right technology to integration.
