Human and machine-based data labelling

Data Labeling

Who does it better: a human or a machine?

Many companies are sitting on a mountain of uncategorised data. Yet “data labelling” is essential for artificial intelligence to function. So who produces better-quality data: the human or the machine?

Text: Adrienne Fichter, images: © Keystone

Digitalisation is rapidly changing the world we live in, and the prophets of doom never get tired of warning us about it: robots will replace manual workers, chatbots and virtual assistants will become the customer advisors of tomorrow, autonomous cars will present a new threat to the livelihoods of taxi drivers. And yet the world doesn’t actually function in such a straightforward, black-and-white way. This is because data-driven business models also create new forms of employment: for example, sorting and categorising unstructured data.

Data is accessible to AI when tagged correctly.

The new digital industrial worker

Much of today’s data simply cannot be read by artificial intelligence. Market researchers assume that only 20 percent of the business data in companies can be sorted by machines. The remaining 80 percent is unstructured, and cannot, as a result, be processed using automation. This means that companies are sitting on a wealth of digital data that they are unable to leverage. Examples of this include documents saved as different file types, e-mail histories or scanned correspondence. This immense body of text, as well as audio and video material, all needs to be translated into metadata. This is the only way for artificial intelligence to identify, for example, whether an image shows a horse or a cow, which terms are being used in an audio recording, which topic is being addressed in a newspaper article and which emotion is being expressed in a tweet. Metadata is therefore indispensable for adaptive software.
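To make the idea of metadata concrete, a labelled record might look something like the sketch below. The class name `LabelledItem`, its fields and the sample identifiers are hypothetical and chosen purely for illustration; they are not taken from any particular labelling tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledItem:
    """One piece of raw content plus the metadata a person attached to it."""
    item_id: str      # hypothetical identifier of the file, tweet or recording
    media_type: str   # e.g. "image", "audio", "text"
    label: str        # the tag assigned during labelling, e.g. "horse"


def filter_by_label(items, label):
    """Return only the items that carry the given label."""
    return [item for item in items if item.label == label]


# Illustrative sample data, not real records.
items = [
    LabelledItem("img-001", "image", "horse"),
    LabelledItem("img-002", "image", "cow"),
    LabelledItem("tweet-001", "text", "joy"),
]

horses = filter_by_label(items, "horse")
```

Once content carries labels like these, software can select, count or learn from it; without them, the same files are just unsorted bytes.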

This data will continue to be processed by humans. Manual categorisation can achieve good data quality, according to Marc Steffen, Head of Product Design at the Artificial Intelligence & Machine Learning Group at Swisscom. Guru Banavar, head of the Watson computer program at IBM, is promoting a radical idea in this context: employees who have lost their position due to the automation of their job should be retrained as digital blue-collar workers, i.e. data industry workers. This would allow them to perform “data labelling” in future and keep them in work.

“I teach machines to identify high heels on a photo”

The quality of data improves further when not just individual specialists but thousands of people identify content correctly and tag it. The first crowdsourcing concepts have already become established on the market. New providers such as CrowdFlower or Mighty AI offer community-based categorisation as a service, among other things. Members allocate topics to individual categories using a smartphone app, more or less “as a side job”, when they have time to spare or are out and about. One of the data workers explains her job in a promotional video by Mighty AI: “I teach machines to identify high heels on a photo.”
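One common way to combine the tags of many contributors is a simple majority vote. The sketch below is an assumption for illustration only, not a description of how CrowdFlower or Mighty AI actually work; the function name `majority_label` and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.5):
    """Return the label most contributors chose, or None if agreement is too low.

    votes: list of labels submitted by different people for the same item.
    min_agreement: share of votes the winning label must exceed (assumed value).
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) > min_agreement:
        return label
    return None  # no clear majority; the item could go back for review


# Three contributors tag the same photo.
result = majority_label(["high heel", "high heel", "sneaker"])
```

Items without a clear majority can be handed to further contributors or to an in-house specialist, which is one reason larger crowds tend to produce more reliable labels.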

However, data categorisation cannot always be left to outsiders. “Specialist knowledge is sometimes required for tagging the data,” as Steffen explains. “And some data is simply too sensitive to have external staff work on it.” Nonetheless, data labelling is highly promising, even with in-house employees, as long as the right incentives are offered. And this doesn’t necessarily have to be the pay: “It can be motivation enough for employees to be doing something good for others via data labelling,” says Steffen. One example is barrier-free access: when structured data can be used to teach artificial intelligence to describe the surroundings, blind people can benefit. So can the hearing-impaired, when the spoken word is converted into text in real time.

The work isn’t necessarily repetitive. The key to this is “gamification”: developers are increasingly pursuing a playful approach with their labelling tools, which keeps the work interesting for the user. Variety is also very important: the same tool can be used for different task areas, so users are given a range of tasks to complete, from tagging to voice recognition to reading texts aloud.

In the meantime, software has emerged that helps with the data cleansing process. However, these “mining tools” still lag behind humans: people are more familiar with the specific context, Marc Steffen is convinced. The interplay between human and machine will be decisive. When the pay is good, a quality of data can be achieved that even machines are unable to provide, as a critical web developer writes in a company blog published by the data labelling provider Explosion AI. Richard Socher, data scientist at Salesforce, also gives preference to humans when in doubt. In a tweet, he writes that you shouldn’t waste too much time analysing machine-learning problems, but rather work on ensuring the data is clean, by training a human for this task.


For Salesforce data scientist Richard Socher, it is not machine learning that is most important, but rather clean data storage.

Data Labeling with Swisscom

The Swisscom Competence Centre for Applied Artificial Intelligence develops data labelling tools, among other things. Swisscom provides advice to customers on potential AI applications and on project procedure as a full service. Furthermore, customer data is evaluated in order to develop a suitable solution and integrate it into the respective system – including the labelling tool customised especially for the customers. This allows the user to categorise data and train the AI application.
