A second digital transformation is taking place, one even deeper and more relevant to our lives than the one we have experienced in recent decades. This transformation is not only concerned with creating digital services, but also with turning everything we know into digital replicas (as in digital twins, the metaverse and the IoT (Internet of Things)) and, especially, with using Big Data analytics to analyze all kinds of data that help us improve our businesses and our lives. The great catalyst for this digital transformation has been the pandemic, which has forced not only businesses but also people to catch up with the latest digital technologies. According to a recent survey, 67% of respondents acknowledged that they had accelerated their digital transformation because of the pandemic, and 63% had increased their digital budget [7]. This acceleration has occurred in healthcare (AI-driven advanced reporting, electronic medical records, pandemic predictions, etc.) but also in many other sectors. The use of advanced analytics to understand the latest trends driven by the pandemic and remote work has gained major importance, and businesses have adapted their digital services and digital strategies to this new reality. For these reasons, the digital market and Big Data analytics are expected to gain even greater momentum in the coming years.
It was not that long ago that Big Data and AI technologies became popular among businesses. All companies wanted to enter this new era of intelligent insights, so they began to accumulate all kinds of traditional business data, but also device data, log files, text files, documents, images, etc., in one place, in the hope that these new technologies would extract business-relevant insights from this enormous amount of data with little effort. This happened partly because of companies' lack of maturity with Big Data technologies at the time, and partly because they did not have a well-defined strategy for the analysis.
The technology companies adopted to store this enormous amount of data of all kinds ended up being the Data Lake. A data lake can store any type of data, whether structured or unstructured, in its raw format. It achieves this by separating the data from the schema that defines it (schema-on-read) [6]. Traditionally, business data is stored in structured systems with a schema imposed when the data is written (schema-on-write). Data lakes, by contrast, store any type of data in raw format, replicating data from various sources so it can later be pre-processed, aggregated, combined, and interpreted.
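As a minimal sketch of the schema-on-read idea, the PySpark snippet below reads raw JSON files that were landed in the lake without any enforced schema and only applies a schema at read time. The paths and column names are purely illustrative assumptions, not taken from a real project.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The lake simply stores the raw files; no schema was enforced at write time.
raw_path = "/data/lake/raw/orders/"

# The schema is declared by the consumer, at read time (schema-on-read).
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("created_at", TimestampType()),
])

# Different consumers can read the same raw files with different schemas.
orders = spark.read.schema(order_schema).json(raw_path)
orders.show()
```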
A properly designed data lake should consist of three main areas: bronze (all kinds of data cloned in raw format), silver (data refined: pre-processed, cleaned and filtered) and gold (data combined and aggregated for business consumption) [8]. Additional areas can be added to separate other processes specific to the type of business and its needs.

Companies are beginning to realize the issues that arise with an architecture based only on a Data Lake, as they experience a series of problems and challenges when they want to analyze the data or use it in advanced reports. A data lake is not designed to support transactions or metadata, and it requires a set of additional skills to run, manage and govern. Data lakes do not handle corrupted, incomplete or low-quality data well. They are not designed to combine batch and streaming processing, and they do not account for different versions of the data or for schema changes; the latter can render the data completely useless. In addition, some companies decided to regularly schedule full copies of their data sources, consuming even more resources to store and process them.
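To make the bronze, silver and gold areas described above more concrete, the following PySpark sketch moves a small data set through the three layers. The paths, column names and filter logic are placeholder assumptions; a real pipeline would also handle incremental loads, data quality checks and scheduling.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: raw data cloned from the source in its original format.
bronze = spark.read.json("/data/lake/bronze/orders/")

# Silver: refined data - deduplicated, filtered and typed.
silver = (
    bronze
    .dropDuplicates(["order_id"])
    .filter(F.col("amount").isNotNull() & (F.col("amount") > 0))
    .withColumn("created_at", F.to_timestamp("created_at"))
)
silver.write.mode("overwrite").parquet("/data/lake/silver/orders/")

# Gold: aggregated, business-ready data for reporting and analytics.
gold = (
    silver
    .groupBy(F.to_date("created_at").alias("order_date"))
    .agg(F.sum("amount").alias("daily_revenue"),
         F.count("order_id").alias("order_count"))
)
gold.write.mode("overwrite").parquet("/data/lake/gold/daily_revenue/")
```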
The current reality is that, for many companies, the data lake has turned into a Data Swamp [11]: a place where all types of data coexist without users knowing what is stored, what its quality is, or whether it is aligned with the content of the original sources. All of this makes many Data Lakes close to unusable. Recording data without an onboarding process or a view of its potential use makes it even more challenging. From an AI/ML point of view, these Data Lakes, when used to produce advanced models, become a source of Garbage In/Garbage Out, as it is known in the jargon. Besides, companies have realized that data grows in these systems at a higher rate than their computing systems can analyze it.
The maturity of companies in Big Data analytics has improved remarkably in recent years, and they have understood that these Data Lake architectures do not fully meet the needs of advanced Big Data analytics. That is why many companies have started upgrading their Data Lakes by adding a Delta Lake layer to their systems, an approach that has gained broad acceptance as an alternative to replacing the platform altogether.
Delta Lake technology is used by over 7,000 organizations, processing exabytes of data per day. Delta Lake 2.0 from Databricks was fully open-sourced this year and includes many features that make it ready for Big Data analytics [9]. A Delta Lake in its basic form is a data management and transactional storage layer that extends a Data Lake to provide reliability, quality, consistency, and improved performance [10]. The core technology is based on Apache Parquet files plus an additional log of all of the table's metadata.
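As a minimal sketch of what this layer looks like in practice, the snippet below writes and reads a small table in Delta format with PySpark: the data files are plain Parquet, and a _delta_log folder holds the transaction log that records every change. It assumes the open-source delta-spark package is installed; the table path is purely illustrative.

```python
from pyspark.sql import SparkSession

# Spark session configured with the Delta Lake extensions (requires delta-spark).
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(0, 5).withColumnRenamed("id", "order_id")

# Writing in Delta format stores Parquet data files plus a _delta_log folder
# with JSON commit files that track every change made to the table.
df.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# Reads always see a consistent snapshot of the table.
spark.read.format("delta").load("/tmp/delta/orders").show()
```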
This technology is continuously evolving, but its current key features include ACID transactions, scalable metadata handling, time travel (data versioning), schema enforcement and schema evolution, unified batch and streaming processing, full DML support (updates, deletes and merges), and an audit history of all changes.
These and many other features make Data Lakes fully functional and ready for Big Data analytics. In its latest versions, Delta Lake also enables building a Lakehouse architecture with compute engines that include Spark and others. The Lakehouse architecture unifies advanced analytics and Data Warehouse (DWH) use cases by combining the best elements of Delta Lakes and DWHs: the reliability, strong governance and performance of Data Warehouses on the one hand, and the flexibility and Big Data analytics capabilities of Delta Lakes on the other [9].
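The sketch below illustrates two of the features mentioned above, time travel and upserts via MERGE, using the open-source delta-spark API on the illustrative orders table from the previous snippet. The table path, column names and version number are assumptions for the example only.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-features-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Time travel: query an earlier version of the table by its version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/orders")
v0.show()

# Upserts with MERGE: update matching records and insert new ones atomically.
updates = spark.createDataFrame([(3,), (99,)], ["order_id"])
target = DeltaTable.forPath(spark, "/tmp/delta/orders")
(
    target.alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```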
We recently had the experience of supporting a customer that had been running a Data Lake for several years. The Data Lake did not benefit from a Delta Lake architecture and was beginning to suffer from most of the problems mentioned above. Neither the business nor the analytics departments had any certainty about the accuracy of the data or whether it had been recently updated. In addition, their predictive models were increasingly difficult to maintain, and their performance was degrading.
After an in-depth study of their current architecture and of how the data was used by the stakeholders, we were able to propose a series of solutions that addressed all the customer's requirements to fully upgrade the Data Lake system.
Swisscom Data & Analytics helps business customers with advisory, design, integration, and maintenance of analytical information systems such as data lakes, data warehouses, dashboards, reporting and ML/AI solutions based on selected technology from Microsoft, AWS, SAP, open source and more. More than 50 engaged data & analytics experts support our clients in different industries on a day-to-day basis in order to make them true data-driven businesses.
Sergio Jimenez is a Senior Data & Analytics Consultant at Swisscom specialized in Advanced Analytics. Since joining Swisscom in 2016, Sergio has worked on numerous projects for multiple customers ranging from Business Intelligence to AI/ML. He has successfully built innovative solutions using the latest technologies.
[1] Big Data Analytics. IBM. Accessed Sep 2022. https://www.ibm.com/analytics/big-data-analytics
[2] Artificial Intelligence. IBM. Accessed Sep 2022. https://www.ibm.com/design/ai/basics/ai/
[3] Machine learning. IBM. Accessed Sep 2022. https://www.ibm.com/design/ai/basics/ml
[4] What is data lake. Microsoft. Accessed Sep 2022. https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-a-data-lake/
[5] Introduction to data lakes. Databricks. Accessed Sep 2022. https://www.databricks.com/discover/data-lakes/introduction
[6] How Schema On Read vs. Schema On Write Started It All. Dell. Aug 2017. https://www.dell.com/en-us/blog/schema-read-vs-schema-write-started/
[7] Big Data Analytics Market Size, Share & COVID-19 Impact Analysis 2022-2029. Fortune Business Insights. July 2022. https://www.fortunebusinessinsights.com/big-data-analytics-market-106179
[8] Medallion Architecture. Databricks. Accessed Sep 2022. https://www.databricks.com/glossary/medallion-architecture
[9] Open Sourcing All of Delta Lake. Databricks. June 2022. https://www.databricks.com/blog/2022/06/30/open-sourcing-all-of-delta-lake.html
[10] Realizing a Data Mesh: Delta Lake and the Lakehouse architecture. Deloitte. Accessed Sep 2022. https://www2.deloitte.com/nl/nl/pages/data-analytics/articles/realizing-a-data-mesh.html
[11] Data lakes and data swamps. IBM. March 2018. https://developer.ibm.com/articles/ba-data-becomes-knowledge-2/
Sergio Jimenez-Otero
Senior Data & Analytics Consultant