Big Data Analytics: Upgrading a Data Lake

Big Data analytics is enjoying a second life. The market is growing rapidly, and with it the technologies for storing and processing Big Data efficiently. The big data analytics market was valued at $240.56 billion in 2021 and is expected to grow from $271.83 billion in 2022 to $655.53 billion by 2029 [7]. Big data analytics examines vast amounts of structured and unstructured data to extract insights that are highly valuable to the business, searching for correlations and underlying patterns that are not obvious to humans but can be revealed by AI (Artificial Intelligence), machine learning technologies and distributed computing systems.

A second digital transformation is under way, one that reaches even deeper into our lives than anything we have experienced in recent decades. It is not only about creating digital services, but about turning everything we know into digital replicas (digital twins, the metaverse, IoT (Internet of Things)). Big data analytics in particular helps to analyse all of this data so that we, our businesses and our lives can improve.

The big catalyst for this digital transformation has been the pandemic, which has forced not only businesses but also most people to keep up with the latest digital technologies. According to a recent survey, 67% of respondents said they had accelerated their digital transformation and 63% had increased their digital budgets because of the pandemic [7]. This acceleration is visible in healthcare (AI-driven reporting, electronic medical records, pandemic predictions, etc.) and in many other sectors. Advanced analytics has become increasingly important for understanding the trends triggered by the pandemic and remote working, and companies have adapted their digital services and digital strategies to this new reality. The digital market and big data analytics are therefore expected to gain even greater momentum in the coming years.

It was not so long ago that big data and AI technologies became popular with companies. Everyone wanted to enter this new era of intelligent insights, so they started collecting all kinds of data in one place: traditional business data, but also device data, logs, text files, documents, images and more, in the hope that the new technologies could extract business-relevant insights from all of it with little effort. This happened either because companies lacked maturity in big data technologies or because they had no clearly defined big data strategy.

Figure: Data Lake

The technology most companies adopted to store these huge amounts of data of all kinds was the data lake. A data lake can store any type of data, structured or unstructured, in its raw format. It achieves this by separating the data from the schema that describes it (schema-on-read) [6]. Traditionally, business data is stored in structured systems whose schema is fixed when the data is captured (schema-on-write). A data lake, by contrast, replicates data from different sources in raw format, to be pre-processed, aggregated, combined and interpreted later.
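
As a rough illustration of schema-on-read with PySpark (the file paths, column names and sensor example below are assumptions for illustration, not taken from the project described here): the raw files carry no enforced schema, and each consumer decides at read time how to interpret them.

    # Minimal schema-on-read sketch with PySpark; paths and columns are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

    # The raw zone just holds files as they arrived; no schema was enforced on write.
    raw_df = spark.read.json("/datalake/raw/sensor_events/")  # schema inferred at read time

    # The same files can be read again later with a different, explicit schema.
    explicit_schema = StructType([
        StructField("device_id", StringType()),
        StructField("temperature", DoubleType()),
        StructField("event_time", TimestampType()),
    ])
    typed_df = spark.read.schema(explicit_schema).json("/datalake/raw/sensor_events/")
    typed_df.printSchema()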

A properly designed data lake consists of three main zones: Bronze (all types of data cloned in raw format), Silver (refined data: pre-processed, cleansed and filtered) and Gold (data combined and aggregated for business use) [8]. Additional zones can be added to separate processes specific to the business and its requirements. Even so, companies are starting to realise that an architecture based on a plain data lake alone brings a number of problems and challenges as soon as they want to analyse the data or use it in advanced reports.
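
A minimal PySpark sketch of such a Bronze/Silver/Gold flow might look like the following; the paths, column names and business rules are hypothetical and stand in for whatever the source systems actually deliver.

    # Hypothetical bronze -> silver -> gold flow with PySpark.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

    # Bronze: land the source data in raw form, without transformation.
    bronze = spark.read.json("/landing/orders/")
    bronze.write.mode("append").parquet("/lake/bronze/orders/")

    # Silver: cleanse, deduplicate and type the raw records.
    silver = (
        spark.read.parquet("/lake/bronze/orders/")
        .dropDuplicates(["order_id"])
        .filter(F.col("amount").isNotNull())
        .withColumn("order_date", F.to_date("order_ts"))
    )
    silver.write.mode("overwrite").parquet("/lake/silver/orders/")

    # Gold: aggregate into a business-ready view.
    gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
    gold.write.mode("overwrite").parquet("/lake/gold/daily_revenue/")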

A plain data lake is not designed to support transactions or manage metadata; doing so requires a range of additional skills to implement, manage and control. Data lakes do not handle corrupt, incomplete or low-quality data, nor are they designed to combine batch and streaming processing. They do not account for different versions of the data or for schema changes, and the latter can render the data completely unusable. In addition, some organisations schedule regular full copies of their data sources, consuming ever more resources to store and process them.
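
To make the schema-change problem concrete, here is a small example of schema drift on a plain Parquet-based lake (the table and columns are invented for illustration): the second append changes the schema, and readers may silently lose the new column unless they explicitly ask for schema merging.

    # Sketch of schema drift on a plain (non-Delta) data lake; paths and columns are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-drift-demo").getOrCreate()

    # Day 1: the source delivers two columns.
    spark.createDataFrame([(1, "open")], ["ticket_id", "status"]) \
        .write.mode("append").parquet("/lake/bronze/tickets/")

    # Day 2: the source adds a column; plain Parquet happily appends it.
    spark.createDataFrame([(2, "open", "high")], ["ticket_id", "status", "priority"]) \
        .write.mode("append").parquet("/lake/bronze/tickets/")

    # Readers now see an inconsistent table: the new column can be silently missing
    # unless schema merging is requested explicitly at read time.
    spark.read.parquet("/lake/bronze/tickets/").printSchema()
    spark.read.option("mergeSchema", "true").parquet("/lake/bronze/tickets/").printSchema()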

The reality today is that for many organisations the data lake has become a data swamp [11]: a place where all kinds of data coexist without users knowing what is stored or whether its quality matches the original sources. This makes the majority of data lakes almost unusable, and ingesting data without an onboarding process or a view of its potential use makes things even harder. From an AI/ML perspective, models built on such data lakes become a case of the proverbial garbage in, garbage out. On top of that, companies have realised that the data in these systems is growing faster than their compute systems can analyse it.

Figure: Databricks

Companies' maturity in understanding big data analytics has improved significantly in recent years. They have understood that these data lake architectures do not fully meet the requirements of advanced big data analytics. Many companies have therefore started to upgrade their data lakes by adding a Delta Lake layer to their systems, an upgrade path that is gaining broad acceptance.

Delta Lake technology is used by over 7,000 companies and processes exabytes of data every day. Delta Lake 2.0 from Databricks was fully open-sourced this year and includes many features that make it ready for big data analytics [9]. In its basic form, a Delta Lake is a data management and transactional storage layer that extends a data lake to provide reliability, quality, consistency and improved performance [10]. The core technology is based on Apache Parquet files plus an additional transaction protocol.
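
A minimal sketch of what adding this layer can look like with the open-source delta-spark package (the path is an assumption; on Databricks the session configuration below is not needed): an existing Parquet directory is converted in place, and Delta only adds its transaction log next to the files.

    # Sketch: upgrade an existing Parquet directory to a Delta table with delta-spark.
    import pyspark
    from delta import configure_spark_with_delta_pip
    from delta.tables import DeltaTable

    builder = (
        pyspark.sql.SparkSession.builder.appName("delta-upgrade-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Convert in place: the data files stay Parquet, Delta adds its transaction log.
    DeltaTable.convertToDelta(spark, "parquet.`/lake/silver/orders`")

    # From here on the table supports ACID appends, updates and time travel.
    df = spark.read.format("delta").load("/lake/silver/orders")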

This technology is constantly evolving, but the current key features are:

  • Delta tables with ACID transactions
  • Scalable metadata handling
  • Unified streaming and batch processing on a single table
  • Automatic data versioning (time travel)
  • Schema evolution and enforcement
  • Database-style DML operations (update, delete, merge)

Along with many other functions, these features make data lakes fully functional and ready for big data analytics. In its latest versions, Delta Lake also enables a Lakehouse architecture with Spark and other compute engines. The Lakehouse architecture unifies advanced analytics and data warehouse (DWH) use cases by combining the best of both worlds: the reliability, governance and performance of a data warehouse with the flexibility and big data analytics capabilities of a Delta Lake [9]. A short sketch of some of these features in practice follows below.
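
The following sketch, which reuses the Spark session configured in the previous example, illustrates a few of the features listed above: a DML-style MERGE, time travel and schema evolution on append. Table paths and columns are again illustrative assumptions.

    # Sketch of Delta features: MERGE upsert, time travel, schema evolution.
    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    orders = DeltaTable.forPath(spark, "/lake/silver/orders")
    updates = spark.read.format("delta").load("/lake/bronze/orders_increment")

    # DML-style MERGE: upsert the incremental batch in a single ACID transaction.
    (orders.alias("t")
     .merge(updates.alias("s"), "t.order_id = s.order_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

    # Time travel: query the table as it looked before the merge.
    previous = (spark.read.format("delta")
                .option("versionAsOf", 0)
                .load("/lake/silver/orders"))

    # Schema enforcement: an append with an unexpected column is rejected unless
    # schema evolution is requested explicitly via mergeSchema.
    updates.withColumn("unexpected", F.lit(1)) \
        .write.format("delta").mode("append") \
        .option("mergeSchema", "true") \
        .save("/lake/silver/orders")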

We recently supported a customer who had been running a data lake for several years. The data lake did not follow the Delta Lake architecture and had started to suffer from most of the issues described above. Neither the business nor the analytics department could be certain that the data was accurate or up to date, and it was becoming increasingly difficult to maintain the predictive models built on it, so their performance was deteriorating.

After an in-depth study of the existing architecture and of how stakeholders use the data, we were able to propose a number of solutions that met all of the client's requirements for optimising the data lake system.

About Swisscom Data & Analytics

Swisscom Data & Analytics supports business customers in the consulting, design, integration and maintenance of analytical information systems such as data lakes, data warehouses, dashboards, reporting and ML/AI solutions based on selected technologies from Microsoft, AWS, SAP, Open Source and more. More than 50 dedicated data and analytics experts support our customers in various industries on a daily basis to turn them into true data-driven organisations.

About the author

Sergio Jimenez is a Senior Data & Analytics Consultant at Swisscom, specialising in Advanced Analytics. Since joining Swisscom in 2016, Sergio has worked on numerous projects for several clients ranging from Business Intelligence to AI/ML. He has successfully developed innovative solutions using the latest technologies.

References:

[1] Big Data Analytics. IBM. Accessed Sep 2022. https://www.ibm.com/analytics/big-data-analytics

[2] Artificial Intelligence. IBM. Accessed Sep 2022. https://www.ibm.com/design/ai/basics/ai/

[3] Machine Learning. IBM. Accessed Sep 2022. https://www.ibm.com/design/ai/basics/ml

[4] What is a Data Lake. Microsoft. Accessed Sep 2022. https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-a-data-lake/

[5] Introduction to Data Lakes. Databricks. Accessed Sep 2022. https://www.databricks.com/discover/data-lakes/introduction

[6] How Schema On Read vs. Schema On Write Started It All. Dell. Aug 2017. https://www.dell.com/en-us/blog/schema-read-vs-schema-write-started/

[7] Big Data Analytics Market Size, Share & COVID-19 Impact Analysis 2022-2029. Fortune Business Insights. July 2022. https://www.fortunebusinessinsights.com/big-data-analytics-market-106179

[8] Medallion Architecture. Databricks. Accessed Sep 2022. https://www.databricks.com/glossary/medallion-architecture

[9] Open Sourcing All of Delta Lake. Databricks. June 2022. https://www.databricks.com/blog/2022/06/30/open-sourcing-all-of-delta-lake.html

[10] Realizing a Data Mesh: Delta Lake and the Lakehouse Architecture. Deloitte. Accessed Sep 2022. https://www2.deloitte.com/nl/nl/pages/data-analytics/articles/realizing-a-data-mesh.html

[11] Data Lakes and Data Swamps. IBM. March 2018. https://developer.ibm.com/articles/ba-data-becomes-knowledge-2/

Sergio Jimenez-Otero

Senior Data & Analytics Consultant
