Managing data in the lakehouse: An introduction to the data lakehouse
Daniel Amner
Organisations have been performing data analysis for many years. As an increasing number of them undergo large-scale digital transformations, the volume and speed of the data they generate have increased dramatically. Whilst the data warehouse is considered an essential piece of any business strategy, organisations are increasingly recognising a number of its associated challenges, including:
- Increased cost: It is expensive to store large amounts of data within a data warehouse.
- Performance issues: Data warehouses are unable to handle semi-structured and unstructured data natively.
- Longer change times: Inflexible structure can lead to long change times.
Whilst it appears to be the next best solution, the data lake carries its own challenges, including:
- Non-ACID compliance: Updates and deletes are complex operations.
- Lack of data quality: Enforcing data quality is difficult.
- Inconsistent reads: Non-isolated writes can result in incomplete reads.
This gives rise to a third approach which promises the best of both worlds: the data lakehouse. The data lakehouse is emerging as the successor to existing solutions as it combines the flexibility of data lakes with the structure and control of data warehouses.
First generation data lakehouses
The first generation of data lakehouses allowed data warehouses to query data stored externally. Examples include Redshift Spectrum, external tables for Synapse SQL, or BigQuery cloud storage external data source, which allow data stored in an object store to be queried via the warehouse using Structured Query Language (SQL).
However, this initial approach still required the running of an expensive data warehouse, and the challenges associated with the data lake remained. A second generation is now available that aims to solve all of these challenges at once.
Second generation data lakehouses
The second generation of data lakehouses aims to remove the need for a dedicated data warehouse and provides the structure and controls that are the cornerstone of robust data modelling and governance. Some key features of a data lakehouse include:
- ACID transactions: Modern data platforms are required to support multiple reads and writes simultaneously. The data lakehouse ensures consistency by enforcing ACID transactions, allowing users to submit queries whilst data pipelines write new data.
- End-to-end streaming: The data lakehouse supports near real-time ingestion of streaming data. As data continues to be written, dashboards can be updated rapidly whilst leveraging the complete historical data record to display a comprehensive view.
- Decoupling of storage and compute: Data is stored separately from compute resources. Using cloud object stores allows near-infinite storage and the ability to use commodity hardware, driving down costs and increasing scalability as a result.
- Schema enforcement and evolution: The data lakehouse allows the creation of traditional data warehouse schema models, such as star or snowflake schemas. Enforcing these schemas promotes data quality and minimises pollution of data stores. Schema evolution allows your data to change whilst protecting the integrity of existing data.
- Business Intelligence (BI) support: A typical BI data pattern uses a data lake for rapid ingestion and transformation before the data is loaded into a data warehouse for use by BI tools. By allowing the BI tools to query the data directly, the data lakehouse removes the requirement for two copies of the data. Information becomes available sooner because the load into the data warehouse is no longer needed.
- Openness: Storing data in open formats such as Parquet or ORC avoids lock-in to a particular solution and allows for easy migration of your data. Using these open formats means that many tools can work natively with your data.
- Unstructured data: New data use-cases require data platforms to work with diverse data types that traditional approaches cannot handle. Images, videos, and audio data can be stored and accessed by the data lakehouse.
- Machine Learning governance: As an increasing number of organisations start to leverage Machine Learning (ML) at scale, it is crucial to have a repeatable build and training process. Data lakehouses allow for versioning of data, and data scientists can use this to repeat and compare model builds.
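To make the ACID and data versioning features above more concrete, here is a minimal sketch of one way they can be achieved: an append-only commit log that sits next to the data files. Everything in it (the ToyTable class, the _log directory, the file naming) is hypothetical and chosen for illustration; it uses only the Python standard library and a local directory in place of an object store, and it is not the implementation of any particular lakehouse library.

```python
import json
import os
import tempfile
import uuid


class ToyTable:
    """Hypothetical toy table: data files plus an append-only commit log."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Every committed version has exactly one numbered log entry.
        return sorted(
            int(name.split(".")[0])
            for name in os.listdir(self.log_dir)
            if name.endswith(".json")
        )

    def latest_version(self):
        versions = self._versions()
        return versions[-1] if versions else -1

    def commit(self, rows):
        # Step 1: write the data file. Readers cannot see it yet,
        # because no log entry refers to it.
        data_name = f"part-{uuid.uuid4().hex}.json"
        with open(os.path.join(self.root, data_name), "w") as f:
            json.dump(rows, f)
        # Step 2: publish the file by creating the next log entry.
        # Mode "x" raises FileExistsError if another writer has already
        # claimed this version number. Simplification: a real system
        # would rely on an atomic primitive of the storage layer here.
        version = self.latest_version() + 1
        log_path = os.path.join(self.log_dir, f"{version:05d}.json")
        with open(log_path, "x") as f:
            json.dump({"add": data_name}, f)
        return version

    def read(self, version=None):
        # Replay the log up to the requested version, so a reader only
        # ever sees data files that belong to a completed commit.
        if version is None:
            version = self.latest_version()
        rows = []
        for v in self._versions():
            if v > version:
                break
            with open(os.path.join(self.log_dir, f"{v:05d}.json")) as f:
                entry = json.load(f)
            with open(os.path.join(self.root, entry["add"])) as f:
                rows.extend(json.load(f))
        return rows


table = ToyTable(tempfile.mkdtemp())
first = table.commit([{"id": 1, "label": "cat"}])
second = table.commit([{"id": 2, "label": "dog"}])

print(table.read())               # every committed row
print(table.read(version=first))  # the table as it was after the first commit
```

Reading by version is what allows a data scientist to rebuild a model against exactly the data used in an earlier run, whilst pipelines continue to commit new data.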
These features provide a powerful alternative to existing technologies. However, as a relatively new technology, the creation of a data lakehouse requires engineering teams to create bespoke implementations.
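Of the features listed above, schema enforcement and evolution can be illustrated with a small in-memory sketch. The EnforcedTable class, the SchemaError exception, and the column names are all hypothetical, and a real lakehouse library would apply equivalent checks at write time against files in storage rather than a Python list.

```python
class SchemaError(ValueError):
    """Raised when a write or a schema change violates the table schema."""


class EnforcedTable:
    """Hypothetical in-memory table that validates every row on write."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self.rows = []

    def append(self, row):
        # Enforcement: reject columns that the schema does not know about.
        unknown = set(row) - set(self.schema)
        if unknown:
            raise SchemaError(f"unknown columns: {sorted(unknown)}")
        # Enforcement: reject values of the wrong type (None means missing).
        for column, expected in self.schema.items():
            value = row.get(column)
            if value is not None and not isinstance(value, expected):
                raise SchemaError(
                    f"column {column!r} expects {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        self.rows.append({column: row.get(column) for column in self.schema})

    def add_column(self, name, column_type):
        # Evolution: adding a column leaves existing rows intact;
        # they simply hold None for the new column.
        if name in self.schema:
            raise SchemaError(f"column {name!r} already exists")
        self.schema[name] = column_type
        for row in self.rows:
            row[name] = None


sales = EnforcedTable({"order_id": int, "amount": float})
sales.append({"order_id": 1, "amount": 9.99})

try:
    sales.append({"order_id": "two", "amount": 5.0})  # wrong type
except SchemaError as error:
    print("rejected:", error)

sales.add_column("country", str)
sales.append({"order_id": 2, "amount": 5.0, "country": "GB"})
print(sales.rows)
```

The rejected write never reaches the table, which is how enforcement keeps bad records out of the store, and the added column does not disturb rows written before the change.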
Implementing a data lakehouse
At present, there are three existing open-source libraries that allow for the creation of a data lakehouse:
Once the initial step of schema design is complete, integrating these libraries into Spark pipelines requires only minor changes. With data written to object storage or Hadoop Distributed File System (HDFS), organisations can utilise existing infrastructure without expensive additions. Alternatively, pipelines can use pre-configured Spark clusters offered by cloud providers or a Kubernetes cluster.
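As a rough sketch of how small such a change can be, the helper below wraps the write step of a PySpark pipeline so that the storage format is a single parameter. It assumes Delta Lake as the example library, which is an assumption made for illustration and not a recommendation from this article; the write_table function, the events_df name, and the bucket path are hypothetical. Passing "delta" only works on a Spark session that has the Delta Lake package installed and configured.

```python
def write_table(df, path, table_format="parquet", mode="append"):
    """Write a Spark DataFrame to `path` in the given format.

    `df` is expected to be a pyspark.sql.DataFrame. The function only
    uses the standard DataFrameWriter chain: format, mode, save.
    """
    df.write.format(table_format).mode(mode).save(path)


# Existing pipeline step, writing plain Parquet files (hypothetical names):
#   write_table(events_df, "s3a://example-bucket/events")
#
# The same step writing a lakehouse table instead, assuming Delta Lake
# is available on the cluster:
#   write_table(events_df, "s3a://example-bucket/events", table_format="delta")
```

The rest of the pipeline (reading sources, transformations, scheduling) stays as it was; only the format passed to the writer changes.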
It is important to note that these libraries by themselves do not provide security, auditing, or access control, nor do they provide data cataloguing and quality metrics. These additional components will need to be provided by integrating commercial offerings or built by engineering teams to deliver a full enterprise solution.
The second generation data lakehouse is built to scale efficiently. Using cloud object stores allows your data storage to grow almost without limit whilst remaining cost-effective. With queries usually performed on Spark clusters composed of small instance types, there is no need to provision expensive compute instances. In most cases, the clusters can be terminated when not required. If your usage patterns allow, it is also possible to minimise costs further by using spot instances or cluster autoscaling.
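The effect of terminating clusters and using spot instances can be worked through with simple arithmetic. The figures below (hourly rate, cluster size, hours of use, spot discount) are invented purely for illustration and are not real prices from any cloud provider; the monthly_compute_cost function is likewise hypothetical.

```python
def monthly_compute_cost(hourly_rate, nodes, hours_per_day, days=30, discount=0.0):
    """Estimate the monthly compute cost of a cluster.

    `discount` is the fraction taken off the hourly rate, for example
    to model spot pricing. All inputs are caller-supplied assumptions.
    """
    return hourly_rate * (1.0 - discount) * nodes * hours_per_day * days


# Illustrative inputs only: 4 small nodes at 0.20 per node-hour.
always_on = monthly_compute_cost(0.20, nodes=4, hours_per_day=24)
working_hours = monthly_compute_cost(0.20, nodes=4, hours_per_day=8)
working_hours_spot = monthly_compute_cost(0.20, nodes=4, hours_per_day=8, discount=0.6)

print(f"always on:             {always_on:.2f}")
print(f"terminated when idle:  {working_hours:.2f}")
print(f"terminated, spot rate: {working_hours_spot:.2f}")
```

With these made-up inputs, running the cluster for 8 hours a day costs a third of keeping it on permanently, and a spot discount reduces the bill further. Actual savings depend on your provider's pricing and on how tolerant your workloads are of interruption.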
Building a data lakehouse remains a new approach, and connectors and integrations with existing tools are still being built. Implementing a second generation lakehouse is not a turn-key operation, and organisations will need a dedicated data engineering team to build and run it.
In a nutshell
The data lakehouse provides an alternative architectural pattern to the existing data warehouse and data lake solutions. Built to solve the challenges organisations face in scaling their data and AI use-cases, it empowers data teams to move quickly whilst maintaining quality.
Whilst implementing a data lakehouse is only one part of building a successful data and analytics platform, it provides a stable foundation that will allow organisations to accelerate their digital and data transformations.
If you would like to learn more about the topics discussed in this blog, please get in touch with one of our experts.