The data lake house: the next generation of data warehouse uncovered
We investigate some of the recent announcements from Re:Invent and how they are influencing the architecture patterns of data warehouses in the cloud.
At this year’s Re:Invent, there was a considerable focus on data. Indeed, 30 of the 77 announcements were data, analytics or machine learning related.
Data continues to be a focus area for all large organisations, many of which have struggled to scale their operations in line with the growth of data. In the past, they have typically relied on expensive and proprietary on-premises data warehouse solutions to store and analyse their data. These products are not only eye-wateringly expensive, but also fail to measure up in the petabyte-scale world we now live in.
Enter the cloud. The promise of both elastic and scalable compute & storage ought to make this a much easier problem to solve. Indeed, many organisations have already made the jump to cloud-native data solutions, albeit with varying levels of success.
Here we explore the latest set of AWS announcements to see how a combination of the cloud and data can help solve organisations' biggest data challenges.
Data Lake + Data Warehouse = Lake House
A new pattern is emerging from those running data warehouse and data lake operations in AWS, dubbed the ‘lake house’. In practice, this means allowing S3 and Redshift to interact and share data in such a way that you get the advantages of each product.
An AWS lake house allows you to achieve the holy grail of:
Cheap & durable data storage
Independently scalable compute, capable of massively parallel processing (MPP)
Standard SQL transforms
Performant SQL querying regardless of concurrency
The above has only really become possible with the latest set of Re:Invent announcements.
Andy Jassy announced in his keynote that Redshift now supports offloading data to S3 in Parquet format, with the output partitioned by the values of one or more columns. This unlocks the use of Redshift as a transient SQL transform engine, with the ability to offload the results back to S3 in an analytics-optimised format for consumption by SageMaker (machine learning), Athena (ad-hoc querying) and EMR/Glue (ETL).
Previously, data could only be offloaded to text files. According to AWS, offloading to Parquet is up to two times faster and consumes up to six times less storage than the same data in text format.
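As a rough illustration of the offload described above, the following Python sketch assembles a Redshift UNLOAD statement that writes Parquet files partitioned by column. The function name, table, bucket, and IAM role are hypothetical placeholders, and the sketch only builds the SQL text; it does not connect to a cluster.

```python
def build_unload_statement(select_sql, s3_prefix, iam_role_arn, partition_columns):
    """Return a Redshift UNLOAD statement that writes Parquet files to S3.

    All argument values are supplied by the caller; nothing here is
    validated against a real cluster, bucket, or role.
    """
    # UNLOAD takes the query as a quoted string, so any single quotes
    # inside the query are doubled to keep the statement well-formed.
    escaped_query = select_sql.replace("'", "''")
    statement = (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET"
    )
    # Partitioning is optional: one S3 folder is written per distinct
    # combination of values in the listed columns.
    if partition_columns:
        statement += "\nPARTITION BY (" + ", ".join(partition_columns) + ")"
    return statement + ";"


if __name__ == "__main__":
    # Hypothetical table, bucket, and role, for illustration only.
    print(
        build_unload_statement(
            "SELECT * FROM sales WHERE region = 'EU'",
            "s3://example-bucket/curated/sales/",
            "arn:aws:iam::123456789012:role/ExampleUnloadRole",
            ["sale_date"],
        )
    )
```

The resulting Parquet files can then be read directly from S3 by the services listed above, without going back through Redshift.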
Although not new this year, Redshift Spectrum is a key part of the lake house architecture. It allows standard SQL queries to be submitted against data stored in S3 via the Redshift cluster. This is achieved by distributing the query across many Spectrum nodes to achieve highly performant MPP (massively parallel processing) that scales to petabytes. Spectrum is also fully managed and serverless, charging only for the data scanned by each query.
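To make the Spectrum pattern more concrete, here is a minimal Python sketch that builds the two SQL statements typically involved: one that exposes a data catalog database as an external schema in Redshift, and one that joins S3-backed data to a local Redshift table. The schema, database, table, column, and role names are all hypothetical, and the sketch assumes an external table called sales has already been defined in the catalog.

```python
def build_external_schema_statement(schema_name, catalog_database, iam_role_arn):
    """Return SQL that maps a data catalog database to a Redshift external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema_name}\n"
        "FROM DATA CATALOG\n"
        f"DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def build_lake_house_query(external_schema):
    """Return SQL joining an S3-backed external table to a local Redshift table.

    The 'sales' external table and the 'public.customers' local table are
    hypothetical and assumed to exist already.
    """
    return (
        "SELECT c.region, SUM(s.amount) AS total_sales\n"
        f"FROM {external_schema}.sales AS s\n"
        "JOIN public.customers AS c ON c.customer_id = s.customer_id\n"
        "GROUP BY c.region;"
    )


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    print(
        build_external_schema_statement(
            "lake", "example_lake_db",
            "arn:aws:iam::123456789012:role/ExampleSpectrumRole",
        )
    )
    print(build_lake_house_query("lake"))
```

The point of the second statement is that the analyst writes ordinary SQL; whether a table lives in S3 or on the cluster is only visible in the schema prefix.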
Redshift RA3 instances
The new RA3 instance type allows you to scale Redshift data storage and compute independently, meaning you pay only for the combination of storage and compute that your data and workload actually need. These instances are built on the AWS Nitro System, delivering near-bare-metal performance and supporting high-bandwidth network connectivity. Amazon are advertising a two-times performance improvement on the previous generation Redshift instances.
Redshift concurrency scaling
As of March 2019, Redshift can handle peaks in workload by automatically adding transient capacity to process concurrent queries and maintain consistent performance. This means the Redshift cluster can be sized for average workloads and not the high-water mark of peak requests. When enabled, the cluster will accumulate one hour of ‘concurrency scaling credits’ for every 24 hours that it runs. Amazon predict that 97% of customers will get the benefits of concurrency scaling without any additional charges.
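The credit arithmetic above can be sketched in a few lines of Python. This is a simplified model that uses only the accrual rate stated here (one credit hour per 24 hours of cluster runtime); it deliberately ignores any cap on accumulated credits and any per-second billing detail, and the function names are illustrative.

```python
def accrued_credit_hours(cluster_running_hours):
    """Free concurrency scaling hours earned: one per 24 hours the cluster runs."""
    return cluster_running_hours / 24.0


def billable_scaling_hours(cluster_running_hours, scaling_hours_used):
    """Concurrency scaling hours left to pay for once credits are applied.

    Simplified model: assumes every accrued credit is still available.
    """
    credits = accrued_credit_hours(cluster_running_hours)
    return max(0.0, scaling_hours_used - credits)


if __name__ == "__main__":
    month_hours = 30 * 24  # a cluster that runs continuously for 30 days
    print(accrued_credit_hours(month_hours))        # 30.0 credit hours
    print(billable_scaling_hours(month_hours, 12))  # 0.0, fully covered
    print(billable_scaling_hours(month_hours, 36))  # 6.0 hours billable
```

Under this model, a cluster that runs all month would need to spend more than an hour a day in concurrency scaling, on average, before incurring any charge.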
What does the lake house concept mean for my AWS data warehouse?
By virtue of its architecture, the lake house follows an ‘ELT’ paradigm (extract, load, transform). Data is loaded into S3 in its raw format, and is typically transformed into a derived format for further use only when the customer requests it. This enables you to onboard new data fast, without the constraint of teams building ETL pipelines. It also means that effort is not wasted transforming data feeds that will never be used.
In the lake house model, Redshift becomes the transformation engine of choice. This is welcomed by many organisations where SQL is the common language of the data analyst, and where legacy transforms are already written in SQL and can be reused.
Typically, there will be a long-lived Redshift cluster acting as the ‘traditional data warehouse’ to support the querying of data across Redshift and S3. Additionally, transient clusters will be spun up to support transformational workloads as required, writing their output back to S3 in Parquet format before being terminated. This gives data analysts access to large-scale compute on demand at a fraction of the cost of running similar infrastructure on-premises.
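The transient cluster lifecycle described above (spin up, transform, write back to S3, terminate) could be sketched as follows. This is a minimal outline, not a production implementation: it assumes a client object with the same create_cluster, get_waiter and delete_cluster methods as the boto3 Redshift client, and run_sql is a hypothetical callable, supplied by the caller, that executes one SQL statement against a named cluster.

```python
from contextlib import contextmanager


@contextmanager
def transient_cluster(redshift_client, cluster_id, **create_kwargs):
    """Create a cluster, yield its identifier, and always delete it afterwards."""
    redshift_client.create_cluster(ClusterIdentifier=cluster_id, **create_kwargs)
    try:
        # Block until the cluster is ready to accept queries.
        waiter = redshift_client.get_waiter("cluster_available")
        waiter.wait(ClusterIdentifier=cluster_id)
        yield cluster_id
    finally:
        # Results are expected to be in S3 already, so no final snapshot is kept.
        redshift_client.delete_cluster(
            ClusterIdentifier=cluster_id,
            SkipFinalClusterSnapshot=True,
        )


def run_transient_transform(redshift_client, cluster_id, run_sql, statements,
                            **create_kwargs):
    """Run SQL statements (transforms, then an offload to S3) on a short-lived cluster.

    run_sql(cluster_id, statement) is a hypothetical helper provided by the caller.
    """
    with transient_cluster(redshift_client, cluster_id, **create_kwargs) as cid:
        for statement in statements:
            run_sql(cid, statement)
```

With boto3 installed, the client would be obtained with boto3.client("redshift"), and create_kwargs would carry the node type, node count and credentials. Putting the deletion in a finally block means the cluster is terminated even if a transform fails, which matters when compute is billed for as long as it runs.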
Data becomes fluid, moving between S3 and Redshift depending on how frequently the data is accessed and which other AWS services might want to query the data. This flexibility allows companies to choose the optimum trade-off between cost and performance when storing their data.
In a nutshell
The executive summary is that moving to a lake house is not an easy journey, and the AWS cloud is by no means a ‘turn-key’ solution when it comes to data warehousing or data analytics.
However, AWS continues to strive towards making cloud native data warehousing/data analytics more accessible and easier to manage. Architectural patterns will continue to evolve as the product set develops, but the lake house will play a key part in the architecture of AWS data warehousing for some time to come.
There are still recognised pain points with data warehousing & analytics in AWS, not least S3 ‘eventual consistency’, which still presents challenges when loading data from S3 into Redshift.
One thing is for sure: these cloud offerings are increasingly attractive when compared to the data warehouse and data analytics solutions of yesterday.
Credera has great experience of delivering data warehouse & analytics solutions both on and off the cloud.