Building efficient and scalable data platforms in AWS

Archita Regan
It can be difficult to know where to start when building data platforms. In this article, we provide an overview of the key considerations to help guide the design of efficient and scalable implementations.
The growing success of data-driven companies like Amazon, Facebook, and Netflix has made it clear that the ability to generate and harness data effectively is a key differentiator. Data is no longer just a by-product of the IT system; it has become a primary asset to exploit. Many companies are keen to rearchitect their data platforms to enable rapid innovation, improve customer experience, and increase revenue, and there has been a recent proliferation of cloud technologies to support this paradigm shift. A data platform is a central repository for all of an organisation's data that handles the collection, cleansing, transformation, and application of data to generate business insights.
Before building a data platform, it is necessary to understand the specific problem that the organisation is looking to resolve. Using agile methodologies to capture user stories, features, and acceptance criteria is a great way to start. Begin by mapping out what needs to be done and which tasks are the priority, and note where there are inter-dependencies.
Understanding the data
To help you to select the right technologies, it is important to have a solid understanding of the data that you are looking to transform and how it exists within the business. A clear view of the data type, size, structure, relationships, and dependencies on other data sources is key. The quality of the data should also be assessed for accuracy, completeness, and consistency, and any duplication should be identified.
Establishing the architecture guardrails
Once you have an understanding of both the business requirements and the data, you should then develop a logical diagram of the architecture to depict the key elements and supporting functionality of the data platform. This will help to ensure that key stakeholders are aligned and that any gaps are highlighted earlier on in the process.
Before you select the right technologies, it is helpful to divide the architecture into key functional components such as ingestion, processing, storage, analytics, and visualisations.
- Ingestion: Ingestion is the process of moving data from the source into the data platform. The data can be in different formats and come from various sources: gathering this information will help to drive the technology selection process. AWS Storage Gateway, for example, can be used to integrate an AWS S3-based data lake with legacy on-premises platforms; AWS Snowball is useful for transferring huge amounts of data from on-premises storage to AWS; and Kafka or Kinesis can be used to ingest from real-time streams.
- Processing: Processing involves manipulating and transforming the data to produce meaningful information. AWS offers a number of tools for processing, including AWS Elastic Map Reduce (EMR) for big data processing across distributed compute, serverless AWS Glue for extract, transform, load (ETL) jobs, and Lambda or EC2 for simple transformation.
- Storage: The right storage option depends on the data characteristics and how it is to be consumed. AWS S3 is a suitable option for object storage as it is cost-effective and provides high levels of durability. Relational databases are often suited to transaction-oriented systems, whilst NoSQL databases such as DynamoDB are great for semi-structured data storage. For data that requires parallel analytical processing, you might consider a data warehousing solution such as Redshift.
Storing metadata to form a data catalog can help data analysts find and understand datasets that are relevant to specific business questions. Hive Metastore (HCatalog) and the AWS Glue Data Catalog contain metadata about data assets, such as data formats and table definitions. This can be accessed using analytics tools such as Amazon Athena, Amazon Redshift, Amazon Redshift Spectrum, and Amazon EMR.
- Analytics and Visualisation: Analytics involves bringing data together to create new business insights and discover trends. We can perform data analysis by slicing, joining, and aggregating the data, or by leveraging advanced techniques such as machine learning and natural language processing. AWS offers a multitude of AI services, including Amazon SageMaker for machine learning. Users can visualise data using tools like Amazon QuickSight, which provides machine learning powered business intelligence through interactive dashboards.
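As a concrete illustration of the ingestion step, the sketch below shapes raw events into batches that respect the Kinesis `PutRecords` limit of 500 records per call. The event shape and the `customer_id` partition-key field are hypothetical, and the actual `boto3` call is shown only as a comment so the sketch stays self-contained.

```python
import json

# Kinesis PutRecords accepts at most 500 records per call.
MAX_RECORDS_PER_PUT = 500

def to_kinesis_batches(events, partition_key_field="customer_id"):
    """Shape raw events into batches suitable for PutRecords.

    `events` is any iterable of dicts; `partition_key_field` is a
    hypothetical field used to spread records across shards.
    """
    records = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event[partition_key_field]),
        }
        for event in events
    ]
    return [
        records[i : i + MAX_RECORDS_PER_PUT]
        for i in range(0, len(records), MAX_RECORDS_PER_PUT)
    ]

# Each batch would then be sent with, for example:
#   boto3.client("kinesis").put_records(StreamName="...", Records=batch)
```

Batching on the client side like this keeps each API call within service limits and reduces the number of round trips to the stream.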
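On the storage side, one common data-lake idiom is to write objects under Hive-style partition prefixes so that tools such as Athena and Glue can prune the data scanned by a query. The dataset and file names below are placeholders for illustration.

```python
from datetime import datetime, timezone

def partitioned_key(dataset, event_time, filename):
    """Build an S3 object key using Hive-style partitioning
    (year=/month=/day=), which query engines can use to skip
    partitions that a query does not need to read.
    """
    return (
        f"{dataset}/"
        f"year={event_time.year:04d}/"
        f"month={event_time.month:02d}/"
        f"day={event_time.day:02d}/"
        f"{filename}"
    )

key = partitioned_key(
    "orders",
    datetime(2023, 4, 9, tzinfo=timezone.utc),
    "part-0001.parquet",
)
# e.g. orders/year=2023/month=04/day=09/part-0001.parquet
```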
The cloud offers a number of additional benefits when compared with the previous generation of platforms. To minimise cost and maximise flexibility, you can utilise cost-effective object storage, serverless technologies, transient clusters, and auto scaling groups, and decouple your compute and storage so that they can scale independently.
When creating a data platform, it is better to start small and iterate, rather than spending a large amount of time designing up-front. One of the benefits of using cloud technology is the flexibility that it provides to adapt to change. An iterative approach enables you to develop and demonstrate a minimum viable product to stakeholders and use the feedback to further drive the design. It also allows you to prototype various solutions to discover which technologies and practices best suit your use case.
Automation enables rapid innovation and supports standardisation of operations. Terraform is an open-source infrastructure-as-code tool that can be used to provision infrastructure in the cloud. When using AWS, CloudFormation templates can enable more rapid development. For GitHub projects, GitHub Actions can support continuous integration and automate software builds, tests, and deployments.
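To make the infrastructure-as-code idea concrete, the sketch below expresses a minimal CloudFormation template as a Python dictionary and serialises it to JSON: an encrypted, versioned S3 bucket for a data lake. The logical resource name is a placeholder, and this is a sketch rather than a production-ready template.

```python
import json

# A minimal CloudFormation template for an encrypted, versioned
# S3 bucket; "DataLakeBucket" is a placeholder logical name.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "DataLakeBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {"Status": "Enabled"},
                "BucketEncryption": {
                    "ServerSideEncryptionConfiguration": [
                        {
                            "ServerSideEncryptionByDefault": {
                                "SSEAlgorithm": "aws:kms"
                            }
                        }
                    ]
                },
            },
        }
    },
}

template_json = json.dumps(template, indent=2)
```

Keeping templates like this in version control means every environment can be recreated from source, reviewed in pull requests, and deployed through the same pipeline as application code.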
Security is key when building cloud-native platforms. To begin, hardening machine images can help to prevent attacks on the server. Hardening involves disabling unwanted services and ports, restricting access on the server, and removing unused software. It is important to understand the sensitivity of the data so that the correct measures can be taken to protect it. This will help you to decide upon the appropriate level of encryption and the type of key management system to use. To protect data in transit, you can encrypt sensitive data prior to moving and/or use encrypted connections (HTTPS, SSL, TLS, etc). To protect data at rest, you can encrypt sensitive data prior to storage or encrypt the storage drive itself.
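As a small illustration of encryption at rest, the sketch below builds the parameters for an S3 `PutObject` request that asks S3 to encrypt the object with a KMS key on arrival. The bucket, key, and KMS key identifier are placeholders, and the `boto3` call itself is shown only as a comment.

```python
def encrypted_put_params(bucket, key, body, kms_key_id):
    """Build PutObject parameters requesting server-side
    encryption with a customer-managed KMS key (SSE-KMS)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
    }

# Would be applied with boto3:
#   boto3.client("s3").put_object(**encrypted_put_params(...))
params = encrypted_put_params(
    "example-data-lake", "raw/a.json", b"{}", "alias/data-lake-key"
)
```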
AWS Identity and Access Management (IAM) policies, security groups, and access control lists help to manage access to AWS services and resources. Adopting a DevSecOps approach, security as code enables these controls to be implemented consistently and ensures best practice through ongoing, flexible collaboration between release engineers and security teams.
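A least-privilege IAM policy is one example of such a control expressed as code. The sketch below grants read-only access to a single prefix of a hypothetical data-lake bucket; the bucket name and ARNs are placeholders for illustration.

```python
import json

# A least-privilege policy: read objects under raw/ and list only
# that prefix of a hypothetical bucket named example-data-lake.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-data-lake/raw/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::example-data-lake",
            "Condition": {"StringLike": {"s3:prefix": ["raw/*"]}},
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
```

Because the policy is just data, it can be linted, reviewed, and deployed through the same pipeline as the rest of the platform.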
Monitoring and logging
Once a data platform becomes operational, you can measure its performance and alert the operations team so that problems are remediated before customers are impacted.
Amazon CloudWatch is a monitoring service for AWS resources and applications. It collects and monitors metrics and log files and can raise alarms when thresholds are breached. AWS CloudTrail is a web service that records API activity in an AWS account, focusing on the user, application, and activity performed on the system. CloudTrail's event history of AWS account activity helps with detecting unusual activity, security analysis, resource change tracking, and troubleshooting.
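For example, the sketch below builds the keyword arguments one might pass to CloudWatch's `put_metric_alarm` to alert when a hypothetical ETL Lambda function reports any errors in a five-minute window; the function name and alarm name are placeholders, and the `boto3` call is shown only as a comment.

```python
# Parameters for a CloudWatch alarm on the built-in AWS/Lambda
# Errors metric; "etl-transform" is a hypothetical function name.
alarm_params = {
    "AlarmName": "etl-lambda-errors",
    "Namespace": "AWS/Lambda",
    "MetricName": "Errors",
    "Dimensions": [{"Name": "FunctionName", "Value": "etl-transform"}],
    "Statistic": "Sum",
    "Period": 300,                 # evaluate over 5-minute windows
    "EvaluationPeriods": 1,
    "Threshold": 1,                # any error triggers the alarm
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    # "AlarmActions": [...],       # e.g. an SNS topic ARN to notify
}

# Would be applied with boto3:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```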
Finally, it is necessary to consider non-functional requirements such as availability, resiliency, scalability, and performance. If trade-offs are required, it is important to understand which non-functional requirements are the most flexible. The AWS Well-Architected Framework provides a great reference for building cloud platforms that will excel in these areas.
In a nutshell
When building a data platform in AWS, it is important to ensure that you understand your business requirements and data characteristics before delving into technology selection.
The best practice is to start prototyping and ensure that you automate with Infrastructure as Code and DevOps practices. Once you have selected the right technology, you must consider security, monitoring and logging, as well as the non-functional requirements such as availability, scalability, resilience, performance, and cost forecasting. As AWS partners, we can help you to build efficient, cost effective, and scalable data platforms using AWS.
If you would like to learn more about the topics outlined in this blog, please get in touch with a member of our AWS team.