On 28th April 2022, Credera delivered its first AWS Immersion Day to a full room of Data Scientists, Technologists, and Senior Leaders at AWS' London headquarters. The event, 'Data Products, Pipelines & Pitfalls', covered:
Core AWS analytics services and lessons learned from deployments in complex enterprise environments
Hands-on machine learning (ML) and data pipeline workflow creation in AWS
Operating models to unlock analytics value creation and maintain a focus on cost
Throughout the day, we had great engagement from our customers and AWS partners alike, further building on our existing AWS Partner relationship and recent AWS Data & Analytics Competency announcement. Credera invests in our partnerships, and we work closely with AWS to bring the strengths of both parties to support our mutual clients at regular events such as this.
The event itself was facilitated by a multi-disciplinary Credera team, with experts in Data, Engineering, Architecture, Cloud, and Agile bringing a wide range of perspectives and client experience to the discussion.
Here are some of our reflections from the event:
Operationalising ML is a key challenge
Machine learning was the topic that piqued the most interest on the day. Several customers had started making use of AWS ML services such as SageMaker, and AWS Senior Architects reviewed the wider proliferation of ML capabilities across AWS services.
However, discussions showed that wider engagement across the business is still required to better align technology, processes, and people with the right model for scaled delivery. Credera see model non-functional requirements such as traceability, repeatability, explainability, and fairness as differentiating organisational goals, whilst the barrier to 'doing ML' is increasingly lowered by advances in technology.
Measure data product success by the ability to adapt to change
Even optimally designed data products will fail periodically. As pipelines grow more complex and increasingly integrate ML into their flows, tracing inference and lineage through data value streams becomes harder. Responding quickly to issues when they arise is key: when data is invalidated, it is often not just the data but the trust in that data which suffers.
At Credera, we see DataOps as providing a framework to improve the communication and automation of data flows from data managers to data consumers in an organisation. As a tangible example, the contrasting requirements of a test environment for a data scientist versus one for an engineer were a point of discussion, as was the interplay between trust and control in the creation and upkeep of such environments.
Visualise your data flows
The hands-on workshop element made use of AWS' managed service for Apache Airflow to create a Directed Acyclic Graph (DAG) of the data pipeline run. The act of visualising the relationships and dependencies of a run drew great feedback from the group, and points to the power of visually depicting model flows. Building on this, Credera discussed a client use case where complex model dependencies had been aggregated into logical groupings before visualisation, to reduce the visual clutter of such representations.
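To make the grouping idea concrete, here is a minimal sketch of collapsing task-level DAG edges into edges between logical groups before drawing them. The task names and groups are illustrative assumptions, not the actual workshop DAG; the same idea applies to an Airflow DAG's task dependencies.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline dependencies: task -> set of upstream tasks.
deps = {
    "extract_orders": set(),
    "extract_customers": set(),
    "clean_orders": {"extract_orders"},
    "clean_customers": {"extract_customers"},
    "join_datasets": {"clean_orders", "clean_customers"},
    "train_model": {"join_datasets"},
    "publish_report": {"join_datasets"},
}

# Assign each task to a logical group (illustrative labels).
groups = {
    "extract_orders": "ingest", "extract_customers": "ingest",
    "clean_orders": "prepare", "clean_customers": "prepare",
    "join_datasets": "prepare",
    "train_model": "serve", "publish_report": "serve",
}

def grouped_edges(deps, groups):
    """Collapse task-level edges into edges between logical groups."""
    edges = set()
    for task, upstreams in deps.items():
        for up in upstreams:
            if groups[up] != groups[task]:
                edges.add((groups[up], groups[task]))
    return edges

# The task-level DAG still has a valid execution order...
order = list(TopologicalSorter(deps).static_order())
# ...while the grouped view has far fewer edges to render.
print(grouped_edges(deps, groups))  # → {('ingest', 'prepare'), ('prepare', 'serve')}
```

Seven task-level edges collapse to two group-level edges, which is what keeps a large dependency diagram readable.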
Visual representations of data pipelines were also introduced in the context of simple monitoring dashboards: simple information radiators made available to all stakeholders. As one customer put it in their takeaway from the day, 'help others to help themselves.' We could not agree more.
The ongoing importance of CI/CD pipelines
Having automated processes for system build and testing is a cornerstone of successful enterprise data product delivery. It requires daily focus within delivery teams to keep testing frameworks relevant as new features are delivered. This will involve tackling the issue of representative test data in lower environments: a hard problem to solve in some organisational contexts, but never an excuse for no data at all.
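One lightweight way to keep a testing framework relevant is to run pipeline steps against a small synthetic dataset shaped like production data on every build. The sketch below is a minimal, assumed example (the transform and field names are hypothetical), of the kind of assertion that could sit in a CI pipeline:

```python
# Minimal sketch of an automated data-pipeline test suitable for CI,
# using a tiny synthetic dataset shaped like production records.
# The transform and field names here are illustrative assumptions.
from datetime import date

def drop_incomplete(rows):
    """Example pipeline step: discard rows missing a customer_id."""
    return [r for r in rows if r.get("customer_id") is not None]

synthetic_rows = [
    {"customer_id": 1, "order_date": date(2022, 4, 1), "total": 10.0},
    {"customer_id": None, "order_date": date(2022, 4, 2), "total": 5.0},
]

cleaned = drop_incomplete(synthetic_rows)
assert all(r["customer_id"] is not None for r in cleaned)
assert len(cleaned) == 1
print("data tests passed")
```

Because the dataset is synthetic, it can live in version control alongside the pipeline code, sidestepping the sensitivity issues that make copying production data into lower environments hard.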
Implementation decisions can have large cost impacts
Perhaps unsurprisingly, the cost control section of the event wasn’t initially greeted with the same enthusiasm as machine learning. However, there were some key takeaways within this domain.
Time was focused on the savings to be made through better design of analytical workloads, given that compute will almost certainly be the main cost driver on a data and analytics project. A worked example showed a 5000x cost difference for a simple query based solely on how S3 objects are compacted, stored, and partitioned.
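As a back-of-envelope illustration of how a gap that large can arise, the sketch below assumes a scan-based pricing model of $5 per TB scanned (as used by services such as Athena) and illustrative figures for compression and partition pruning; these numbers are our assumptions, not those from the workshop example.

```python
# Back-of-envelope scan-cost comparison under an assumed $5/TB-scanned
# pricing model. All figures are illustrative assumptions.
PRICE_PER_TB = 5.00
TB = 1024 ** 4

def query_cost(bytes_scanned):
    """Cost of a query that scans the given number of bytes."""
    return PRICE_PER_TB * bytes_scanned / TB

# Naive layout: 5 TB of uncompressed row-based files, no partitioning,
# so every query scans the full table.
naive = query_cost(5 * TB)

# Tuned layout: columnar + compressed (~10x smaller on disk), and
# date-partitioned so a typical query prunes to ~1/500 of the data.
tuned = query_cost(5 * TB / 10 / 500)

print(f"naive ${naive:.2f} vs tuned ${tuned:.4f} ({naive / tuned:.0f}x)")
# → naive $25.00 vs tuned $0.0050 (5000x)
```

A 10x reduction from compression and columnar storage multiplied by a 500x reduction from partition pruning compounds to the 5000x order of magnitude discussed on the day, without touching the query itself.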
At Credera, we see best returns on investment when a collaborative relationship is fostered between technology teams, knowledge workers, and finance teams. Tangible efficiencies can be found when simple fixes are identified and make their way onto product backlogs in an iterative manner.
Lessons learned the hard way
Some of the most interesting discussion points from the Immersion Day came from reflecting on where implementations had gone wrong, whether through assumptions leading to project delays, simple implementation decisions causing large cloud cost spikes, or deliveries failing to find the right balance between security and usability.
Perhaps the biggest take home for building successful modern data products is to adopt a growth mindset in how you deliver, so that project learnings are sought (and hopefully taken) from the inevitable mistakes which will happen on your long and winding journey to ‘data nirvana.’
Interested in attending?
Whilst this was our first Immersion Day, we certainly do not expect it to be our last, with planning already underway for another Credera event with AWS later in the year.
If you are interested in an Immersion Day and would like to know what options are available to suit your context, please get in touch.