Tech spotlight: Driving data democratisation in your organisation using Dataiku
It’s common knowledge that data analytics is increasingly crucial for businesses across the board. Generating and storing large quantities of data is now easy; the challenges lie in cataloguing, cleaning and leveraging that data to generate useful business insights. This is especially true given the well-documented skills gap: there are not currently enough data scientists available to meet the growing demand.
Numerous companies have been attempting to provide products that can step into this skills gap. Notably, all three major cloud providers – Amazon, Microsoft and Google – now offer machine-learning-as-a-service tools or APIs to make powerful data analytics more accessible. However, products are also reaching the market that are specifically targeted at empowering people across an organisation to clean data, build models and derive insights.
One product that we have been successfully helping clients to deploy is the Dataiku data science platform. This product has been making serious waves, reaching unicorn status in 2019 and then being named a Leader in the 2020 Gartner Magic Quadrant for Data Science and Machine Learning Platforms. Dataiku focuses on fostering collaborative data science and enabling data analysts without strong coding skills to leverage the toolkit of data science packages available within Python, SQL and R.
It moves data analysis out from individual machines and into a central managed location, removing dependencies on the availability of individual users and computers. Additionally, it provides clear auditability and security configurability, allowing the platform to match your compliance needs.
Dataiku needs to be deployed into a Linux environment, which can be cloud-based (we have helped clients architect this product, deploying it onto Amazon EC2 instances). Once it is up and running, users access the environment via their normal web browser without needing any additional configuration or installation. Within the environment, users work on project ‘flows’, which record each step in a data pipeline in an easy-to-understand visual way, backed by a Git repository. This makes the whole process, from data preparation through to model selection, training and scoring, transparent, and removes project dependencies on individual analysts.
Dataiku provides a range of ‘visual recipes’ to carry out common data analytics processes, from relatively simple tasks such as grouping, joining and filtering data up to and including the construction of machine learning models, without requiring users to write a single line of code.
However, it also includes Jupyter notebooks, allowing users who are proficient with R, Python and SQL to write their own tools and processes in a familiar setting. Once users have constructed their data flows, Dataiku also provides a range of automation tools which can be used to run pipelines at scheduled times or in response to specific triggers, along with simple interfaces for calculating metrics, running checks on them and sending reports.
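To make the idea of metrics and checks concrete, here is a minimal sketch in plain Python. It does not use Dataiku's own API: the function names, thresholds and status labels are our own illustrative assumptions, and in the platform itself this is configured through the interface. The sketch computes a metric on a dataset (the share of missing values in a column), compares it against thresholds, and uses the result to decide whether downstream steps should run.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def missing_ratio(rows: List[Row], column: str) -> float:
    """Metric: fraction of rows where `column` is missing (None)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


@dataclass
class CheckResult:
    name: str
    value: float
    status: str  # illustrative labels: "OK", "WARNING" or "ERROR"


def check_upper_bound(name: str, value: float,
                      warn_above: float, error_above: float) -> CheckResult:
    """Check: grade a metric value against two hypothetical thresholds."""
    if value > error_above:
        status = "ERROR"
    elif value > warn_above:
        status = "WARNING"
    else:
        status = "OK"
    return CheckResult(name, value, status)


def should_run_downstream(results: List[CheckResult]) -> bool:
    """Only continue the pipeline if no check has failed outright."""
    return all(result.status != "ERROR" for result in results)


# Hypothetical sample data: one of four customer records has no email.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": "d@example.com"},
]

ratio = missing_ratio(customers, "email")
result = check_upper_bound("email_missing_ratio", ratio,
                           warn_above=0.1, error_above=0.5)
print(result.status, should_run_downstream([result]))
```

In an automated pipeline, a warning would typically be reported to the team while an error would stop the run before poor-quality data reaches a model or a dashboard.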
Dataiku can connect to a wide range of data sources out of the box, with the ability to configure more connections to organisational systems via APIs. Supported integrations include Hadoop and Spark, cloud-based object storage on AWS, Azure and GCP, and a range of common SQL and NoSQL databases.
Where we have deployed the platform, we’ve also connected it to a backing database; Dataiku can push calculations down to where your data is stored, distributing the computational load and reducing the power and memory which the Dataiku environment itself needs to have available.
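The benefit of pushing calculations down to the database can be illustrated with a small, self-contained sketch. This is not Dataiku code: it uses Python's built-in sqlite3 module as a stand-in for a backing database, and the sales table and function names are hypothetical. The first function pulls every row into the client and aggregates there; the second lets the database do the aggregation and returns only one row per group.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 100.0), ("south", 250.0), ("north", 50.0),
         ("south", 25.0), ("east", 10.0)],
    )
    conn.commit()
    return conn


def totals_in_client(conn: sqlite3.Connection) -> dict:
    """Pull every row over the connection, then aggregate locally."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_pushed_down(conn: sqlite3.Connection) -> dict:
    """Let the database aggregate; only one row per region comes back."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query).fetchall())


conn = build_demo_db()
print(totals_in_client(conn))
print(totals_pushed_down(conn))
```

Both functions produce the same totals, but the pushed-down version transfers three rows rather than the whole table. On a table with millions of rows, that difference is what reduces the memory and compute the analytics environment itself needs.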
While Dataiku does include some in-built capacity to create visual dashboards, this backing database can also be used to handle output datasets and serve them to other tools – for example, to allow users to create reports and dashboards in tools they may already be familiar with, such as Power BI or Tableau.
Our clients’ users have then been able to share their data insights with the appropriate stakeholders. In this way, Dataiku integrates into your existing data infrastructure, providing a one-stop shop for data analytics and modelling using data from multiple source systems.
In a nutshell
Overall, the Dataiku platform is intuitive to use and has a shallow learning curve on entry, making adoption within an organisation very straightforward. This drives data democratisation within your organisation, making the data and the tools required for data analysis more readily available to a wider proportion of your company. It quickly gives users access to powerful analytics, modelling and prediction tools in a transparent and collaborative way.
For those who want to go further in customising and extending platform functionality, there is a wide range of resources available to help develop skills or apply existing skills to the platform architecture. While there is a range of aspects to consider when installing and configuring an enterprise-level installation to meet your business needs, Credera’s experience in this area means that we can help you streamline this process and get you up and running rapidly.
Credera and Dataiku
At Credera, we have experience architecting, deploying and developing a Dataiku platform at enterprise level. We have multiple consultants who hold Dataiku certifications, and we are recognised as part of the Dataiku partner ecosystem.
If you would like to learn more or see a demo, get in touch.