The ever-growing volume of data being ingested, centralised and transformed across a plethora of tools and platforms has created unprecedented benefit for organisations and businesses. But the rise of the modern data stack, with its near-limitless processing capabilities, has also given rise to the chaos of trying to leverage the value that all this data brings.
Data governance, and the building of trust in this data, has never been a higher priority than it is now.
MDS - The Modern Data Stack
The emergence of the “modern data stack” is marked by the shift in data warehousing from on-premises databases to cloud platforms and data lakes over the last decade or so.
“The modern data stack is centered on a powerful data warehouse. Data is loaded directly into the warehouse. A robust and reliable transformation layer is used to turn that raw data into dependable and meaningful datasets.”
The launchpad, so to speak, of the modern data stack was the release of Amazon Redshift (beta in 2012 and full release in 2013). This was arguably the most substantial change in how data warehouses were architected and structured.
“This night-and-day difference is driven by the internal architectural differences between MPP (massively parallel processing) / OLAP systems like Redshift and OLTP systems like Postgres”
As described by Tristan Handy (CEO & Founder of Fishtown Analytics & creator of dbt), the modern data stack was born from the immense increase in processing speed delivered by Amazon Redshift.
“In short, Redshift can respond to analytical queries, processing many joins, on top of huge datasets, 10-1000x faster than OLTP databases”
Redshift was not the first to leverage MPP, but it was the first cloud-native data warehouse to give rise to the modern data stack.
ETL -> ELT -> ELTG
Whilst the modern data warehouse has revolutionised the data ecosystem and given rise to the modern data stack, the fundamentals of the process have not changed. We are still building data pipelines that extract data from various sources, transform and store it in a centralised data warehouse, and then make it available to BI, analytics or even AI/ML projects.
Before cloud-based data warehousing, we had to rely solely on the ETL (extract, transform and load) process, largely because of the cost associated with scale. Data needed to be heavily curated, or transformed, before being piped into a centralised warehouse.
ETL was traditionally a highly technical process that required a great deal of engineering capability to transform the data before loading it. With so much manual effort going into this process, data governance was rarely treated as a priority. It was left for other tools to deal with further down the line.
That has all changed in recent years, as the capabilities of modern cloud-based data warehouses have removed the need to transform data before loading it.
“In the modern data pipeline, you can extract large amounts of data from multiple data sources, dump it all in the data warehouse without worrying about scale or format, and then transform the data directly inside the data warehouse – in other words, extract, load and transform (“ELT”)”
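To make the ordering concrete, here is a minimal sketch of the ELT pattern described in the quote. It assumes an in-memory SQLite database as a stand-in for a cloud data warehouse, and the table names, sample rows and function names are all hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical raw records, as an extract step might return them.
# Values are deliberately messy: stray whitespace, mixed-case country codes.
RAW_ORDERS = [
    ("1001", "2021-03-01", " 19.99", "gb"),
    ("1002", "2021-03-01", "5.00 ", "GB"),
    ("1003", "2021-03-02", "42.50", "us"),
]


def load_raw(conn, rows):
    """Load: land the extracted rows untouched, every column as text."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, order_date TEXT, amount TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)


def transform(conn):
    """Transform: clean and aggregate with SQL, inside the warehouse itself."""
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT order_date,
               UPPER(country) AS country,
               ROUND(SUM(CAST(TRIM(amount) AS REAL)), 2) AS revenue
        FROM raw_orders
        GROUP BY order_date, UPPER(country)
        """
    )


def run_elt():
    """Run load then transform, and return the curated rows."""
    conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse
    load_raw(conn, RAW_ORDERS)
    transform(conn)
    return conn.execute(
        "SELECT order_date, country, revenue "
        "FROM daily_revenue ORDER BY order_date, country"
    ).fetchall()


if __name__ == "__main__":
    for row in run_elt():
        print(row)
```

The point of the sketch is the ordering: nothing is cleaned before it lands in `raw_orders`, and all curation happens afterwards in SQL, where the warehouse's own compute does the work.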
With the transformation now happening within the data warehouse, and with “plug and play” connectors from tools such as Fivetran, Segment and Snowplow making it ever easier to centralise data, there is a clear need for an automated data governance layer.
“...in a cloud data warehouse centric paradigm, where the main goal is “just” to extract and load data, without having to transform it as much, there is an opportunity to automate a lot more of the engineering task”
This evolves our acronym one step further, to the point where governance and validation happen directly at the time of loading data… ELTG… extract, load, transform, govern.
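As a rough illustration of what a "govern" step at load time could look like, the sketch below checks each incoming record against a declared schema and quarantines anything that fails, instead of letting it flow silently into the warehouse. This is an assumption about one possible shape of such a layer, not a description of any particular tool; `ORDER_SCHEMA`, `LoadResult` and `govern_on_load` are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import date


def _is_iso_date(value):
    """True if value is a string in ISO format (YYYY-MM-DD)."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def _is_positive_amount(value):
    """True if value can be read as a number greater than zero."""
    try:
        return float(value) > 0
    except (TypeError, ValueError):
        return False


# Hypothetical schema: column name -> check that returns True for valid values.
ORDER_SCHEMA = {
    "order_id": lambda v: isinstance(v, str) and v.strip() != "",
    "order_date": _is_iso_date,
    "amount": _is_positive_amount,
}


@dataclass
class LoadResult:
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (record, failed columns)


def govern_on_load(records, schema):
    """Validate each record as it is loaded; quarantine the failures."""
    result = LoadResult()
    for record in records:
        failed = [
            column
            for column, check in schema.items()
            if column not in record or not check(record[column])
        ]
        if failed:
            result.rejected.append((record, failed))
        else:
            result.accepted.append(record)
    return result
```

Keeping the rejected records together with the names of the columns that failed gives a simple audit trail: the load does not stop, but nothing invalid reaches the curated tables unnoticed.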
The time for governance is now
The modern data stack has grown faster than any other data trend in the market. Batch ingestion, streaming, scalable cloud storage and visualisation have all ridden this wave of innovation, yet no governance tool has kept up with the demand.
With this wave of innovation, the modern data stack has introduced new complexities, challenges and pain points as organisations ingest, process, centralise and consume more data than ever before. Laws and regulations make privacy a must, and those laws will continue to develop and standardise across the globe.
The time for governance is now. We need to rapidly build data governance tooling and services to catch up with and meet the demands of the modern data stack.
Key aspects of modern data governance
- Must provide and ensure data quality, lineage, cataloguing, auditing, and metadata and schema management for the modern data-first company
- Vetting and validation need to be done in real time, at the point where source data is ingested
- Monitoring across multiple instances of data centralisation and storage, from on-premises to multi-cloud infrastructure and often combinations of both, is essential
- Automation driven by ML and AI will be critical to match the speed of innovation and growth of the modern data stack
- Needs to be accessible enough to offer solutions beyond enterprise-level organisations alone