We’ve written before on the evolution of the Modern Data Stack and how data governance is driving toward automation through machine learning and AI. What is perhaps less exciting is the quality of the data underpinning this paradigm shift in modern data analysis. Innovation around automated data governance depends on the precision of the underlying data in the first instance, which only underlines the importance of good data quality.
With the exponential growth of the modern data stack and the way we interact with and manipulate data, the requirement to enforce good data quality is both urgent and important.
What is data quality?
Data quality is certainly a broad topic with many permutations and multiple definitions. To quote an article on “What is data quality”:
"Fleckenstein and Fellows (2018) refer to high-quality data as data that 'are fit for their intended uses in operations, decision making and planning'. In a similar vein, the National Institute of Standards and Technology defines data quality as 'the usefulness, accuracy, and correctness of data for its application'."
It becomes clear that data quality is defined relative to the data's intended use. Without a clear definition or understanding of data quality, we will be unable to measure it, fix issues or reap the benefits.
One particular element that compounds the difficulty of ensuring data quality is the vast and disparate volume of data being generated. Think of the various types of data that might be encountered, from structured and semi-structured to unstructured data.
"Rather than being devoid of any structure, these data have an inner structure. However, this is either unknown prior to data collection, or we don’t have a predefined data model when collecting data."
In their guide to better data quality, Snowplow offer two key characteristics with which to define and consider data quality:
"At a basic level, we define data quality using two key characteristics: accuracy and completeness. Whether you realize it or not, being able to trust - and actively trusting - your data undergirds your entire approach to data analytics."
To summarise the definition of data quality: given the sheer magnitude and diversity of data being generated, it is essential to understand “what good looks like” for each data type according to your business needs.
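The two characteristics Snowplow names, accuracy and completeness, can be made concrete as simple metrics. Below is a minimal sketch in plain Python; the records, field names and the country rule are invented for illustration and are not Snowplow tooling:

```python
# Sketch: quantifying the two characteristics named above.
# Completeness = share of records with every expected field populated.
# Accuracy     = share of records whose values satisfy a known business rule.
# All records, fields and rules here are illustrative assumptions.

records = [
    {"order_id": "A1", "amount": 25.0, "country": "GB"},
    {"order_id": "A2", "amount": None, "country": "FR"},   # incomplete
    {"order_id": "A3", "amount": 12.5, "country": "ZZ"},   # inaccurate
]

EXPECTED_FIELDS = ("order_id", "amount", "country")
VALID_COUNTRIES = {"GB", "FR", "DE", "US"}  # hypothetical business rule

def completeness(rows):
    """Fraction of rows where every expected field is populated."""
    ok = sum(all(r.get(f) is not None for f in EXPECTED_FIELDS) for r in rows)
    return ok / len(rows)

def accuracy(rows):
    """Fraction of rows whose country value passes the business rule."""
    ok = sum(r["country"] in VALID_COUNTRIES for r in rows)
    return ok / len(rows)

print(f"completeness: {completeness(records):.2f}")  # 0.67
print(f"accuracy:     {accuracy(records):.2f}")      # 0.67
```

Real pipelines would track many such rules per dataset, but even two numbers like these give teams a trend line they can trust (or act on) rather than a vague sense that "the data seems off".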
Why is this important now?
As adoption of the modern data stack has grown, we are quickly arriving at a stage where there is simply too much data for a human workforce to maintain.
AI and machine learning capabilities are ever present and expanding. Applying these methods and technologies to data quality and data governance is essential if good data quality is to be enforced at scale.
"Given the volume of information available and the complexity of modern data sources (including unstructured data), it is impossible (or at the very least impractical) for a human workforce to keep up."
There is a plethora of platforms and tools focussed entirely on building connectors for multiple data sources, ensuring data can be easily moved and accessed throughout the data lifecycle. This creates an ever-widening gap in which data quality and data governance are overlooked or under-prioritised.
"Fundamentally, as data becomes central to business, data collection and use becomes less about moving it from one place to another in a specific order and more about being able to collect data from anywhere in real time and perform data validation and enrichment."
Data consumers lack the context needed to navigate the ever-growing data footprint inside their organizations, resulting in a significant distrust in data.
There has never been a more urgent need to prioritise data quality, data governance and the automation of the discipline. At the end of the day, poor data quality will negatively impact the bottom line of any organisation.
Data quality through schema management
At Trackplan our vision is to democratise and automate data governance and help organisations build trust in their data. We’ve begun this journey with our most recent integration with Snowplow Analytics, providing a single source of truth for behavioural data tracking via schema management at the validation layer.
With this integration Trackplan empowers data teams to prioritise and visualise any errors or discrepancies pertaining to data quality, right at the point of ingestion:
- Manage schemas and metadata for Snowplow Data Structures through a collaborative tracking plan with an intuitive UX.
- Create data structures and define your data types, sources, properties and validations used to enforce data quality.
- Sync your latest changes to the schema registry. Manage deployments to development and production environments.
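To give a flavour of what validation at the point of ingestion looks like, here is a minimal sketch in plain Python. The schema, field names and events are invented for illustration; they are not actual Snowplow Data Structures or Trackplan APIs:

```python
# Minimal sketch of schema validation at the point of ingestion.
# The schema and events below are illustrative assumptions, not
# real Snowplow Data Structures.

SCHEMA = {
    "event_name": {"type": str, "required": True},
    "user_id":    {"type": str, "required": True},
    "value":      {"type": (int, float), "required": False},
}

def validate(event: dict, schema: dict = SCHEMA) -> list:
    """Return a list of data-quality errors; an empty list means the event passes."""
    errors = []
    for field, rules in schema.items():
        if field not in event:
            if rules["required"]:
                errors.append(f"missing required field: {field}")
            continue  # optional field absent: nothing to check
        if not isinstance(event[field], rules["type"]):
            errors.append(f"wrong type for {field}: {type(event[field]).__name__}")
    return errors

# A well-formed event passes; a malformed one is flagged at ingestion
# rather than silently loaded downstream.
good = {"event_name": "page_view", "user_id": "u-123", "value": 1}
bad  = {"event_name": "page_view", "value": "oops"}
print(validate(good))  # []
print(validate(bad))   # ['missing required field: user_id', 'wrong type for value: str']
```

Surfacing errors like these at the validation layer, instead of discovering them in downstream reports, is what lets data teams prioritise and visualise quality issues as described above.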
Schema management is just the beginning of the evolution in data governance that is now essential to leveraging the modern data stack effectively.