Today's data economy has grown exponentially in recent years and shows no sign of slowing down. Data, however, is only as good as the tangible value it can deliver. For all the data gathering, centralising and real-time delivery, a proportionally large amount of effort and time is still required to validate and govern the raw data. This is quantified in a report from Forrester, which states:

“nearly one-third of analysts spend more than 40 percent of their time vetting and validating their analytics data before it can be used for strategic decision-making.”

To put a monetary figure on this, Gartner research suggests

“the average financial impact of poor data quality on organizations is $9.7 million per year.”

At its core, the problem of data quality stems from the inability to see a clear ‘big picture’ or ‘single source of truth’. Data is scattered across such a vast array of applications, tools, advertising platforms and third-party providers that it is no wonder so much time and effort is spent gathering and validating these data sources.

How this is [partially] solved today

The concept of data centralisation is not a new one, and in recent years it has been distilled into two distinct solutions: a CDP, or Customer Data Platform,

“A CDP builds a complete picture of your customers on an individual level. It collects 1st party customer data (transactional, behavioral, demographic) from a multitude of sources and systems, and links that information to the customer that created it.”

and a DMP, or Data Management Platform

“A unifying platform to collect, organize and activate first-, second- and third-party audience data from any source, including online, offline, mobile, and beyond.”

This is the culmination of a gradual shift towards tracking individual customer behaviour rather than cookie and session data that would otherwise need to be manually connected back to a single user. CDPs collect customer data from various sources and centralise it for use further down the data pipeline, where it is leveraged for more executional purposes such as marketing automation or analytics and reporting tools. A DMP does similar work in data centralisation, but from a broader variety of sources, and will typically be better equipped to handle unstructured data.

Although these multiple and fragmented data sources are now “centralised”, there is little in the way of validation, governance or stewardship happening before the data is sent downstream for actionable use cases. The centralisation occurs, but the vetting and validation of this data is largely left undone.

Built-in Data Governance

More recently, as both of the solutions mentioned above have evolved to meet the demands of data quality and data governance, each platform has expanded its offering on the subject.

Segment released Protocols in late 2018 and has developed the product in recent years as a means of enforcing standardisation of the data being collected by Segment. In short, Protocols provides guidance, templates and best-practice opinions to help create a tracking plan. This plan is stored at the workspace level and can be connected to one or more sources in Segment, with rules and validations set up and applied via controls to transform data accordingly.
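
Tracking plan rules of this kind are commonly expressed as JSON Schema. Below is a minimal sketch of what a rule for a single event might look like, written as a TypeScript constant; the event name, properties and constraints are hypothetical, and Segment's own Tracking Plan API wraps rules like this in additional metadata.

```typescript
// Hypothetical JSON Schema rule for an "Order Completed" event.
// The event name, properties and constraints are illustrative only.
const orderCompletedRule = {
  $schema: "http://json-schema.org/draft-07/schema#",
  type: "object",
  properties: {
    order_id: { type: "string" },
    revenue: { type: "number", minimum: 0 },
    currency: { type: "string", enum: ["USD", "EUR", "GBP"] },
  },
  required: ["order_id", "revenue"],
  additionalProperties: false, // properties outside the plan count as violations
};
```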

Looking at mParticle and its built-in solution, Data Master, there is a similar approach. The data plan is the starting point for enforcing data quality within mParticle, neatly underpinned by its Data Planning API. Once the planning is done, you can activate it by verifying incoming data against expectations; validating and monitoring your incoming data stream against those expectations; and ultimately beginning to block unplanned data.
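
That verify, monitor and block progression is a general pattern rather than anything specific to mParticle's API, and the sketch below illustrates it generically: incoming events are checked against a plan of JSON Schemas (using the open-source Ajv validator here), and anything unplanned or invalid is flagged or blocked. The plan contents and function names are hypothetical.

```typescript
import Ajv from "ajv";

type PlanAction = "accept" | "flag" | "block";

// A hypothetical data plan: one JSON Schema per planned event name.
const plan: Record<string, object> = {
  "Order Completed": {
    type: "object",
    properties: { order_id: { type: "string" }, revenue: { type: "number" } },
    required: ["order_id", "revenue"],
  },
};

const ajv = new Ajv();

// Decide what to do with an incoming event, mirroring the
// verify -> validate & monitor -> block progression described above.
function checkEvent(name: string, properties: unknown, blockUnplanned: boolean): PlanAction {
  const schema = plan[name];
  if (!schema) return blockUnplanned ? "block" : "flag"; // unplanned event
  const valid = ajv.validate(schema, properties);
  return valid ? "accept" : "flag"; // surface violations for monitoring
}
```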

Snowplow Analytics focuses more specifically on the standardisation and validation of behavioural customer data by way of its schema registry.

“Every collected data point is validated against your configured schemas, and any bad data is surfaced to you so you can proactively monitor and manage data quality.”

Snowplow Data Journey © Snowplow

Snowplow offers a schema registry with built-in validation and alerting that proactively flags errors in your individual data structures.
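
For illustration, each of those data structures is defined as a self-describing JSON Schema registered in Snowplow's Iglu schema registry, and every event that references a schema is validated against it in the pipeline. The vendor and event below are hypothetical.

```typescript
// Illustrative self-describing JSON Schema in the style stored in Snowplow's
// Iglu schema registry. The vendor ("com.acme") and event ("button_click")
// are hypothetical; events reference it as
// "iglu:com.acme/button_click/jsonschema/1-0-0" and are validated against it.
const buttonClickSchema = {
  $schema: "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  description: "Schema for a button click event",
  self: {
    vendor: "com.acme",
    name: "button_click",
    format: "jsonschema",
    version: "1-0-0",
  },
  type: "object",
  properties: {
    label: { type: "string", maxLength: 255 },
  },
  required: ["label"],
  additionalProperties: false,
};
```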

Further downstream, other tools in the ecosystem, such as Mixpanel and Amplitude, are increasing their efforts on a smaller scale to tackle data quality issues and provide solutions around data planning and standardisation.

Lexicon within Mixpanel allows you to publish your tracking plan, or data dictionary, as schemas via its API, ensuring consistency and standardisation for the data coming into Mixpanel.
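
As a rough illustration of what that publish step could look like, the sketch below sends an event schema to a Lexicon-style endpoint over HTTP. The endpoint path, payload shape and credentials are assumptions made for the sake of the example, not Mixpanel's documented contract.

```typescript
// Hypothetical sketch of publishing an event schema to Mixpanel Lexicon.
// The endpoint path and payload fields are assumptions, not the documented
// API contract; the project id and service-account credentials are placeholders.
const PROJECT_ID = "1234567";                    // placeholder
const SERVICE_ACCOUNT = "serviceaccount:secret"; // placeholder

async function publishToLexicon(eventName: string, schemaJson: object): Promise<void> {
  const res = await fetch(
    `https://mixpanel.com/api/app/projects/${PROJECT_ID}/schemas`, // assumed path
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Basic ${Buffer.from(SERVICE_ACCOUNT).toString("base64")}`,
      },
      body: JSON.stringify({
        entries: [{ entityType: "event", name: eventName, schemaJson }], // assumed shape
      }),
    },
  );
  if (!res.ok) throw new Error(`Lexicon publish failed: ${res.status}`);
}
```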

Amplitude offers a very similar solution by way of its Taxonomy API and its Govern feature.

“Govern gives you a central location where you can create, edit, manage, block, and delete events, event properties, and user properties”

In essence, this provides a built-in way to enforce consistency and manage the event data coming into Amplitude.
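
A comparable sketch for Amplitude would register each planned event type through the Taxonomy API. The endpoint and parameters below are based on Amplitude's public documentation but should be treated as assumptions to verify; the credentials are placeholders.

```typescript
// Sketch of registering a planned event type via Amplitude's Taxonomy API.
// Treat the endpoint and field names as assumptions to verify against the
// current docs; the API key and secret key are placeholders.
const API_KEY = "AMPLITUDE_API_KEY";       // placeholder
const SECRET_KEY = "AMPLITUDE_SECRET_KEY"; // placeholder

async function registerEventType(eventType: string, description: string): Promise<void> {
  const res = await fetch("https://amplitude.com/api/2/taxonomy/event", {
    method: "POST",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
      Authorization: `Basic ${Buffer.from(`${API_KEY}:${SECRET_KEY}`).toString("base64")}`,
    },
    body: new URLSearchParams({ event_type: eventType, description }),
  });
  if (!res.ok) throw new Error(`Taxonomy registration failed: ${res.status}`);
}
```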

The introduction of schema management

Just as CDPs and DMPs have solved the problem of fragmented data sources by unifying them under a single view, there is a need for the same centralisation of the various schemas and data structures prevalent across all the tools and platforms mentioned above.

A tracking plan needs to be the schema management platform that integrates with the modern platforms surrounding behavioural data tracking.
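
To sketch what a platform-agnostic schema could look like, the hypothetical definition below captures an event once, in a tool-neutral shape, from which a Segment tracking-plan rule, an mParticle data point, a Snowplow Iglu schema, or a Mixpanel/Amplitude taxonomy entry could each be generated. All names and fields are illustrative.

```typescript
// A hypothetical, tool-agnostic event definition for a tracking plan.
// One canonical definition like this could be translated into each
// downstream tool's own planning or governance format.
interface PropertyDefinition {
  type: "string" | "number" | "boolean";
  required: boolean;
  description: string;
}

interface EventDefinition {
  name: string;
  description: string;
  owner: string; // team responsible for keeping the event healthy
  properties: Record<string, PropertyDefinition>;
}

const orderCompleted: EventDefinition = {
  name: "Order Completed",
  description: "Fired when a customer finishes checkout.",
  owner: "growth-team",
  properties: {
    order_id: { type: "string", required: true, description: "Internal order identifier" },
    revenue: { type: "number", required: true, description: "Order value in USD" },
  },
};
```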

Key benefits of platform agnostic schema management

  1. Abstraction of the variety of siloed “data planning” and “data governance” APIs (see the sketch after this list)
  2. Collaborative schema management, so tracking design becomes a task for all teams, no longer just data engineers
  3. Improved data quality, with validations synced to and enforced by the tracking tool
  4. Data governance at the core of schema management
  5. An improved workflow from planning to validation, centralising good-quality data
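
On the first point, the sketch below shows how those siloed APIs could be hidden behind a single interface: each destination adapter translates a canonical event definition into its tool-specific call, and the plan is synced to every connected tool from one place. The interface, adapter names and stubbed bodies are all hypothetical.

```typescript
// Minimal event shape for the sketch (mirrors the earlier EventDefinition).
interface PlannedEvent {
  name: string;
  properties: Record<string, { type: string; required: boolean }>;
}

// Hypothetical abstraction over siloed governance APIs: each adapter would
// translate the canonical event into a tool-specific call (bodies stubbed).
interface GovernanceDestination {
  name: string;
  publish(event: PlannedEvent): Promise<void>;
}

const destinations: GovernanceDestination[] = [
  { name: "segment-protocols", publish: async () => { /* push a tracking-plan rule */ } },
  { name: "amplitude-taxonomy", publish: async () => { /* register the event type */ } },
  { name: "mixpanel-lexicon", publish: async () => { /* publish a schema entry */ } },
];

// Publish a planned event to every connected tool from one place.
async function syncEvent(event: PlannedEvent): Promise<void> {
  await Promise.all(destinations.map((d) => d.publish(event)));
}
```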