The models of Provenance, Lineage and Chain of Custody are used in fine art to determine when a piece was created, the sequence of locations where it was held, how it was touched along the way, and who has owned it since creation, all with the purpose of authenticating the piece. What does this have to do with boring data?
It turns out many decisions which affect our daily lives are made using a single final result – or score – which is derived from many other pieces of data. What if one of those pieces of data was wrong or stale? This could lead to “Bad Data”, and the consequences can range from the inconvenient to the catastrophic. We must understand the data components used to calculate a final number to ensure the result is valid and current; this is why we need to adopt the models of Data Provenance, Data Lineage and Data Chain of Custody, and make them an intrinsic part of any data-driven decision.
Let me start with a few examples:
Estimates of the cost of “Bad Data” range from $611 billion each year for U.S. firms, according to TDWI (The Data Warehousing Institute), to IBM’s figure of $3.1 trillion per year. Either figure is simply staggering, not to mention the individual lives affected.
The causes of Bad Data typically fall into these categories:
The right solution needs to address all these issues under the umbrella of Data Governance, and it must provide a full audit trail to record and verify all events that could change every piece of data going into a meaningful calculation. It must enable enterprises to have the proper tracking and monitoring of data via Data Provenance, Data Lineage, and Data Chain of Custody.
Data Provenance refers to the “origin” and “source” of data – where a piece of data came from and the process by which it came to be in its present state.
Data Lineage is the process of tracing and recording the origins of data and its movement between databases or systems; it tracks the data life cycle from its origin to its destination over time, and what happens as it goes through diverse processes.
Data Chain of Custody refers to the indelible record that captures the original data, who accessed or modified it during its lifetime, how the data changed, and where and when possession was transferred.
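To make the chain of custody idea concrete, here is a minimal sketch in Python of an append-only record for a single piece of data. The names `CustodyEvent` and `CustodyLog`, and the fields they carry, are hypothetical and chosen only for illustration; they are not taken from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CustodyEvent:
    """One immutable entry: who did what to the data, when, and where it ended up."""
    actor: str           # person or system that touched the data
    action: str          # e.g. "created", "modified", "transferred"
    value: str           # the data's value after this event
    location: str        # system holding the data after this event
    timestamp: datetime


class CustodyLog:
    """Hypothetical append-only history for a single piece of data."""

    def __init__(self) -> None:
        self._events: List[CustodyEvent] = []

    def record(self, actor: str, action: str, value: str, location: str,
               timestamp: Optional[datetime] = None) -> CustodyEvent:
        # Events can only be appended; nothing here edits or deletes old entries.
        event = CustodyEvent(actor, action, value, location,
                             timestamp or datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def history(self) -> Tuple[CustodyEvent, ...]:
        # Return a tuple so callers cannot mutate the internal list.
        return tuple(self._events)

    def origin(self) -> CustodyEvent:
        # The first event captures the original data.
        return self._events[0]

    def actors(self) -> List[str]:
        # Everyone who touched the data, in order of first appearance.
        seen: List[str] = []
        for event in self._events:
            if event.actor not in seen:
                seen.append(event.actor)
        return seen
```

Recording a creation, a modification and a transfer, and then calling `actors()`, answers the question "who has handled this value?" in order. A real system would also need to protect the log itself from being edited, which this sketch does not attempt.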
Data provenance needs to allow the user to see how a piece of data flowed through the system, replay it at any stage in the flow, and store what happened to the data before and after key stages. This makes it easier to understand data flows that are often large, complex directed graphs involving transformations, forks, and joins.
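As a rough illustration of why a directed graph is a natural fit, the sketch below records which inputs each derived item came from and then walks the graph backwards to find everything a final number depends on. The class `LineageGraph` and the item names used with it are assumptions made for this example only.

```python
from typing import Dict, List, Set, Tuple


class LineageGraph:
    """Hypothetical lineage store: each derived item points back to its parents."""

    def __init__(self) -> None:
        # child -> list of (parent, step that produced the child from it)
        self._parents: Dict[str, List[Tuple[str, str]]] = {}

    def derive(self, child: str, parents: List[str], step: str) -> None:
        # One parent with several children is a fork; several parents is a join.
        self._parents.setdefault(child, []).extend((p, step) for p in parents)

    def upstream(self, item: str) -> Set[str]:
        # Every ancestor of `item`, found by walking parent links backwards.
        seen: Set[str] = set()
        stack = [item]
        while stack:
            current = stack.pop()
            for parent, _step in self._parents.get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def sources(self, item: str) -> Set[str]:
        # Original inputs: ancestors that were not themselves derived.
        ancestors = self.upstream(item)
        if not ancestors:
            return {item}
        return {a for a in ancestors if a not in self._parents}


graph = LineageGraph()
# Fork: one raw feed is cleaned in two different ways.
graph.derive("balances", ["raw_feed"], "extract balances")
graph.derive("late_payments", ["raw_feed"], "extract late payments")
# Join: the score combines both branches with a second source.
graph.derive("score", ["balances", "late_payments", "public_records"], "compute score")
```

With this in place, `graph.sources("score")` returns the two original inputs, `raw_feed` and `public_records`, so a stale or incorrect source can be traced to every number that depends on it.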
A great solution to this problem came from an unexpected source: the National Security Agency (NSA). The NSA could not find a commercial solution which had at its core the data governance, security and audit capabilities it needed to move massive amounts of data securely from a multitude of sources, so it decided to build one in-house more than ten years ago. The project is called Apache NiFi, and it was submitted to the Apache Software Foundation in November of 2014 as part of the NSA Technology Transfer Program, making it an open source software project.
Apache NiFi was implemented to solve two basic problems: first, to move massive amounts of data, from many sources and of many varieties, securely and effectively; and second, to have Data Governance built directly into the system to trace the data from beginning to end.
Every piece of data that flows through Apache NiFi is listed for chain of custody, lineage and data provenance analysis.
Once a piece of data is chosen, its lineage can be inspected further. The picture shows how a piece of data was received, forked and routed between systems.
By further inspecting this flow, one can gain provenance information on how that piece of data was handled and processed along the way. This means full knowledge of where the data came from, who modified it along the way and how each reported number is calculated.
Data is only reliable when the sources and processes used to create a result set are traceable, reproducible, and visible to those responsible for the results. This requires an infrastructure designed from the ground up to implement proper Data Governance, to track data through all the transformations from source to end result, and to guarantee that provenance records cannot be rewritten, so that their integrity and reliability are preserved.
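One common way to make rewritten records detectable is a hash chain, where each record's hash covers the previous record's hash. The sketch below is an illustration of that general technique only; it is not a description of how Apache NiFi stores provenance, and the function names `append_record` and `verify` are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    # Hash the previous hash together with a canonical form of the payload.
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain: List[Dict[str, Any]], payload: Dict[str, Any]) -> Dict[str, Any]:
    # Link the new record to the one before it.
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"payload": payload, "prev_hash": prev_hash,
             "hash": _digest(prev_hash, payload)}
    chain.append(entry)
    return entry


def verify(chain: List[Dict[str, Any]]) -> bool:
    # Recompute every hash; any edited payload or broken link fails the check.
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Changing any earlier payload after the fact makes `verify` return `False`. Note that this detects a rewrite rather than preventing one: an attacker who can recompute every later hash could still forge the chain, so real systems typically also anchor or sign the latest hash somewhere outside the attacker's control.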
To quote Tim Berners-Lee: “Data is a precious thing and will last longer than the systems themselves.”