Whenever we talk about data lineage and how to achieve it, the spotlight tends to shine on automation. That’s okay, as automating the process of calculating and establishing lineage is crucial to understanding and maintaining a trustworthy system of data pipelines. After all, lineage “Utopia” is to have everything automated, using various methodologies, so that lineage tracking evolves into a hands-off operation without human intervention. As a result, little is said about Descriptive, or manually derived, lineage, an equally important tool for delivering a comprehensive lineage framework. Unfortunately, Descriptive Lineage doesn’t get the attention or recognition it deserves. If you say “manual stitching” among data professionals, everyone cringes and runs!
In her book “Data Lineage from a Business Perspective”, Dr. Irina Steenbeek introduces the concept of Descriptive Lineage as “a method to record metadata-based data lineage manually in a repository.” Let’s dive deeper into the definition of Descriptive Lineage, learn why it is so important for attaining complete clarity across your data pipelines, and explore its most important use cases.
Descriptive Lineage of the past
When I first started working on lineage in the late 1990s, the team I was on was narrowly focused on one technology and use case: DataStage, an IBM extraction, transformation, and loading (ETL) tool primarily used for impact analysis within the domain of a single DataStage Project and a single set of users. That made things simple. We were playing in a closed sandbox, compiling a matrix of connected pathways that implemented a consistent approach to connectivity with a finite set of controls and operators. Automated lineage is more easily achieved when everything is consistent, from a single vendor, and with few unknown patterns. However, this is the equivalent of being blindfolded and locked in a closet!
That approach and viewpoint are now completely unrealistic, and frankly, useless. The modern data stack dictates that our lineage solutions be far more nimble, able to support a vast number of solutions. Now, lineage must be able to provide tools to connect things together using nuts and bolts when there aren’t any other methods.
Descriptive Lineage Use Cases
When discussing use cases for Descriptive Lineage, it is important to also consider the target user community for each. Below, the first two use cases are primarily aimed at a technical audience, as the lineage definitions apply to actual physical assets. The last two use cases are more abstract, at a higher level, and have direct appeal to less technical users interested in the big picture. Still – even low-level lineage for physical assets has value for everyone because it gets summarized by lineage tooling and bubbles up to “big picture” insights that are beneficial to the entire organization.
Critical (and quick) Bridges
The demand for lineage extends far beyond dedicated systems like the DataStage example above. However, the first of our use cases for Descriptive Lineage is often encountered in that single-tool scenario. Even there, you will still discover situations that cannot be covered by automation. Examples include rarely seen usage patterns understood only by deep experts of a particular tool, strange new syntax that parsers are unable to comprehend, short-lived but inevitable anomalies, missing chunks of source code, and complex wrappers around legacy routines and procedures. Simple scripted or manually copied sequential (flat) files are also covered by this use case.
Descriptive Lineage allows you to bind assets together that aren’t otherwise connected automatically. That goes for assets that aren’t connected either by an accident of the technology, a true missing link, or lack of permission to access actual source code. Descriptive Lineage, in this use case, is extending the lineage that we already have, making it more complete, filling gaps and crossing bridges. I also like to refer to this as hybrid lineage, which takes maximum advantage of automation while complementing that automation with additional assets and connection points (the “nuts and bolts”).
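To make the hybrid idea concrete, here is a minimal sketch in Python. It is not taken from any particular lineage product: the `Edge` record, the asset names, and the `downstream` helper are all hypothetical, invented for illustration. It shows how one hand-recorded edge can bridge a gap (a manually copied flat file) that an automated scanner cannot see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One lineage connection; `origin` records how it was captured."""
    source: str
    target: str
    origin: str  # "automated" or "descriptive"


# Hypothetical edges a scanner might have discovered automatically.
automated = [
    Edge("crm.customers", "staging/customers.csv", "automated"),
    Edge("dw.customer_dim", "reports.churn", "automated"),
]

# Hypothetical hand-recorded edge: the flat file is loaded into the
# warehouse by a manual copy that no parser can observe.
descriptive = [
    Edge("staging/customers.csv", "dw.customer_dim", "descriptive"),
]


def downstream(start, edges):
    """Return every asset reachable from `start` by following edges."""
    adjacency = {}
    for edge in edges:
        adjacency.setdefault(edge.source, []).append(edge.target)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Lineage stops at the flat file when only automated edges are used.
automated_only = downstream("crm.customers", automated)

# Adding the descriptive edge carries it through to the report.
hybrid = downstream("crm.customers", automated + descriptive)
```

Keeping the `origin` on every edge is a design choice in this sketch: it lets a reader of the combined graph see which connections were parsed and which were stitched by hand.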
Support for new tooling
Our ever-expanding technology portfolios present the next major use case for Descriptive Lineage. As our industry explores new domains and new solutions to squeeze every ounce of value out of our data, we see the proliferation of environments where everything is touching our data.
It is rare that a site has just one dedicated tool set as mentioned above. Data is being touched and manipulated by a myriad of things – transformation solutions both on-premises and in the cloud, databases and now data lakehouses everywhere, resources in still-living legacy systems and defunct or shiny new reporting tools. The sheer array of technologies in use today is mind-boggling and ever growing. Automated lineage across the spectrum may be the objective, but there aren’t enough vendors, practitioners, and solution providers out there to create the ultimate automation “easy button” for such a complex universe. Therefore, there is a need for Descriptive Lineage to define new systems, new data assets, and new connection points and connect them to what already has been parsed or tracked using automation.
Application-level lineage
Descriptive Lineage is also used for higher-level or application-level lineage, sometimes called business lineage. This is often difficult to realize using automation, precisely because there aren’t any fixed industry definitions for application-level lineage. The perfect definition of high-level lineage for one user or group of users may not fit the exact gem of a design that is envisioned by your lead data architects. Descriptive Lineage allows you to define the lineage that you need, at whatever depth is required.
This is truly fit-for-purpose lineage, and typically stays at very high levels of abstraction, perhaps not even mentioning anything deeper than a particular database cluster or the name of an application area. Lineage for certain parts of a financial organization might be very generic, leading to a target area called “Risk Aggregation”.
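As a rough illustration of lineage at this level of abstraction, the Python sketch below rolls a handful of physical edges up into application-level edges. It assumes a hand-maintained mapping from database to application area. The database names, the mapping, and the `roll_up` helper are hypothetical; only “Risk Aggregation” comes from the example above. This is one possible way to relate the two levels, not a prescribed method.

```python
# Hypothetical physical-level edges as (source asset, target asset).
physical_edges = [
    ("trading_db.positions", "risk_db.exposures"),
    ("trading_db.trades", "risk_db.exposures"),
    ("risk_db.exposures", "risk_db.aggregates"),
    ("ledger_db.balances", "risk_db.aggregates"),
]

# Hand-maintained (descriptive) mapping from database to application area.
# The names are assumptions made up for this example.
application_of = {
    "trading_db": "Trading",
    "ledger_db": "General Ledger",
    "risk_db": "Risk Aggregation",
}


def roll_up(edges, mapping):
    """Collapse asset-level edges into application-level edges."""
    app_edges = set()
    for source, target in edges:
        src_app = mapping[source.split(".")[0]]
        tgt_app = mapping[target.split(".")[0]]
        # Movement inside one application area is hidden at this level.
        if src_app != tgt_app:
            app_edges.add((src_app, tgt_app))
    return sorted(app_edges)


application_lineage = roll_up(physical_edges, application_of)
for src, tgt in application_lineage:
    print(f"{src} -> {tgt}")
```

The same application-level edges could just as well be recorded directly by hand, with no physical edges underneath them, when that is all the audience needs.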
Future Lineage
One more use case for Descriptive Lineage is “to-be”, or future lineage. The ability to model the lineage of your future applications (especially when realized in a hybrid form alongside your existing lineage definitions) helps you assess the work effort, measure potential impact on existing teams and systems, and track your progress along the way. Descriptive Lineage for future applications is not held back by the fact that the source code has not yet been returned or released, isn’t running in production, or is only outlined on a chalkboard. Future lineage can stand alone or be combined with existing lineage in the hybrid model described above.
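A small Python sketch of the combined model follows. It assumes each edge carries a `status` of either "existing" or "planned". The field names, the asset names, and the `impacted_existing_assets` helper are hypothetical. The point is only to show how planned edges recorded by hand can be checked against what already exists to see which current assets a future design would touch.

```python
# Hypothetical lineage edges; "planned" edges are recorded by hand
# before any code for them has been released.
edges = [
    {"source": "orders_db", "target": "sales_mart", "status": "existing"},
    {"source": "sales_mart", "target": "exec_dashboard", "status": "existing"},
    {"source": "sales_mart", "target": "forecast_service", "status": "planned"},
    {"source": "forecast_service", "target": "exec_dashboard", "status": "planned"},
]


def impacted_existing_assets(edges):
    """Return existing assets that at least one planned edge touches."""
    existing_assets = set()
    for edge in edges:
        if edge["status"] == "existing":
            existing_assets.update((edge["source"], edge["target"]))

    touched = set()
    for edge in edges:
        if edge["status"] == "planned":
            for asset in (edge["source"], edge["target"]):
                if asset in existing_assets:
                    touched.add(asset)
    return sorted(touched)


impacted = impacted_existing_assets(edges)
print(impacted)
```

As the planned edges are delivered, flipping their `status` to "existing" gives a simple way to track progress in the same model.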
These are just some of the ways that Descriptive Lineage complements your overall objectives for lineage visibility across the enterprise.
Benefits of Descriptive Lineage and next steps
In summary, Descriptive Lineage fills in the blanks, supports your future designs, bridges any gaps, and augments your overall lineage solutions. This yields deeper insights into your environment that lead to increased trust and the ability to make better business decisions.
When we return for the next post, we will discuss ways that Descriptive Lineage is implemented, and explore methodologies and approaches that ensure its success.
-ernie