Tracing Enterprise Data Footsteps! ......celebrating the journey of data!

Fine-tuning your data pipelines using Descriptive Lineage

June 22, 2023 — dsrealtime

Whenever we talk about data lineage and how to achieve it, the spotlight tends to shine on automation. That’s okay, as automating the process of calculating and establishing lineage is crucial to understanding and maintaining a trustworthy system of data pipelines. After all, lineage “Utopia” is to have everything automated, using various methodologies, so that lineage tracking evolves into a hands-off operation without human intervention. Consequently, little is often said about Descriptive, or manually derived lineage — an equally important tool for delivering a comprehensive lineage framework. Unfortunately, Descriptive Lineage doesn’t get the attention or recognition it deserves. If you say “manual stitching” among data professionals, everyone cringes and runs!

In her book “Data Lineage from a Business Perspective”, Dr. Irina Steenbeek introduces the concept of Descriptive Lineage as “a method to record metadata-based data lineage manually in a repository.” Let’s dive deeper into the definition of Descriptive Lineage, learn why it is so important for attaining complete clarity across your data pipelines, and explore its most important use cases.

Descriptive Lineage of the past

When I first started working on lineage in the late 1990s, the team I was on was narrowly focused on one technology and use case: DataStage, an IBM extraction, transformation, and loading (ETL) tool, with lineage primarily used for impact analysis within the domain of a single DataStage Project and a single set of users. That made things simple. We were playing in a closed sandbox, compiling a matrix of connected pathways that implemented a consistent approach to connectivity with a finite set of controls and operators. Automated lineage is more easily achieved when everything is consistent, from a single vendor, and with few unknown patterns. However, this is the equivalent of being blindfolded and locked in a closet!

That approach and viewpoint is now completely unrealistic, and frankly, useless. The modern data stack dictates that our lineage solutions be far more nimble, able to support a vast number of solutions. Now, lineage must be able to provide tools to connect things together using nuts and bolts when there aren’t any other methods.

Descriptive Lineage Use Cases

When discussing use cases for Descriptive Lineage, it is important to also consider the target user community for each. Below, the first two use cases are primarily aimed at a technical audience, as the lineage definitions apply to actual physical assets. The last two use cases are more abstract, at a higher level, and have direct appeal to less technical users interested in the big picture. Still – even low-level lineage for physical assets has value for everyone because it gets summarized by lineage tooling and bubbles up to “big picture” insights that are beneficial to the entire organization.

Critical (and quick) Bridges

The demand for lineage extends far beyond dedicated systems like the DataStage example above. However, the first of our use cases for Descriptive Lineage is often encountered in that single tool scenario. Even there, you will still discover situations that cannot be covered by automation. Examples include rarely seen usage patterns understood only by deep experts of a particular tool, strange new syntax that parsers are unable to comprehend, short lived but inevitable anomalies, missing chunks of source code, and complex wrappers around legacy routines and procedures. Simple scripted or manually copied sequential (flat) files are also covered by this use case.

Descriptive Lineage allows you to bind assets together that aren’t otherwise connected automatically. That goes for assets that aren’t connected either by an accident of the technology, a true missing link, or lack of permission to access actual source code. Descriptive Lineage, in this use case, is extending the lineage that we already have, making it more complete, filling gaps and crossing bridges. I also like to refer to this as hybrid lineage, which takes maximum advantage of automation while complementing that automation with additional assets and connection points (the “nuts and bolts”).
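To make the hybrid idea concrete, here is a minimal sketch, assuming lineage is stored as simple source-to-target edges. The asset names, the edge lists, and the helper functions are all hypothetical and are not taken from any particular lineage tool. It shows how a couple of hand-recorded edges can bridge a gap that an automated scanner left open.

```python
from collections import defaultdict

# Edges an automated scanner might produce (hypothetical asset names).
automated_edges = [
    ("crm.customers", "etl.load_customers"),
    ("etl.load_customers", "staging.customers"),
    ("warehouse.dim_customer", "report.churn"),
]

# Descriptive edges recorded by hand to bridge a gap the scanner could not
# see, for example a script that copies a flat file between systems.
descriptive_edges = [
    ("staging.customers", "file.customers_extract.csv"),
    ("file.customers_extract.csv", "warehouse.dim_customer"),
]


def build_graph(*edge_lists):
    """Merge any number of edge lists into one adjacency map."""
    graph = defaultdict(set)
    for edges in edge_lists:
        for source, target in edges:
            graph[source].add(target)
    return graph


def downstream(graph, start):
    """Return every asset reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


automated_only = build_graph(automated_edges)
hybrid = build_graph(automated_edges, descriptive_edges)

# Without the descriptive edges the trace stops at the staging table.
print("report.churn" in downstream(automated_only, "crm.customers"))  # False
print("report.churn" in downstream(hybrid, "crm.customers"))          # True
```

Keeping the two edge lists separate is a deliberate choice in this sketch: the manually stitched connections stay identifiable, so they can be reviewed or retired if automation later covers the same gap.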

Support for new tooling

Our ever-expanding technology portfolios present the next major use case for Descriptive Lineage. As our industry explores new domains and new solutions to squeeze every ounce of value out of our data, we see the proliferation of environments where everything is touching our data.

It is rare that a site has just one dedicated tool set as mentioned above. Data is being touched and manipulated by a myriad of things – transformation solutions both on-premises and in the cloud, databases and now data lakehouses everywhere, resources in still-living legacy systems and defunct or shiny new reporting tools. The sheer array of technologies in use today is mind-boggling and ever growing. Automated lineage across the spectrum may be the objective, but there aren’t enough vendors, practitioners, and solution providers out there to create the ultimate automation “easy button” for such a complex universe. Therefore, there is a need for Descriptive Lineage to define new systems, new data assets, and new connection points and connect them to what already has been parsed or tracked using automation.

Application-level lineage

Descriptive Lineage is also used for higher level or application-level lineage, sometimes called business lineage. This is often difficult to realize using automation, precisely because there aren’t any fixed industry definitions for application-level lineage. The perfect definition of high-level lineage for one user or group of users may not fit the exact gem of a design that is envisioned by your lead data architects. Descriptive Lineage allows you to define the lineage that you need, at whatever depth that is required.

This is truly fit-for-purpose lineage, and typically stays at very high levels of abstraction, perhaps not even mentioning anything deeper than a particular database cluster or the name of an application area. Lineage for certain parts of a financial organization might be very generic, leading to a target area called “Risk Aggregation”.
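As a rough illustration of lineage at this level of abstraction, the sketch below collapses asset-level edges into edges between application areas. The asset names, the area mapping, and the `roll_up` helper are assumptions made up for the example. In practice the high-level edges could just as well be recorded directly by hand.

```python
# Hypothetical mapping from physical assets to application areas.
asset_to_area = {
    "trades_db.positions": "Trading",
    "etl.aggregate_positions": "Risk Aggregation",
    "risk_db.exposure": "Risk Aggregation",
    "report.regulatory": "Regulatory Reporting",
}

# Hypothetical asset-level (technical) lineage edges.
technical_edges = [
    ("trades_db.positions", "etl.aggregate_positions"),
    ("etl.aggregate_positions", "risk_db.exposure"),
    ("risk_db.exposure", "report.regulatory"),
]


def roll_up(edges, mapping):
    """Collapse asset-level edges into area-level edges, dropping self-loops."""
    rolled = set()
    for source, target in edges:
        src_area, tgt_area = mapping[source], mapping[target]
        if src_area != tgt_area:
            rolled.add((src_area, tgt_area))
    return sorted(rolled)


print(roll_up(technical_edges, asset_to_area))
# [('Risk Aggregation', 'Regulatory Reporting'), ('Trading', 'Risk Aggregation')]
```

The edge inside “Risk Aggregation” disappears in the rolled-up view, which is the point: a less technical reader sees three areas and two flows rather than every table and job.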

Future Lineage

One more use case for Descriptive Lineage is “to-be”, or future lineage. The ability to model the lineage of your future applications (especially when realized in a hybrid form alongside your existing lineage definitions) helps you assess the work effort, measure potential impact on existing teams and systems, and track your progress along the way. Descriptive Lineage for future applications is not held back by the fact that the source code has not yet been written or released, isn’t running in production, or is only outlined on a chalkboard. Future lineage can stand alone or also be combined with existing lineage in the hybrid model described above.
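A small sketch of how “to-be” lineage might sit alongside existing lineage, assuming each edge simply carries a state label. The `Edge` class, the `"existing"` and `"planned"` labels, and the asset names are hypothetical. The helper finds the existing assets that a planned design would touch, which is one simple way to estimate impact on current teams and systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    state: str  # "existing" or "planned" (hypothetical labels)


edges = [
    Edge("orders_db.orders", "etl.nightly_load", "existing"),
    Edge("etl.nightly_load", "warehouse.fact_orders", "existing"),
    # Descriptive, future lineage: nothing below has been built yet.
    Edge("warehouse.fact_orders", "ml.demand_forecast", "planned"),
    Edge("ml.demand_forecast", "report.replenishment", "planned"),
]


def existing_assets_touched_by_plans(edges):
    """Return assets that appear in both existing and planned edges."""
    existing = {a for e in edges if e.state == "existing" for a in (e.source, e.target)}
    planned = {a for e in edges if e.state == "planned" for a in (e.source, e.target)}
    return sorted(existing & planned)


print(existing_assets_touched_by_plans(edges))  # ['warehouse.fact_orders']
```

As the planned work is delivered, the same edges can be relabeled rather than re-entered, which gives a simple way to track progress.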

These are just some of the ways that Descriptive Lineage complements your overall objectives for lineage visibility across the enterprise.

Benefits of Descriptive Lineage and next steps

In summary, Descriptive Lineage fills in the blanks, supports your future designs, bridges any gaps, and augments your overall lineage solutions. This yields deeper insights into your environment that lead to increased trust and the ability to make better business decisions.

When we return for the next post, we will discuss ways that Descriptive Lineage is implemented, and explore methodologies and approaches that ensure its success.

-ernie

Posted in RealTime.

New Horizons for Lineage!

May 11, 2023 — dsrealtime

Hi Everyone,

It has been a while since I’ve written any observations about lineage and our world of data integration and metadata. I’ve been spending time these past four years listening and learning. During that time I have been able to gather more insights about lineage, data governance and management, the evolution of open-source (for lineage and metadata), and hot new trends like “Observability” and “Active Metadata.”

(see my prior post from four years ago, “What is Data Lineage? (a reprise)”: https://wordpress.com/post/dsrealtime.wordpress.com/921 )

With that in mind, it’s time to take this blog to places it hasn’t been before — to new technologies and methods for defining, qualifying, and establishing lineage. But also to new use cases and new ways of thinking about meeting the challenges being faced while understanding your data pipelines.

I have gained a deeper appreciation for tracking lineage (not only for Classic DataStage as covered for many years in this blog) but also for a myriad of other Extract-Transform-Load (ETL) solutions, known and unknown relational databases, a vast array of long outdated on-premises tools, sparkling new cloud-based data shaping solutions, as well as the next generation of DataStage!

Stay tuned! It’s been a fun journey and fulfilling learning experience for me, and I look forward to sharing it with you, starting with a series of posts that look closely at the need for something called Descriptive Lineage.

In the meantime, have some fun with this recent experiment in presenting lineage in Virtual Reality!

Futuristic demo of lineage in three dimensions! Zoom session VR lineage demo. Watch this lab session where you can see the end result while also watching the speaker wearing a VR headset.
Short teaser video for Lineage VR: The Future of Lineage?
Try it yourself! Move your mouse to look around in this active 360 degree view! https://www.youtube.com/watch?v=7kajf8tT-HY

-Ernie

Posted in RealTime.

What is Data Lineage? (a reprise)

July 27, 2019 — dsrealtime

(from Merriam-Webster) re-prise: a recurrence, renewal, or resumption of an action

Hi Everyone.

Ten years ago I posted an entry called “What exactly is Data Lineage?” Ten years!

https://dsrealtime.wordpress.com/2009/12/15/what-exactly-is-data-lineage/

Since that time, the concept of lineage has evolved and grown and taken on more meanings. Lineage is now a major topic in every conversation that surrounds data, regulatory compliance, governance, metadata management, decision support, artificial intelligence, data intelligence, data quality, machine learning, and much, much more. Let’s quickly review what ten years has done to affect the definition of “lineage”…

Ten years ago, we barely started uttering the words “information governance” or “data governance”. Today, “governance” is just one small part of the lineage equation.
Ten years ago, Hadoop and Data Lakes were in their infancy, and we were just starting to grasp the explosion of data we are swimming in today.
Ten years ago, we were exploring the display of lineage on our laptops, and “maybe” on our Blackberries. Today we expect graphical rendering on any device.
Ten years ago, many questioned whether we would need lineage for COBOL and other legacy systems alongside lineage for modern ETL tooling and coding methods. Now we demand lineage for everything!
Ten years ago, we didn’t think there would be any chance for metadata and lineage standardization. Today there are initiatives underway for common models and metadata sharing protocols.
Ten years ago, we weren’t thinking about lineage for ground-to-cloud, Spark, or lineage to illustrate decisions made by data citizens building machine learning or predictive analytic models. Today we are spawning new methods in open source and data science that demand lineage engineering.
Ten years ago…(I am sure you can think of many more…)…

Ten years ago! Whew.

Today there are a multitude of web sites where you can dive into the topic of lineage. Depending on your background, or interest, you can find resources pointing to everything from the calculation of lineage and its representation within a mathematical graph to the use of lineage for predicting bottlenecks and potential security breaches, and everything in-between. You will find many definitions of lineage and its nuances.

Here are the major areas and definitions of lineage that are trending at my customers.

Data Lineage. The basic definition of data lineage has remained constant, albeit with a lot of sub-divisions and “extended” descriptions. Data Lineage is a representation of the “flow” of data, with the ability to trace what “happened” (or will happen) to that data, going back to “where it came from” or illustrating “where it goes to”. The “extensions” of that definition generally branch out in terms of the level of granularity and the kind of lineage that is being tracked, traced, or followed. I won’t try to list them all here — it would be redundant. I encourage you to look at your own requirements, and then what YOUR users need.

Regarding level of granularity, how deep does your lineage need to go? How many different ways does it need to be rendered? Do you need low level technical lineage that drills into individual expressions and the actual “if…then…else” syntax that exists in your source code? Or are users overwhelmed by that much information and need a higher level “Business Lineage” or “Conceptual Lineage” showing the general flow of data through your information lifecycle or the logical handling of your critical assets? Do you need both? Can you achieve either of these levels of granularity automatically? Are parsers/scanners/bridges available? Do you even have access to the source of your integration programs if an automated solution exists? As you look at lineage solutions or build your own, first understand the granularity you want and need, based on your consumers and their use cases.

Regarding the types of lineage, what are you trying to achieve? As with the topic of granularity, determine what “kinds” of lineage YOUR users require. Here are just a few of the types of lineage that are practiced and/or are being discussed.

Design Lineage. What the code, process or ‘thing’ you are exploring for lineage is “supposed” to do.
Operational or Run-time lineage. What the code, process or ‘thing’ you are exploring for lineage actually “did” (last night, last week, last version, last <fill in the blank>). This discussion usually gets deep into capturing actual run-time parameter values.
Process Flow Lineage (flow of control, as opposed to flow of data). Which processes call other processes? How are your systems initially invoked or “kicked off”? This will then also have its own design vs runtime considerations.
…and finally, a type that is being driven harder and faster by the growing concerns for the handling of personal data. This is “Value-based Lineage” or “Data Provenance”. Years ago, this was largely in the domain of customer relationship management. It refers to the ability to trace how a “specific, individual” record flows or flowed through the system. This is of course critical now for GDPR, CCPA and similar efforts to “really know” where particular personal information lies and where it is going.
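The difference between design lineage and operational lineage can be sketched as a comparison of two edge sets. This is only an illustration under simple assumptions: the job and table names are invented, and the run-time edges are imagined to come from a log of one execution in which a parameter resolved to a temporary table.

```python
# Design lineage: what the job is "supposed" to do (hypothetical names).
design_edges = {
    ("src.accounts", "job.load_accounts"),
    ("src.rates", "job.load_accounts"),
    ("job.load_accounts", "dw.accounts"),
}

# Operational lineage: what one run actually "did". Here a run-time
# parameter pointed the target at a temporary table, and the rates
# source was never read.
runtime_edges = {
    ("src.accounts", "job.load_accounts"),
    ("job.load_accounts", "dw.accounts_tmp"),
}

designed_but_not_observed = sorted(design_edges - runtime_edges)
observed_but_not_designed = sorted(runtime_edges - design_edges)

print(designed_but_not_observed)
# [('job.load_accounts', 'dw.accounts'), ('src.rates', 'job.load_accounts')]
print(observed_but_not_designed)
# [('job.load_accounts', 'dw.accounts_tmp')]
```

Either list being non-empty is a prompt for a conversation, not necessarily an error: the design may be out of date, or the run may have used parameter values nobody expected.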

Why all these new definitions and branching disciplines? During these ten years, the domain of lineage has not stopped growing. We are coding fast and furiously with new tools and new environments (without doing lineage “up front” where it would be less expensive and simpler to implement), and we are also continually realizing new use cases and solutions where data lineage can provide value and insight. Lineage is not “just” for impact analysis, and it is no longer “just” for improved decision making and data quality. Its value for regulatory compliance, actionable data management, performance analysis, data protection, and more, is just starting to be realized.

Besides new “things” to scan and parse, what is next? Expect to see more progress with “open metadata” and standardization. Apache Atlas and now ODPi Egeria (https://egeria.odpi.org/) are leading to multi-vendor development of a common model for sharing — not only general metadata, but also lineage information. This offers the promise of untangling complex efforts to ingest, reconcile, and normalize lineage details from diverse and otherwise incongruent repositories.

The next challenge will be learning how to better exploit our increasing insight into lineage. What are we doing with the insight? How are we taking advantage of what lineage delivers? What “should” we be doing with it?

I am looking forward to the next steps in this journey!

Ernie

*** update ***

Hi everyone. Several weeks ago I left IBM for a new opportunity with MANTA Software ( www.getmanta.com ). I am looking forward to continuing to drive innovations in data lineage and common understanding of data to meet all the challenges above and more! Thank you to everyone for your support and encouragement! -ernie

Posted in RealTime.

Open Metadata Sharing with ODPi/Egeria and IGC

October 2, 2018 — dsrealtime

Hi everyone…..here to share with you the continuing evolution of Open Metadata and its soon-to-be-released implementation for Information Server 11.7 and the Information Governance Catalog (IGC).

Recently I was given the opportunity to start working with what is being called the igc-omrs-connector, an implementation of ODPi/Egeria and its Open Metadata Repository Services (OMRS) APIs that enable IGC to be the first OMRS-Compliant repository!

The link below points to a demonstration that illustrates the real-time and bi-directional sharing of metadata between two instances of IGC. It reviews several key concepts of ODPi/Egeria and OMRS (such as the meaning of a “cohort”) and then dives deeper to illustrate and “watch” (with windows into the Kafka topics that help enable OMRS sharing) the sharing of metadata between the repositories using OMRS.

The 15 minute recording starts with a brief overview of ODPi/Egeria and OMRS, and then moves into the actual demonstration to illustrate the sharing of technical database metadata. It continues afterwards with the sharing of glossary information and then the assignment of business terms to technical assets.

The video highlights ODPi/Egeria for the sharing of metadata between two instances of IGC, which is attractive for its application of real-time, bi-directional, and automated metadata exchange; however, the real value is when (in the next few months) Apache Atlas and other repositories _also_ become OMRS-Compliant — thus enabling metadata sharing among _independent_ repositories! This is where the benefits of Open Metadata will be fully realized. Costs associated with building and maintaining custom bridging solutions can be reduced, and developers, business users and data scientists alike will be able to more easily find, understand and validate valuable data assets and their meaning while further exploiting metadata driven solutions throughout their organization.

https://www.youtube.com/watch?v=P_RhQXXEbd4&t=11s

Stay tuned. As additional compliant repositories come on-line, I will profile and (where possible) demonstrate their capabilities here, and continue this important discussion!

Thanks.

-ernie

Posted in Data Governance, Information Governance, metadata, Metadata Management. Tags: apache atlas, Egeria, igc, information governance catalog, ODPi/Egeria, Open Metadata.

Please welcome Egeria!

August 30, 2018 — dsrealtime

Hi all…

Please welcome Egeria, the new open source project for “Open Metadata”! Initially maturing as part of the Apache Atlas project, Egeria has now evolved and “grown up”! It is now its “own” metadata sharing initiative, and is part of The Linux Foundation and ODPi. An increasing number of vendors and participating companies are getting involved, and Information Server/IGC and Apache Atlas are both adopting Egeria and its APIs to become “Egeria Compliant”. I will be specifically sharing news of IGC’s support of this important technology on these pages, and hope soon to have some video recordings and other information that outline the functionality. In the meantime, please explore the Egeria pages, help continue the discussion, and consider joining the effort!

Check out this blog post about the initial release…

https://www.odpi.org/blog/2018/08/27/first-release-of-odpi-egeria-is-here#.W4UxBla6ZJ4.twitter

Ernie

Posted in RealTime.

Apache Atlas .8 and integration with IGC !

May 18, 2018 — dsrealtime

Hi all…

Recently we made a metadata bridge available between the Information Governance Catalog (IGC) and Apache Atlas .8. Apache Atlas continues to evolve, and its 1.x implementation will soon be arriving, but many sites are using .8 today and would like to achieve at least “some” level of automated integration between Hadoop and their Information Server repository, even if it is just to evaluate the possibilities for greater enterprise governance. The link below points to a recording that illustrates the sharing of metadata between IGC and Apache Atlas .8.

https://youtu.be/ZAH3ui1gUm0

The interface to Apache Atlas 1.x, which is being worked on, will operate in a similar fashion, though of course using all that the evolving Apache Atlas 1.x has to offer regarding Open Metadata. Check out this link for more details on Open Metadata at the Apache Atlas Wiki….

Thanks!

Ernie

Posted in RealTime.

THINK 2018 Roundup

March 24, 2018 — dsrealtime

Hi everyone!

Just returned from a great week at our annual user conference in Las Vegas! This was a much larger event than in the past, as it encompassed far more of the IBM portfolio than just Information Server or Analytics. The halls and sessions across the venue were more crowded, but there was a lot more to learn! Besides entirely new topics on the technology and issues facing our business, there were sessions on the use and integration of all parts of Information Server and other offerings across the IBM and partner portfolio. I always learn more things about this platform at this event and enjoy hearing about our customers’ successes while providing insight and assistance where I can.

Specific to the Unified Governance and Integration area, THINK 2018 was a major opportunity to showcase Information Server release 11.7. There were demos and sessions on everything from the integration of structured and un-structured data to the new Data Flow Designer for DataStage and machine learning. In the next few weeks I will post details on my experiences with these new capabilities, especially as they relate to governance and metadata management.

Another exciting moment at this year’s conference was the opportunity to see and hear more about the continued evolution of Apache Atlas. Members of the Apache Atlas team (IBMers, partners and customers) conducted a hands-on lab that highlighted the progress being made on Open Metadata. This included a powerful use case incorporating Apache Atlas and Ranger for dynamic data masking, and also illustrated the integration of Apache Atlas with the InfoSphere Information Governance Catalog. This is a working “proof point” for how independent repositories can share metadata using the Open Metadata APIs. The team also participated in a panel discussion on the value of ODPi, where all of us interested in governance can contribute to the success of Open Metadata.

Several members of the team participated in a book signing for a new “Redguide” that they recently authored, which reviews important use cases for Open Metadata and brings us up to date on the current progress of the Apache Atlas initiative! This is a MUST READ for all of us who are passionate about metadata and governance!

–ernie

Check out the links below for further information and details.

Apache Atlas…open source metadata and governance! http://atlas.apache.org/

The new Redguide, The Journey Continues http://www.redbooks.ibm.com/Abstracts/redp5486.html?Open

Open Metadata (…at the Apache Atlas Wiki)

https://cwiki.apache.org/confluence/display/ATLAS/Open+Metadata+and+Governance

ODPi Project for Data Governance https://www.odpi.org/projects/data-governance-pmc

Posted in RealTime.

Lost in translation?

December 7, 2017 — antonknopf

editor’s note: It gives me great pleasure to introduce Beate Porst, a good friend and colleague, who is the Offering Manager for DataStage and other parts of the Information Server platform. Beate will be sharing her insights into Unified Governance and Integration, based on many years of experience with this platform and the issues surrounding data transformation and management. Today she introduces some of the key new capabilities of Information Server v11.7. Please welcome Beate to dsrealtime! –ernie

How IBM Information Server v11.7 could have saved NASA’s 125-million-dollar Mars orbiter from becoming lost.

We all know the slogan: Measure twice, cut once. What if we do but don’t know the context of our data?

That is what happened to NASA in 1999. The numbers were right, but their context was not: the 125-million-dollar Mars orbiter was designed to use the metric system, while mission control performed course corrections using the imperial system. The resulting altitude was too low, and contact with the orbiter was lost. An embarrassing moment for NASA.

But it wasn’t the only incident. In 2003, German and Swiss engineers started to build a bridge over the river Rhine in the border town of Laufenburg. Each country started to build the bridge on its side with the goal of meeting in the middle. That was the plan. Engineers used “sea level” as the reference point. The problem is that sea level in Germany is based on the North Sea, whereas in Switzerland it is based on the Mediterranean, resulting in a 27 cm difference. Now, builders in Germany knew the difference but apparently not whether to add or subtract that difference from their base. So they made the wrong choice.
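The arithmetic behind that wrong choice is worth spelling out. The sketch below uses the 27 cm figure from the story. The function name and the sign convention (a positive correction is the right one) are assumptions made only for illustration. It shows that applying a known offset in the wrong direction does not merely fail to fix the gap; it doubles it.

```python
# Difference between the two national sea-level references, in centimetres.
OFFSET_CM = 27


def mismatch_cm(applied_correction_cm, required_correction_cm=OFFSET_CM):
    """Height mismatch where the two halves meet, for a given correction.

    Assumes (for illustration only) that the correct adjustment is
    +OFFSET_CM; which physical direction that corresponds to is not
    modelled here.
    """
    return abs(required_correction_cm - applied_correction_cm)


print(mismatch_cm(+27))  # 0: correction applied in the right direction
print(mismatch_cm(0))    # 27: difference ignored altogether
print(mismatch_cm(-27))  # 54: correction applied with the wrong sign
```

The value itself was known and correct. What was missing was the context that says how to apply it, which is the same kind of metadata a unit of measure provides.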

Historical documents show that using out of context, incomplete or inaccurate data has caused problems ever since mankind started to develop different units of measurement.

Now the question is how can you avoid costly incidents such as the above and successfully conquer your data problems and how can IBM Information Server help you in that journey?

Whether you want to build a bridge, send an orbiter to Mars or simply try to identify new markets, you will only be as good as the data you use. This means it must be complete, in context, trusted and easily accessible in order to drive insights. As if this isn’t challenging enough, your competitiveness also depends on your organization’s ability to quickly adapt to changing conditions.

For more than a decade, IBM InfoSphere Information Server has been one of the market-leading platforms for data integration and governance. Users have relied on its powerful and scalable integration, quality and governance capabilities to deliver trusted information to their mission critical business initiatives.

John Muir once wrote: “The power of imagination makes us infinite”. We have applied our power of imagination to once again reinvent the Information Server platform.

As business agility depends on the flexibility, autonomy, competency, and productiveness of the tools that power your business, we have infused Information Server’s newest release with a number of game changing inventions which include deeper insights into the context and relationships amongst your data, increased automation for your users to complete their work faster and safer, and more flexible workloads for higher resource optimization. All of those are aimed at making your business more successful when tackling your most challenging data problems.

Let’s look at 4 of those game changing inventions and how they are going to help your business:

Contextual Search: Out of context data was the leading cause of error for NASA’s failed mission. The new contextual search feature called Enterprise Search provides your users with the context to avoid such costly mistakes. It greatly simplifies and accelerates the understanding, integration, and governance of enterprise data. Users can visually search, explore and easily gain insights through an enriched search experience powered by a knowledge graph. The graph provides context, insight and visibility across enterprise information giving you a much better understanding and awareness of how data is related, linked, and used.
Cognitive Design: Getting trusted data to your end users quickly is an imperative. This process starts with your integration design environment. To help address your data integration, transformation or curation needs quickly, Information Server V11.7 now includes a brand new versatile designer, called DataStage™ Flow Designer. It features an intuitive, modern, and secure interface accessible to all users through a no-install, browser-based experience, accelerating your users’ productivity through automatic schema propagation, highlighted design errors, powerful type ahead search as well as full backwards compatibility to the desktop version of the DataStage™ Designer.
Hybrid Execution: Data Warehouse optimization is one of the leading use cases to address growing data volumes while simplifying and accelerating data analytics. Once again, Information Server V11.7 has strengthened its ability to run on Hadoop with a set of novel features to more efficiently operationalize your Data Lake environment. Amongst those is an industry-unique hybrid execution feature which lets you balance integration workloads across Hadoop and non-Hadoop environments, aimed at minimizing data movements and optimizing your integration resources.
Automation powered by machine learning: Poor data quality is known to cost businesses millions of dollars each year. The inadvertent use of different units of measurement for the Mars orbiter was ultimately a data quality problem. However, the high manual work combined with exponential data growth continues to be an inhibitor for businesses to maintain high data quality. To counter this, Information Server V11.7 is further automating the data quality process, by underpinning data discovery and classification with machine learning, so that you can spend your time focusing on your business goals. The two innovative aspects are:

Automation rules which let business users define graphical rules which then automatically apply data rule definitions and quality dimensions to data sets based on business term assignments and

One-click automated discovery which enables discovery and analysis of all data from a connection in one click providing easy and fast analysis of hundreds or thousands of data sets

Don’t want to get lost in translation? Choose IBM Information Server V11.7 for your next data project.

Posted in Data Governance, datastage, general, Information Governance, RealTime.

Apache Atlas Update: Have you been watching?

November 16, 2017 — dsrealtime

It has been a while since I’ve written anything. Time to “catch up!”

A lot has been happening in the world of metadata management and governance. We are now seeing many real life use cases, as machine learning, intelligent data classifications, graph database technology and more are being applied to the information governance domain. Efforts for standardization in the metadata and governance space are moving forward also. For this post, let’s take a look at Apache Atlas.

Apache Atlas continues to mature, celebrating several major milestones in 2017. Shortly after its second birthday (Apache Atlas was launched as an incubator project in May of 2015), Apache Atlas graduated to top level project status, signifying that the project’s community and products have been well-governed under the Apache Software Foundation’s (ASF) meritocratic process and principles. This is evidence of the hard work performed by the collective Apache Atlas team, and a sign that Apache Atlas is increasingly ready for real world implementations. Of course, that milestone, while worthy of recognition, is just one of the many steps Atlas has taken, and continues to take, going forward. Here are other significant developments for Apache Atlas this year:

Introduction of OMRS and its other complementary APIs. OMRS (Open Metadata Repository Services) is a key part of the Open Metadata framework that introduces the notion of repository metadata sharing and access. In the true spirit of Apache communities, Apache Atlas is not alone in the world of enabling information governance; sharing of metadata between diverse metadata repositories can now be realized, in addition to simpler federation of metadata across multiple Atlas repositories.
New common models for critical types of metadata. To facilitate metadata sharing via OMRS, and to establish a more widely adaptable set of asset definitions, the Atlas team agreed on common definitions for data structures, processes, and other data asset attributes. This increases the likelihood that integrators building interfaces to Atlas will choose a common type definition for their content instead of designing their own custom types, while still providing extension points if needed.
New Glossary Model. A detailed new glossary model was designed (and API implemented) for a stronger semantic layer. Business concepts and their relationships are the cornerstone of disciplined information governance.
Streamlining of the Apache Atlas infrastructure. The underlying graph database implementation was upgraded to take maximum advantage of JanusGraph, itself becoming the leading standard for open source graph engines.
Continued/ongoing clean-up of the install and build procedures. Considering the wider adoption of Apache Atlas throughout the governance community, the Atlas team has enhanced its test suites to ensure that newly added functionality is well tested, and has streamlined the build and install processes, for example by packaging and building Apache Atlas within Docker containers.
The number of new Committers! Apache, as everyone knows (or should know), is a meritocracy. This means that recognition and influence are determined by an acknowledged investment of time, effort, and contributions. Formal recognition as a committer requires many months of hard work moving a project forward. Congratulations to all the new Committers this year! Even more important, the increase in Committers and contributors overall is yet another illustration of how Apache Atlas is growing in importance and general industry awareness.
The Virtual Data Connector use case. Self-service data exploration environments need to provide an integrated view of data from many different systems and organizations. Access is needed in order to discover new uses and interesting patterns in the data. The VDC project aims to provide a single endpoint for accessing data that presents a virtualized view of the data assets with the appropriate data security. This is accomplished by extending the integration of Apache Atlas with Apache Ranger via the tag-based security access introduced in Apache Atlas in 2016, in order to provide security access based on the classification tags (e.g., PII and SPI tags, subject area of the data, etc.). An additional plug-in is added to Apache Atlas to control access to metadata based on whether an end user is allowed to discover a data source’s metadata.
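
The tag-based access idea behind the Virtual Data Connector can be sketched in a few lines of Python. This is only an illustration of the general technique, not the Apache Ranger or Apache Atlas API: the `Asset` class, the `TAG_POLICY` mapping, and the role names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """A data asset carrying classification tags (hypothetical model)."""
    name: str
    tags: frozenset = frozenset()


# Hypothetical policy: classification tag -> roles allowed to see assets with that tag.
TAG_POLICY = {
    "PII": {"privacy_officer"},
    "SPI": {"privacy_officer", "security_analyst"},
}


def can_access(user_roles, asset, policy=TAG_POLICY):
    """Allow access only if the user holds a permitted role for every governed tag."""
    for tag in asset.tags:
        allowed_roles = policy.get(tag)
        if allowed_roles is not None and not (set(user_roles) & allowed_roles):
            return False
    return True


def discoverable(user_roles, assets, policy=TAG_POLICY):
    """Filter a catalog down to the assets whose metadata the user may discover."""
    return [a.name for a in assets if can_access(user_roles, a, policy)]


catalog = [
    Asset("customer_emails", frozenset({"PII"})),
    Asset("health_claims", frozenset({"PII", "SPI"})),
    Asset("store_locations"),  # untagged, so visible to everyone
]
analyst_view = discoverable({"security_analyst"}, catalog)
officer_view = discoverable({"privacy_officer"}, catalog)
```

Because the policy is written against tags rather than individual assets, classifying a new data set as PII is enough to bring it under the same protection, with no per-asset rule to maintain.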

So... it’s been a very busy year for Apache Atlas. While most of these capabilities have already been developed and are being tested, they will become generally available in the upcoming Apache Atlas v1.0, which will be a huge milestone release for the community. The project is maturing and gaining increased attention across the industry, in the information governance space and beyond. The code continues to mature, with adoption and the variety of applications increasing every week. The critical mass of industry expertise contributing to Apache Atlas continues to grow. Start watching! Start playing! Join in and help Apache Atlas reach its next set of milestones!

–Ernie

Main Apache Atlas web site
Atlas Wiki

Links to specific Apache Atlas Topics

Open Metadata and Governance
Link to more details on OMRS
Building Out the Open Metadata Typesystem
Virtual Data Connector

Posted in RealTime.

New Governance Blog covering IGC

November 15, 2017 — dsrealtime

Hi everyone... here’s a pointer to another IGC and Governance resource written by some of my IBM colleagues. This post includes details on the advantages of using OpenIGC to extend governance to any kind of asset: https://ibm.co/2AGpdaq . Happy reading!

Ernie

Posted in RealTime.
