Saving and Applying Data Lineage

Feature Description

Data lineage refers to the lineage-like relationships that data forms as it is generated, processed, circulated, and retired during data governance. Simply put, it explains where data comes from and what processing and analysis it has undergone. Knowledge Graph provides solutions for extracting, storing, querying, and visualizing data lineage.

SuperMap supports writing the data lineage relationships produced by GPA execution into graph databases. The resulting data graph visually displays data processing flows and provides upward tracing and downward tracking of datasets.

  • Upward tracing a queried dataset obtains its production chain, helping you identify its sources and locate root causes when data quality issues occur.
  • Downward tracking a queried dataset reveals its subsequent impacts, such as which derived datasets are affected if it changes.

Saving GPA data lineage to a graph database means representing GPA tools, datasets, and their attributes as graph entities, and data processing flows as graph relationships. For example, when an input precipitation dataset undergoes column-update processing to generate a new precipitation dataset, graph nodes represent the input data, the output data, and the GPA tool, connected by input/output relationships.
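This mapping can be sketched in Cypher as follows. The node labels (`Dataset`, `GPATool`), relationship types (`INPUT`, `OUTPUT`), and property names here are illustrative assumptions, not the exact schema SuperMap writes:

```cypher
// Illustrative sketch only: labels, relationship types, and properties
// are assumed for explanation, not SuperMap's actual lineage schema.
CREATE (src:Dataset {name: 'Precipitation'})
CREATE (tool:GPATool {name: 'UpdateColumn'})
CREATE (dst:Dataset {name: 'Precipitation_New'})
CREATE (src)-[:INPUT]->(tool)
CREATE (tool)-[:OUTPUT]->(dst)
```

Each GPA tool execution adds one tool node plus input/output edges, so chained tools naturally form a directed processing graph.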

Steps

  1. Enable data lineage:
    • GPA Tab -> Data Lineage: Check Save Data Lineage to write execution workflows into the graph database.
    • Or directly in Workspace Manager: Select a dataset -> Right-click -> Data lineage -> Save data lineage.
  2. If no graph database connection exists when you enable data lineage, connect to a graph database first; an active connection is required for saving and applying lineage.
  3. Execute GPA model.
  4. Query or explore lineage for result datasets: 


    • Knowledge Graph Tab -> Graph Query: Use Cypher queries.
    • Select a dataset in Workspace Manager -> Right-click -> Data lineage -> Trace Source or Track.
    • With the data lineage graph open, click a node in the graph window to view its attributes, such as data path, record count, coordinate system, extent, and execution time. Comparing attribute changes along a processing chain helps you explore how the data evolved.
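With lineage stored as input/output relationships, upward tracing and downward tracking reduce to variable-length path queries. A sketch using the same assumed labels and relationship types as above (hypothetical, not SuperMap's exact schema):

```cypher
// Upward tracing: recover the production chain of a result dataset.
MATCH path = (src:Dataset)-[:INPUT|OUTPUT*]->(d:Dataset {name: 'Precipitation_New'})
RETURN path;

// Downward tracking: find everything derived from a dataset.
MATCH path = (d:Dataset {name: 'Precipitation'})-[:INPUT|OUTPUT*]->(derived:Dataset)
RETURN path;
```

The variable-length pattern `[:INPUT|OUTPUT*]` follows alternating dataset and tool nodes across any number of processing steps, so the same query works for single-step and multi-step chains.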