elementary icon indicating copy to clipboard operation
elementary copied to clipboard

[Feature] Column Level Lineage

Open oravi opened this issue 2 years ago • 1 comments

Feature description

  • Currently the lineage is at the table level which means that if you need to change or monitor a specific column or a set of columns, you should manually search in the relevant tables how each column is transformed and what its current status in each step or table in your data flow.

  • In order to easily understand the upstream & downstream dependencies of specific columns and their live status (volume / freshness for example) we should also parse and extract dependencies between columns as part of the query processing phase.

  • Showing the entire column level lineage for an average environment could be overwhelming, we will probably start by presenting the lineage graph of chosen columns.

  • A new option in the CLI could be added to support selecting a specific column (similar to the table filter option), then using the direction / depth command line options the relevant upstream / downstream (or both) dependencies will be presented in the lineage graph.

Feature output

feature_output

Feedback

We would love to hear any feedback / comments / requests about this feature.

oravi avatar Oct 28 '21 16:10 oravi

We're using dbt right now, and column level lineage would be a killer feature add to that ecosystem. There are many, many person-hours devoted to "blast radius" analysis around questions like "Is this field from SFDC used in any reports? If so, which ones are it used in?" Currently, that's a manual process to trace the lineage throughout dbt, then manual process to evaluate which BI dashboards are consuming it. You can document exposures in dbt now, which gets you closer, but have the column lineage leading up to the data mart backing the "exposure" would be super valuable.

joshuamoore-procore avatar Nov 02 '21 16:11 joshuamoore-procore

This issue is stale because it has been open for too long with no activity. If you would like the issue to remain open, please remove the stale label or leave a comment.

github-actions[bot] avatar Dec 09 '22 02:12 github-actions[bot]

This will be a game changer if it gets rolled out.

kelvinheng92 avatar Feb 14 '23 13:02 kelvinheng92

Sqlglot has a lineage function that can parse through a DAG of sql views and produce column lineage across them. I imagine that library could be used to implement this feature.

AdeelK93 avatar May 27 '23 15:05 AdeelK93

@AdeelK93 - it looks like sqlglot can tell you which columns are used in a query, but I don't think it can tell you "columnA was used to generate columnB", which would be essential to traversing the lineage from columnA in the staging model to columnZ in the data mart. Please correct me if this is wrong.

I also see the sqllineage python package which has column level lineage, which might be helpful as a lower level tool: https://sqllineage.readthedocs.io/en/latest/first_steps/advanced_usage.html

tommyh avatar May 31 '23 13:05 tommyh

You should try it out! It does produce a whole lineage tree

AdeelK93 avatar May 31 '23 14:05 AdeelK93