sqllineage icon indicating copy to clipboard operation
sqllineage copied to clipboard

Support non-SELECT column-level lineage

Open IterableTrucks opened this issue 2 years ago • 5 comments

Until lastest version of 1.3.5, sqllineage analyzes SELECT based column-level lineage. But there is another type of llineage based on WHERE, ON and USING clause filtering (so-called "predicate pushdown" in Spark). Is there any plan on implementing this kind of lineage?

IterableTrucks avatar Jul 20 '22 15:07 IterableTrucks

Sorry for the late reply, I was out for medical leave.

The column level lineage in WHERE, ON, USING are not handled for the moment. Theoretically we can parse the information, keep it and expose it via some new API. But it does not fit into the DAG design.

What's your preferred way to get this information?

reata avatar Aug 21 '22 08:08 reata

I am realizing this is an issue that I am seeing in my needs as well.

For example, in the following code, the current lineage only says that SummaryTable.code_value is sourced from ReferenceTable.code_value, but it is equally helpful to know that it is also dependent on FactTable.code_value. If it were an option to include predicates like this in the LineageRunner function I would think including SummaryTable.code_value <- FactTable.code_value in the output would be very helpful.

INSERT INTO SummaryTable
SELECT
  f.ID
, r.code_value
, r.code_description
FROM FactTable AS f
INNER JOIN ReferenceTable AS r
  ON f.code_value = r.code_value

martinswan avatar Apr 09 '24 17:04 martinswan

Sorry for the late reply, I was out for medical leave.

The column level lineage in WHERE, ON, USING are not handled for the moment. Theoretically we can parse the information, keep it and expose it via some new API. But it does not fit into the DAG design.

What's your preferred way to get this information?

It might be a good idea to add a specific column in the target table and link the columns referred in the source tables to this column. This is what sqlflow does (see the dashed lines and the RelationRows columns):

image

piekill avatar Apr 23 '24 03:04 piekill

That's a good thought. As a similar alternative, perhaps a third level of lineage called something like "predicate-level lineage" that lists all the source fields that could have an effect on which rows are returned in the target table.

martinswan avatar Apr 23 '24 19:04 martinswan