sqllineage
sqllineage copied to clipboard
Support non-SELECT column-level lineage
Until lastest version of 1.3.5, sqllineage analyzes SELECT based column-level lineage. But there is another type of llineage based on WHERE, ON and USING clause filtering (so-called "predicate pushdown" in Spark). Is there any plan on implementing this kind of lineage?
Sorry for the late reply, I was out for medical leave.
The column level lineage in WHERE, ON, USING are not handled for the moment. Theoretically we can parse the information, keep it and expose it via some new API. But it does not fit into the DAG design.
What's your preferred way to get this information?
I am realizing this is an issue that I am seeing in my needs as well.
For example, in the following code, the current lineage only says that SummaryTable.code_value
is sourced from ReferenceTable.code_value
, but it is equally helpful to know that it is also dependent on FactTable.code_value
. If it were an option to include predicates like this in the LineageRunner function I would think including SummaryTable.code_value <- FactTable.code_value
in the output would be very helpful.
INSERT INTO SummaryTable
SELECT
f.ID
, r.code_value
, r.code_description
FROM FactTable AS f
INNER JOIN ReferenceTable AS r
ON f.code_value = r.code_value
Sorry for the late reply, I was out for medical leave.
The column level lineage in WHERE, ON, USING are not handled for the moment. Theoretically we can parse the information, keep it and expose it via some new API. But it does not fit into the DAG design.
What's your preferred way to get this information?
It might be a good idea to add a specific column in the target table and link the columns referred in the source tables to this column. This is what sqlflow does (see the dashed lines and the RelationRows columns):
That's a good thought. As a similar alternative, perhaps a third level of lineage called something like "predicate-level lineage" that lists all the source fields that could have an effect on which rows are returned in the target table.