[KYUUBI #4693] Enhanced the table lineage for input tables
close #4693
Why are the changes needed?
How was this patch tested?
- [x] Add some test cases that check the changes thoroughly including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [x] Run tests locally before making a pull request
Codecov Report
Merging #4694 (44886f7) into master (f0615a9) will decrease coverage by 0.06%. The diff coverage is 76.25%.
@@ Coverage Diff @@
## master #4694 +/- ##
============================================
- Coverage 57.99% 57.93% -0.06%
Complexity 13 13
============================================
Files 580 580
Lines 32218 32268 +50
Branches 4304 4322 +18
============================================
+ Hits 18684 18696 +12
- Misses 11749 11767 +18
- Partials 1785 1805 +20
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...in/lineage/helper/SparkSQLLineageParseHelper.scala | 63.90% <76.25%> (+4.53%) | :arrow_up: |
... and 13 files with indirect coverage changes
I think we should define lineage clearly. The current lineage is easy to follow because we only consider the plan's output: if a column is used only for filtering, sorting, or anything else that does not affect the output schema, we ignore it.
So if we want to consider those columns, how about adding a new mode that fully extracts column and table lineage? For example:
INSERT INTO TABLE t
SELECT c1 FROM t1 WHERE c2 > 0 ORDER BY c3
-- The lineage should be:
ColumnUsage(to: String, from: String, usage: String)
Lineage(
List("default.t1"),
List("default.t"),
List(
ColumnUsage("c1", "default.t1.c1", "OUTPUT"),
ColumnUsage("N/A", "default.t1.c2", "PREDICATE"),
ColumnUsage("N/A", "default.t1.c3", "ORDERING")
)
)
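To make the proposal above concrete, here is a minimal Scala sketch of how such a model could look. The `ColumnUsage` fields come from the comment; the `Lineage` field names, the `LineageExample` object, and the `outputOnly` helper are hypothetical names chosen for illustration, not the plugin's actual API.

```scala
// Hypothetical model of the proposed "full" lineage mode.
// Only the ColumnUsage fields are taken from the comment above;
// everything else is an illustrative assumption.
final case class ColumnUsage(to: String, from: String, usage: String)

final case class Lineage(
    inputTables: List[String],
    outputTables: List[String],
    columnUsages: List[ColumnUsage])

object LineageExample {
  // Lineage for:
  //   INSERT INTO TABLE t SELECT c1 FROM t1 WHERE c2 > 0 ORDER BY c3
  val full: Lineage = Lineage(
    List("default.t1"),
    List("default.t"),
    List(
      ColumnUsage("c1", "default.t1.c1", "OUTPUT"),
      ColumnUsage("N/A", "default.t1.c2", "PREDICATE"),
      ColumnUsage("N/A", "default.t1.c3", "ORDERING")))

  // Output-only view: keep just the usages that feed the plan's output,
  // which roughly mirrors what the current mode reports.
  def outputOnly(lineage: Lineage): Lineage =
    lineage.copy(columnUsages = lineage.columnUsages.filter(_.usage == "OUTPUT"))
}
```

With a shape like this, the current behavior would be a filtered view of the full mode, so both could be served from the same extraction pass.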
Yes, from the perspective of column output, the current lineage is clear. The main purpose of this PR is to analyze lineage from the table perspective and to treat every table involved in the SQL as an input table.
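A rough sketch of the difference between the two views, under assumptions: `ColumnRef`, `InputTables`, and the sample query are hypothetical and only illustrate the idea, they are not code from this patch.

```scala
// Hypothetical record: the table a column comes from and how it is used.
final case class ColumnRef(table: String, column: String, usage: String)

object InputTables {
  // Output-column view: only tables that contribute to the plan's output.
  def fromOutput(refs: Seq[ColumnRef]): Seq[String] =
    refs.filter(_.usage == "OUTPUT").map(_.table).distinct

  // Table-lineage view: every table referenced anywhere in the statement.
  def fromAll(refs: Seq[ColumnRef]): Seq[String] =
    refs.map(_.table).distinct

  // Illustrative references for a query such as:
  //   SELECT c1 FROM t1 WHERE c2 IN (SELECT c2 FROM t2)
  val sample: Seq[ColumnRef] = Seq(
    ColumnRef("default.t1", "c1", "OUTPUT"),
    ColumnRef("default.t1", "c2", "PREDICATE"),
    ColumnRef("default.t2", "c2", "PREDICATE"))
}
```

Here `fromOutput(sample)` yields only `default.t1`, while `fromAll(sample)` also includes `default.t2`, which is the kind of table this PR aims to report as an input.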