kyuubi [KYUUBI #4693] Enhanced the table lineage for input tables

close #4693

Why are the changes needed?

How was this patch tested?

[x] Add some test cases that check the changes thoroughly including negative and positive cases if possible
[ ] Add screenshots for manual tests if appropriate
[x] Run test locally before make a pull request

Apr 11 '23 12:04 iodone

Codecov Report

Merging #4694 (44886f7) into master (f0615a9) will decrease coverage by 0.06%. The diff coverage is 76.25%.

@@             Coverage Diff              @@
##             master    #4694      +/-   ##
============================================
- Coverage     57.99%   57.93%   -0.06%     
  Complexity       13       13              
============================================
  Files           580      580              
  Lines         32218    32268      +50     
  Branches       4304     4322      +18     
============================================
+ Hits          18684    18696      +12     
- Misses        11749    11767      +18     
- Partials       1785     1805      +20

Impacted Files	Coverage Δ
...in/lineage/helper/SparkSQLLineageParseHelper.scala	63.90% <76.25%> (+4.53%)

... and 13 files with indirect coverage changes

Apr 12 '23 07:04 codecov-commenter

I think we should define lineage clearly. The current lineage is easy to follow because we only consider the plan's output: if a column is used only for filtering, sorting, or anything else that does not affect the schema, we ignore it.

So if we want to consider those columns, how about adding a new mode that fully extracts column and table lineage? For example:

INSERT INTO TABLE t
SELECT c1 FROM t1 WHERE c2 > 0 ORDER BY c3

-- The lineage should be:

ColumnUsage(to: String, from: String, usage: String)

Lineage(
  List("default.t1"),
  List("default.t"),
  List(
     ColumnUsage("c1", "default.t1.c1", "OUTPUT"),
     ColumnUsage("N/A", "default.t1.c2", "PREDICATE"),
     ColumnUsage("N/A", "default.t1.c3", "ORDERING")
  )
)
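
To make the shape of that proposal concrete, here is a minimal sketch in Scala. The case classes and the `LineageExample` object are hypothetical and only mirror the structure written above; they are not types from the existing lineage plugin, and the field names `inputTables`, `outputTables`, and `columnUsages` are assumptions.

```scala
// Sketch only: hypothetical types mirroring the proposal above,
// not part of the existing Kyuubi lineage plugin.
final case class ColumnUsage(to: String, from: String, usage: String)

final case class Lineage(
    inputTables: List[String],   // assumed field name
    outputTables: List[String],  // assumed field name
    columnUsages: List[ColumnUsage])

object LineageExample {
  // INSERT INTO TABLE t SELECT c1 FROM t1 WHERE c2 > 0 ORDER BY c3
  val lineage: Lineage = Lineage(
    inputTables = List("default.t1"),
    outputTables = List("default.t"),
    columnUsages = List(
      ColumnUsage("c1", "default.t1.c1", "OUTPUT"),
      ColumnUsage("N/A", "default.t1.c2", "PREDICATE"),
      ColumnUsage("N/A", "default.t1.c3", "ORDERING")))

  // Usages that reach the output schema, i.e. what an output-only mode keeps.
  val outputOnly: List[ColumnUsage] =
    lineage.columnUsages.filter(_.usage == "OUTPUT")

  // Usages that only influence which rows are written or their order.
  val nonOutput: List[ColumnUsage] =
    lineage.columnUsages.filterNot(_.usage == "OUTPUT")
}
```

With this shape, an output-only view is just a filter over `columnUsages`, so a full-extraction mode could stay compatible with the current behavior.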

Apr 12 '23 08:04 ulysses-you

Yes, from the perspective of column output, the current lineage relationship is clear. The main purpose of this PR is to analyze lineage from the table perspective and to treat any table involved in the SQL statement as an input table.
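
As a rough illustration of the table-level view, the sketch below collects every table referenced anywhere in a plan, including tables that contribute no output column. The `Plan` node types and `TableLineage.inputTables` are a hypothetical toy model used only for this example; the actual change works on Spark's logical plan in SparkSQLLineageParseHelper.scala, and the subquery query is an assumed example, not one taken from the PR.

```scala
// Toy plan model, hypothetical: stands in for Spark's logical plan.
sealed trait Plan
final case class Relation(table: String) extends Plan
final case class Project(columns: List[String], child: Plan) extends Plan
final case class Filter(condition: String, child: Plan) extends Plan
final case class Join(left: Plan, right: Plan, condition: String) extends Plan
// A filter whose predicate contains a subquery, e.g. `b IN (SELECT ...)`.
final case class SubqueryFilter(subquery: Plan, child: Plan) extends Plan

object TableLineage {
  // Every table reachable from the plan counts as an input table,
  // whether or not any of its columns reaches the output.
  def inputTables(plan: Plan): List[String] = {
    def loop(p: Plan): List[String] = p match {
      case Relation(t)          => List(t)
      case Project(_, c)        => loop(c)
      case Filter(_, c)         => loop(c)
      case Join(l, r, _)        => loop(l) ++ loop(r)
      case SubqueryFilter(s, c) => loop(c) ++ loop(s)
    }
    loop(plan).distinct
  }
}

object TableLineageExample {
  // SELECT a FROM t1 WHERE b IN (SELECT b FROM t2)
  val plan: Plan = Project(
    List("a"),
    SubqueryFilter(
      subquery = Project(List("b"), Relation("default.t2")),
      child = Relation("default.t1")))

  val tables: List[String] = TableLineage.inputTables(plan)
}
```

Here `default.t2` appears only in the predicate, so an output-only column lineage would not mention it, while the table-level view still reports it as an input.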

Apr 13 '23 09:04 iodone