spark [SPARK-38932][SQL] Datasource v2 support report distinct keys

Open ulysses-you opened this issue 2 years ago • 2 comments

What changes were proposed in this pull request?

Add a new mix-in interface SupportsReportDistinctKeys for DataSource V2
Add a new method reportDistinctKeysSet to LeafNode
Override reportDistinctKeysSet in the DataSource V2 relation
Propagate reportDistinctKeysSet in DistinctKeysVisitor
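
To make the shape of these changes easier to picture, here is a rough, standalone sketch of how the four pieces could fit together. It does not use Spark's real classes: Scan, LeafNode, V2Relation and visitLeaf are simplified stand-ins, the distinctKeysSet method name and the Set<Set<String>> return type are assumptions, and UsersScan and PlainScan are hypothetical examples. Only the names SupportsReportDistinctKeys and reportDistinctKeysSet come from this PR.

```java
import java.util.Set;

// Illustrative stand-ins only; none of these types are Spark's real classes.
class DistinctKeysSketch {

    // Minimal stand-in for a DataSource V2 scan.
    interface Scan {
        String description();
    }

    // Mix-in named in this PR; the method name and return type are assumed here.
    interface SupportsReportDistinctKeys extends Scan {
        // Each inner set is one group of columns whose combined values are unique per row.
        Set<Set<String>> distinctKeysSet();
    }

    // Stand-in for a leaf plan node: by default it reports no distinct keys.
    static class LeafNode {
        Set<Set<String>> reportDistinctKeysSet() {
            return Set.of();
        }
    }

    // Stand-in for the v2 relation: forwards the keys when the scan opts in.
    static class V2Relation extends LeafNode {
        private final Scan scan;

        V2Relation(Scan scan) {
            this.scan = scan;
        }

        @Override
        Set<Set<String>> reportDistinctKeysSet() {
            if (scan instanceof SupportsReportDistinctKeys) {
                return ((SupportsReportDistinctKeys) scan).distinctKeysSet();
            }
            return Set.of();
        }
    }

    // Stand-in for the visitor's leaf case: it simply asks the leaf node.
    static Set<Set<String>> visitLeaf(LeafNode node) {
        return node.reportDistinctKeysSet();
    }

    // Hypothetical scan over a table whose "id" column is a unique key.
    static class UsersScan implements SupportsReportDistinctKeys {
        @Override
        public String description() {
            return "users";
        }

        @Override
        public Set<Set<String>> distinctKeysSet() {
            return Set.of(Set.of("id"));
        }
    }

    // Hypothetical scan that does not implement the mix-in.
    static class PlainScan implements Scan {
        @Override
        public String description() {
            return "plain";
        }
    }

    public static void main(String[] args) {
        // The opted-in scan surfaces its key set; the plain scan reports nothing.
        System.out.println(visitLeaf(new V2Relation(new UsersScan())).contains(Set.of("id")));
        System.out.println(visitLeaf(new V2Relation(new PlainScan())).isEmpty());
    }
}
```

In this sketch a scan that does not implement the mix-in falls back to an empty result, so existing connectors would be unaffected unless they opt in.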

Why are the changes needed?

DataSource V2 can be used to connect to databases that support unique keys.

Spark's Catalyst optimizer can apply further optimizations based on distinct keys, so performance can improve if the Scan reports its distinct keys to Spark.

We already have several optimizer rules for distinct keys, for example:

https://github.com/apache/spark/pull/35779
https://github.com/apache/spark/pull/36117
https://github.com/apache/spark/pull/36530
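
As a rough illustration of why reported keys help (not the logic of any of the PRs above), consider a group-only aggregate such as SELECT id, name FROM t GROUP BY id, name. If the source reports that id is unique, every group holds exactly one row, so the aggregate returns the same rows as a plain projection. The class and method names below are hypothetical, and columns are modelled as plain strings.

```java
import java.util.Set;

// Illustrative only: a simplified check, not an actual Catalyst rule.
class RedundantAggregateSketch {

    // Returns true when the grouping columns fully contain at least one reported
    // distinct key set, meaning each group can hold only a single row.
    // Empty key sets are ignored in this sketch.
    static boolean groupingIsRedundant(Set<Set<String>> distinctKeysSet,
                                       Set<String> groupingColumns) {
        for (Set<String> keys : distinctKeysSet) {
            if (!keys.isEmpty() && groupingColumns.containsAll(keys)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Set<String>> reported = Set.of(Set.of("id"));
        // GROUP BY id, name covers the unique key "id".
        System.out.println(groupingIsRedundant(reported, Set.of("id", "name"))); // true
        // GROUP BY name does not cover any reported key.
        System.out.println(groupingIsRedundant(reported, Set.of("name")));       // false
    }
}
```

Without keys reported by the source, a check like this has nothing to work with at the leaf, which is the gap this PR aims to fill for DataSource V2.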

There are also some in-progress PRs related to distinct keys, for example:

https://github.com/apache/spark/pull/37267
https://github.com/apache/spark/pull/36180

Does this PR introduce any user-facing change?

Yes, a new interface is added for developers.

How was this patch tested?

Added tests.

Apr 19 '22 05:04 ulysses-you

cc @cloud-fan @sigmod @wangyum

Apr 19 '22 06:04 ulysses-you

cc @cloud-fan @huaxingao if you have time to take a look, thank you

Jul 29 '22 02:07 ulysses-you

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Nov 07 '22 00:11 github-actions[bot]
