spark [SPARK-47015][SQL] Disable partitioning on collated columns

Open stefankandic opened this issue 1 year ago • 1 comments

What changes were proposed in this pull request?

Disable Hive-style partitioning on columns that have a non-default collation.

Why are the changes needed?

With the current implementation, partitioning on columns that have an accent-insensitive or case-insensitive collation would lead to incorrect results.
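To make the failure mode concrete, here is a minimal sketch in plain Python, not Spark code. It assumes a simplified model in which each distinct raw string value gets its own Hive-style directory and pruning matches directory names exactly. The helper names (`ci_equal`, `write_partitions`, `prune_by_raw_value`, `scan_with_collation`) are hypothetical, and `str.casefold` is only a stand-in for a real case-insensitive collation.

```python
from collections import defaultdict


def ci_equal(a: str, b: str) -> bool:
    # Stand-in for a case-insensitive collation; real collations are richer.
    return a.casefold() == b.casefold()


def write_partitions(rows, key):
    # Hive-style layout: one directory per distinct raw value, e.g. "name=Alice".
    parts = defaultdict(list)
    for row in rows:
        parts[f"{key}={row[key]}"].append(row)
    return dict(parts)


def prune_by_raw_value(parts, key, value):
    # Pruning that matches the directory name exactly (raw string comparison).
    return parts.get(f"{key}={value}", [])


def scan_with_collation(parts, key, value):
    # What the query should return if equality follows the collation.
    return [r for rows in parts.values() for r in rows if ci_equal(r[key], value)]


rows = [
    {"name": "alice", "id": 1},
    {"name": "Alice", "id": 2},
    {"name": "bob", "id": 3},
]
parts = write_partitions(rows, "name")

pruned = prune_by_raw_value(parts, "name", "alice")
expected = scan_with_collation(parts, "name", "alice")

print(sorted(parts))                          # three directories: "alice" and "Alice" are split
print([r["id"] for r in pruned])              # rows found by raw-value pruning
print(sorted(r["id"] for r in expected))      # rows that are equal under the collation
```

In this model, "alice" and "Alice" land in different directories even though the collation treats them as equal, so a lookup that prunes by raw value misses a matching row.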

Does this PR introduce any user-facing change?

Only compared to the latest master version

How was this patch tested?

With new unit tests (UTs).

Was this patch authored or co-authored using generative AI tooling?

Feb 14 '24 17:02 stefankandic

How about bucket columns? We generate the bucket id from the string value and assume that all semantically equal string values generate the same bucket id, which isn't true for strings with collation.

Feb 15 '24 06:02 cloud-fan
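The bucketing concern can be sketched the same way. This is an illustration under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for the engine's hash function, `NUM_BUCKETS` is a hypothetical bucket count, and `collation_key` uses `str.casefold` as a stand-in for a case-insensitive collation key.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical bucket count, chosen only for illustration


def raw_hash(value: str) -> int:
    # CRC32 over the raw bytes; a stand-in for the engine's real hash function.
    return zlib.crc32(value.encode("utf-8"))


def bucket_id(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Bucket id derived purely from the raw string value.
    return raw_hash(value) % num_buckets


def collation_key(value: str) -> str:
    # Stand-in for a case-insensitive collation key.
    return value.casefold()


a, b = "abc", "ABC"  # equal under a case-insensitive collation

# The raw hashes differ, so nothing guarantees the two values share a bucket.
print(raw_hash(a) != raw_hash(b))
print(bucket_id(a), bucket_id(b))

# Hashing a collation-aware key instead always yields the same bucket.
print(bucket_id(collation_key(a)) == bucket_id(collation_key(b)))
```

The last line shows only what "semantically equal values get the same bucket id" would require. The thread itself does not propose this; it suggests handling bucket columns much like the partitioning change.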

> How about bucket columns? We generate the bucket id from the string value and assume that all semantically equal string values generate the same bucket id, which isn't true for strings with collation.

@mihailom-db this seems like a fairly straightforward task, so can you take a look at it when you have the time? It should not be much different from this change.

Feb 19 '24 16:02 stefankandic

+1, LGTM. Merging to master. Thank you, @stefankandic and @dbatomic @cloud-fan for review.

Feb 29 '24 16:02 MaxGekk
