[SPARK-47015][SQL] Disable partitioning on collated columns

Open stefankandic opened this issue 1 year ago • 1 comments

What changes were proposed in this pull request?

Disable Hive-style partitioning on columns that use a non-default collation.

Why are the changes needed?

With the current implementation, partitioning on columns that have an accent-insensitive or case-insensitive collation would lead to incorrect results.
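A minimal sketch of the failure mode, assuming a simplified model in which a partition directory is named after the raw string value and a case-insensitive collation is approximated with `str.casefold`. The helpers `partition_dir`, `write_partitioned`, and `ci_equal` are hypothetical names for illustration and are not Spark APIs:

```python
from collections import defaultdict


def partition_dir(column, value):
    # Hive-style layout: the directory name is built from the raw string value.
    return f"{column}={value}"


def write_partitioned(rows, column):
    # Group rows by directory name, roughly as a partitioned writer would.
    dirs = defaultdict(list)
    for row in rows:
        dirs[partition_dir(column, row[column])].append(row)
    return dict(dirs)


def ci_equal(a, b):
    # Stand-in for a case-insensitive collation comparison (an assumption).
    return a.casefold() == b.casefold()


rows = [
    {"name": "aaa", "id": 1},
    {"name": "AAA", "id": 2},
    {"name": "bbb", "id": 3},
]
dirs = write_partitioned(rows, "name")

# Pruning by exact directory name reads a single directory.
pruned = dirs.get(partition_dir("name", "aaa"), [])

# A full scan with a collation-aware comparison matches both spellings.
expected = [r for r in rows if ci_equal(r["name"], "aaa")]
```

In this model, "aaa" and "AAA" are equal under the collation but end up in separate directories, so a filter answered from the directory name alone can miss rows that a full scan would return.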

Does this PR introduce any user-facing change?

Only compared to the latest master version

How was this patch tested?

With new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

stefankandic avatar Feb 14 '24 17:02 stefankandic

How about bucket columns? We generate the bucket ID from the string value and assume that all semantically equal string values generate the same bucket ID, which isn't true for strings with a collation.

cloud-fan avatar Feb 15 '24 06:02 cloud-fan
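The bucketing concern in the comment above can be sketched as follows. This is an illustration under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for the real hash function, `NUM_BUCKETS` is an arbitrary value, and `collation_key` is a hypothetical normalization that approximates a case-insensitive collation with `str.casefold`:

```python
import zlib

NUM_BUCKETS = 8  # hypothetical bucket count


def raw_hash(value):
    # Stand-in hash over the raw UTF-8 bytes; Spark's actual hash differs.
    return zlib.crc32(value.encode("utf-8"))


def bucket_id(value, num_buckets=NUM_BUCKETS):
    # Bucket ID derived directly from the raw string value.
    return raw_hash(value) % num_buckets


def collation_key(value):
    # Hypothetical normalization for a case-insensitive collation.
    return value.casefold()


def collation_aware_bucket_id(value, num_buckets=NUM_BUCKETS):
    # Hash the normalized key so collation-equal strings share a bucket.
    return bucket_id(collation_key(value), num_buckets)
```

Because the raw hashes of "aaa" and "AAA" differ, nothing guarantees that the two values share a bucket, even though they compare equal under the collation. Hashing a normalized key, as in `collation_aware_bucket_id`, is one way the assumption could be restored; the thread does not say that this is the approach taken.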

> How about bucket columns? We generate the bucket ID from the string value and assume that all semantically equal string values generate the same bucket ID, which isn't true for strings with a collation.

@mihailom-db this seems like a fairly straightforward task, so can you take a look at it when you have the time? It should not be much different from this change.

stefankandic avatar Feb 19 '24 16:02 stefankandic

+1, LGTM. Merging to master. Thank you, @stefankandic and @dbatomic @cloud-fan for review.

MaxGekk avatar Feb 29 '24 16:02 MaxGekk