spark
[SPARK-47015][SQL] Disable partitioning on collated columns
What changes were proposed in this pull request?
Disable Hive-style partitioning on columns that have a non-default collation.
Why are the changes needed?
With the current implementation, partitioning on columns that have an accent-insensitive or case-insensitive collation would lead to incorrect results.
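As a rough illustration of the problem (a plain-Python sketch, not Spark code), assume partition directories are named after the raw column value and that a case-insensitive collation can be approximated by lowercasing. The helper names `write_partitioned`, `read_with_pruning`, and `read_with_collation` are hypothetical and exist only for this example.

```python
from collections import defaultdict


def write_partitioned(rows, column):
    """Group rows into Hive-style directories named after the raw column value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{column}={row[column]}"].append(row)
    return dict(partitions)


def read_with_pruning(partitions, column, value):
    """Read only the directory whose name matches the filter value byte for byte."""
    return partitions.get(f"{column}={value}", [])


def read_with_collation(partitions, column, value):
    """Scan all rows and compare case-insensitively (lowercasing is an assumed
    stand-in for a real case-insensitive collation)."""
    return [
        row
        for rows in partitions.values()
        for row in rows
        if row[column].lower() == value.lower()
    ]


rows = [
    {"name": "abc", "id": 1},
    {"name": "ABC", "id": 2},
    {"name": "xyz", "id": 3},
]
partitions = write_partitioned(rows, "name")

# "abc" and "ABC" are equal under the collation but land in different directories,
# so a lookup by directory name misses one of the matching rows.
pruned = read_with_pruning(partitions, "name", "abc")
expected = read_with_collation(partitions, "name", "abc")
```

Here `pruned` contains only the row with `id` 1, while `expected` contains the rows with `id` 1 and 2, which is the kind of mismatch this change avoids by rejecting such partition columns.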
Does this PR introduce any user-facing change?
Only compared to the latest master version
How was this patch tested?
With new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No
How about bucket columns? We generate the bucket ID from the string value and assume that all semantically equal string values generate the same bucket ID, which isn't true for strings with collation.
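The bucketing concern can be sketched the same way. This is a toy model under stated assumptions: `toy_hash` is a simple polynomial hash standing in for the real hash function, `bucket_id` and `collation_key` are hypothetical helpers, and lowercasing approximates a case-insensitive collation.

```python
def toy_hash(value: str) -> int:
    """Polynomial string hash (h = 31 * h + code point); a stand-in, not Spark's hash."""
    h = 0
    for ch in value:
        h = 31 * h + ord(ch)
    return h


def bucket_id(value: str, num_buckets: int) -> int:
    """Derive the bucket from the hash of the string exactly as given."""
    return toy_hash(value) % num_buckets


def collation_key(value: str) -> str:
    """Assumed normalization for a case-insensitive collation."""
    return value.lower()


NUM_BUCKETS = 10

# Hashing the raw strings: values that are equal under the collation
# can end up in different buckets.
raw_lower = bucket_id("abc", NUM_BUCKETS)
raw_upper = bucket_id("ABC", NUM_BUCKETS)

# Hashing a normalized key instead keeps equal values in the same bucket.
keyed_lower = bucket_id(collation_key("abc"), NUM_BUCKETS)
keyed_upper = bucket_id(collation_key("ABC"), NUM_BUCKETS)
```

With this toy hash, `raw_lower` and `raw_upper` differ, while `keyed_lower` and `keyed_upper` match. Whether the actual fix should normalize the key or simply disallow bucketing on collated columns is left open here.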
@mihailom-db this seems like a fairly straightforward task, so can you take a look at it when you have the time? It should not be much different from this change.
+1, LGTM. Merging to master. Thank you, @stefankandic and @dbatomic @cloud-fan for review.