[SPARK-47009][SQL] Enable create table support for collation
### What changes were proposed in this pull request?
Adds support for CREATE TABLE with collated columns using the parquet data source.

### Why are the changes needed?
To support basic DDL operations for collations.

### Does this PR introduce any user-facing change?
Yes, users are now able to create tables with collated columns.

### How was this patch tested?
With unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No
We should put more high-level information in the description: what's the corresponding parquet type for string with collation, and how do we fix the parquet min/max column stats?
@cloud-fan added the info on min/max stats and pushdown. I'm not sure what you mean by the corresponding parquet type for string with collation; AFAIK there is no such thing.
> AFAIK there is no such thing
Yes, and we should mention it in the PR description. We still map string with collation to parquet string type. This means we don't support cross-engine compatibility (collation is ignored when reading parquet files with other engines like Presto). It's OK, but we need to call it out.
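To make the stats caveat concrete, here is a small standalone sketch (not from this PR; plain Scala using the JDK's `java.text.Collator`, with hypothetical example values) of why parquet's binary min/max column statistics cannot be used directly for filter pushdown on collated columns: the collated ordering of values can differ from the binary UTF-8 byte ordering that parquet writers use when computing statistics.

```scala
import java.text.Collator
import java.util.Locale

object CollationStatsSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical column values for illustration.
    val values = Seq("apple", "Banana", "cherry")

    // Binary (byte-wise) ordering, as used for parquet min/max stats:
    // uppercase letters sort before lowercase ones in UTF-8, so the
    // binary minimum is "Banana", not "apple".
    val binaryMin = values.min
    val binaryMax = values.max

    // Case-insensitive collated ordering: "apple" < "Banana" < "cherry".
    val collator = Collator.getInstance(Locale.ROOT)
    collator.setStrength(Collator.SECONDARY) // ignore case differences
    val collatedSorted = values.sortWith((a, b) => collator.compare(a, b) < 0)

    println(s"binary min/max: $binaryMin / $binaryMax")
    println(s"collated order: ${collatedSorted.mkString(", ")}")
  }
}
```

Under a case-insensitive collation, a predicate like `col >= 'apple'` matches "Banana", yet a row group whose binary min/max is ["Banana", "cherry"] could be wrongly skipped if the binary stats were trusted, which is why pushdown has to be disabled (or stats recomputed) for collated columns.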
@stefankandic the new test has failures:
```
- add collated column with alter table *** FAILED *** (167 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 140.0 failed 1 times, most recent failure: Lost task 0.0 in stage 140.0 (TID 134) (localhost executor driver): java.lang.AssertionError: index (1) should < 1
[info]   at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:120)
[info]   at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:317)
[info]   at org.apache.spark.sql.catalyst.expressions.SpecializedGettersReader.read(SpecializedGettersReader.java:36)
[info]   at org.apache.spark.sql.catalyst.expressions.UnsafeRow.get(UnsafeRow.java:312)
[info]   at org.apache.spark.sql.connector.catalog.BufferedRowsReader.extractFieldValue(InMemoryBaseTable.scala:704)
[info]   at org.apache.spark.sql.connector.catalog.BufferedRowsReader.$anonfun$get$1(InMemoryBaseTable.scala:678)
[info]   at org.apache.spark.sql.connector.catalog.BufferedRowsReader.$anonfun$get$1$adapted(InMemoryBaseTable.scala:677)
```
@cloud-fan fixed the test failure, should be ready to merge now
thanks, merging to master!