[SPARK-47009][SQL] Enable create table support for collation
### What changes were proposed in this pull request?
Adds support for CREATE TABLE with collated columns using the parquet data source.

### Why are the changes needed?
To support basic DDL operations for collations.

### Does this PR introduce any user-facing change?
Yes, users are now able to create tables with collated columns.

### How was this patch tested?
With unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No
We should put more high-level information in the description: what's the corresponding parquet type for string with collation, and how do we fix the parquet min/max column stats?
@cloud-fan added the info on min/max stats and pushdown. I'm not sure what you mean by the corresponding parquet type for string with collation; AFAIK there is no such thing.
> AFAIK there is no such thing
Yes, and we should mention it in the PR description. We still map string with collation to parquet string type. This means we don't support cross-engine compatibility (collation is ignored when reading parquet files with other engines like Presto). It's OK, but we need to call it out.
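To make the stats caveat concrete, here is a small standalone sketch (not from this PR; plain Scala using the JDK's `java.text.Collator`, with hypothetical example values) of why parquet's binary min/max column statistics cannot be used directly for filter pushdown on collated columns: the collated ordering of values can differ from the binary UTF-8 byte ordering that parquet writers use when computing statistics.

```scala
import java.text.Collator
import java.util.Locale

object CollationStatsSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical column values for illustration.
    val values = Seq("apple", "Banana", "cherry")

    // Binary (byte-wise) ordering, as used for parquet min/max stats:
    // uppercase letters sort before lowercase ones in UTF-8, so the
    // binary minimum is "Banana", not "apple".
    val binaryMin = values.min
    val binaryMax = values.max

    // Case-insensitive collated ordering: "apple" < "Banana" < "cherry".
    val collator = Collator.getInstance(Locale.ROOT)
    collator.setStrength(Collator.SECONDARY) // ignore case differences
    val collatedSorted = values.sortWith((a, b) => collator.compare(a, b) < 0)

    println(s"binary min/max: $binaryMin / $binaryMax")
    println(s"collated order: ${collatedSorted.mkString(", ")}")
  }
}
```

Under a case-insensitive collation, a predicate like `col >= 'apple'` matches "Banana", yet a row group whose binary min/max is ["Banana", "cherry"] could be wrongly skipped if the binary stats were trusted, which is why pushdown has to be disabled (or stats recomputed) for collated columns.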
@stefankandic the new test has failures:
```
- add collated column with alter table *** FAILED *** (167 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 140.0 failed 1 times, most recent failure: Lost task 0.0 in stage 140.0 (TID 134) (localhost executor driver): java.lang.AssertionError: index (1) should < 1
[info]   at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:120)
[info]   at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:317)
[info]   at org.apache.spark.sql.catalyst.expressions.SpecializedGettersReader.read(SpecializedGettersReader.java:36)
[info]   at org.apache.spark.sql.catalyst.expressions.UnsafeRow.get(UnsafeRow.java:312)
[info]   at org.apache.spark.sql.connector.catalog.BufferedRowsReader.extractFieldValue(InMemoryBaseTable.scala:704)
[info]   at org.apache.spark.sql.connector.catalog.BufferedRowsReader.$anonfun$get$1(InMemoryBaseTable.scala:678)
[info]   at org.apache.spark.sql.connector.catalog.BufferedRowsReader.$anonfun$get$1$adapted(InMemoryBaseTable.scala:677)
```
@cloud-fan fixed the test failure, should be ready to merge now
thanks, merging to master!