
[SPARK-47009][SQL] Enable create table support for collation

stefankandic opened this pull request 1 year ago

What changes were proposed in this pull request?

Adds support for CREATE TABLE with collated columns using Parquet.

Why are the changes needed?

To support basic DDL operations for collations.

Does this PR introduce any user-facing change?

Yes, users can now create tables with collated columns.
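
As a rough illustration of the user-facing change (a sketch only: the table and column names are made up, and the COLLATE syntax and collation name follow the form the collation work later converged on, so they may differ from the exact syntax in this PR):

```scala
import org.apache.spark.sql.SparkSession

object CollatedCreateTableExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collated-create-table")
      .master("local[*]")
      .getOrCreate()

    // Create a Parquet table whose string column carries a collation.
    spark.sql(
      """CREATE TABLE names (
        |  name STRING COLLATE UNICODE_CI
        |) USING parquet""".stripMargin)

    spark.sql("INSERT INTO names VALUES ('Alice'), ('alice')")

    // Comparisons on the collated column follow the collation's rules;
    // a case-insensitive collation treats 'Alice', 'alice' and 'ALICE' as equal.
    spark.sql("SELECT count(*) FROM names WHERE name = 'ALICE'").show()

    spark.stop()
  }
}
```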

How was this patch tested?

With unit tests.
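
For context, the tests referred to here live in Spark's own SQL test suites; a hypothetical sketch in that style (suite, table, and collation names are illustrative, not the actual test code in this PR) looks roughly like:

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

class CreateTableCollationSuite extends QueryTest with SharedSparkSession {
  test("create parquet table with collated column") {
    withTable("t") {
      sql("CREATE TABLE t (name STRING COLLATE UNICODE_CI) USING parquet")
      sql("INSERT INTO t VALUES ('aaa'), ('AAA')")
      // Under a case-insensitive collation both rows match the predicate.
      checkAnswer(sql("SELECT count(*) FROM t WHERE name = 'aaa'"), Row(2L))
    }
  }
}
```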

Was this patch authored or co-authored using generative AI tooling?

No

stefankandic · Feb 14 '24 17:02

We should include more high-level information: what is the corresponding Parquet type for a string with collation, and how do we fix the Parquet min/max column stats?

cloud-fan · Feb 15 '24 06:02

@cloud-fan added the info on min/max stats and pushdown. I'm not sure what you mean by the corresponding Parquet type for a string with collation; AFAIK there is no such thing.

stefankandic · Feb 19 '24 16:02

> AFAIK there is no such thing

Yes, and we should mention it in the PR description. We still map strings with collation to the Parquet string type, which means we don't get cross-engine compatibility (the collation is ignored when other engines, such as Presto, read the Parquet files). That's OK, but we need to call it out.
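
To make that concrete, a hedged sketch (table name is illustrative and the COLLATE syntax is an assumption): the data files remain plain Parquet string columns, so reading them directly, as another engine would, yields an ordinary string column with no collation attached.

```scala
import org.apache.spark.sql.SparkSession

object CollationParquetMapping {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collation-parquet-mapping")
      .master("local[*]")
      .getOrCreate()

    spark.sql(
      """CREATE TABLE t_collated (
        |  name STRING COLLATE UNICODE_CI
        |) USING parquet""".stripMargin)
    spark.sql("INSERT INTO t_collated VALUES ('a'), ('B')")

    // The table's logical schema carries the collation ...
    spark.table("t_collated").printSchema()

    // ... but on disk the column is a plain Parquet string. Reading the files
    // directly (as an external engine would) shows an ordinary string column.
    val location = spark.sql("DESCRIBE TABLE EXTENDED t_collated")
      .filter("col_name = 'Location'")
      .collect()(0)
      .getString(1)
    spark.read.parquet(location).printSchema()

    spark.stop()
  }
}
```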

cloud-fan · Feb 19 '24 17:02

@stefankandic the new test has failures

- add collated column with alter table *** FAILED *** (167 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 140.0 failed 1 times, most recent failure: Lost task 0.0 in stage 140.0 (TID 134) (localhost executor driver): java.lang.AssertionError: index (1) should < 1
[info] 	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:120)
[info] 	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:317)
[info] 	at org.apache.spark.sql.catalyst.expressions.SpecializedGettersReader.read(SpecializedGettersReader.java:36)
[info] 	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.get(UnsafeRow.java:312)
[info] 	at org.apache.spark.sql.connector.catalog.BufferedRowsReader.extractFieldValue(InMemoryBaseTable.scala:704)
[info] 	at org.apache.spark.sql.connector.catalog.BufferedRowsReader.$anonfun$get$1(InMemoryBaseTable.scala:678)
[info] 	at org.apache.spark.sql.connector.catalog.BufferedRowsReader.$anonfun$get$1$adapted(InMemoryBaseTable.scala:677)

cloud-fan · Feb 26 '24 07:02

@cloud-fan fixed the test failure; it should be ready to merge now.

stefankandic · Feb 26 '24 13:02

thanks, merging to master!

cloud-fan · Feb 26 '24 13:02