[SPARK-48241][SQL] CSV parsing failure with char/varchar type columns
What changes were proposed in this pull request?
Selecting from a CSV table that contains char or varchar columns fails with the following error:
spark-sql (default)> show create table test_csv;
CREATE TABLE default.test_csv (
id INT,
name CHAR(10))
USING csv
java.lang.IllegalArgumentException: requirement failed: requiredSchema (struct<id:int,name:string>) should be the subset of dataSchema (struct<id:int,name:string>).
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.<init>(UnivocityParser.scala:56)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
Why are the changes needed?
For char and varchar types, Spark will convert them to StringType in CharVarcharUtils.replaceCharVarcharWithStringInSchema and record __CHAR_VARCHAR_TYPE_STRING in the metadata.
The reason for the above error is that the StringType columns in the dataSchema and requiredSchema of UnivocityParser are not consistent. The StringType in the dataSchema has metadata, while the metadata in the requiredSchema is empty. We need to retain the metadata when resolving schema.
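The mismatch can be illustrated with a small standalone model. This is only a sketch, not Spark code: `Field`, `make_field`, `is_subset` and `catalog_string` are hypothetical stand-ins for `StructField`, the subset check in `UnivocityParser`, and the schema string printed in the error message, and the metadata value `"char(10)"` is an illustrative assumption. It assumes field equality takes metadata into account while the printed schema string does not, which would explain why the two schemas in the exception text look identical.

```python
from dataclasses import dataclass

# Metadata key named in the description above.
CHAR_VARCHAR_KEY = "__CHAR_VARCHAR_TYPE_STRING"


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field; equality includes metadata."""
    name: str
    data_type: str
    metadata: tuple = ()  # sorted (key, value) pairs so the field stays hashable


def make_field(name, data_type, metadata=None):
    # Normalize the metadata dict into a hashable, order-independent tuple.
    return Field(name, data_type, tuple(sorted((metadata or {}).items())))


def is_subset(required, data):
    # Subset check by full field equality, metadata included.
    return set(required) <= set(data)


def catalog_string(schema):
    # Renders names and types only; metadata is not shown.
    return "struct<" + ",".join(f"{f.name}:{f.data_type}" for f in schema) + ">"


# dataSchema: the char column was replaced by string, with the original type kept in metadata.
data_schema = [
    make_field("id", "int"),
    make_field("name", "string", {CHAR_VARCHAR_KEY: "char(10)"}),
]

# requiredSchema as resolved before the fix: same names and types, metadata dropped.
required_without_metadata = [make_field("id", "int"), make_field("name", "string")]

# requiredSchema with metadata retained, which is what this change aims for.
required_with_metadata = list(data_schema)

if __name__ == "__main__":
    print(catalog_string(required_without_metadata), catalog_string(data_schema))
    print(is_subset(required_without_metadata, data_schema))
    print(is_subset(required_with_metadata, data_schema))
```

In this model the two schemas render to the same string, yet the subset check only passes once the required schema carries the same metadata as the data schema.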
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a new test case in CSVSuite.
Was this patch authored or co-authored using generative AI tooling?
No.
Hi @ulysses-you, could you help review?
thanks, merging to master/~3.5~!
it has conflicts with 3.5, can you create a new backport PR?
Created a backport PR at https://github.com/apache/spark/pull/46565.