[FEA] Support spark.sql.readSideCharPadding for CHAR columns with Spark 3.4+

Open mythrocks opened this issue 2 years ago • 3 comments

Spark 3.4 changed the semantics of reading CHAR columns from ORC files.

Consider the following table:

CREATE TABLE foobar ( foo char(3) ) STORED AS ORCFILE LOCATION '/tmp/foobar';

INSERT INTO FOOBAR VALUES (""), ("0"), ("1 "), (" 1"), ("22"), ("444"), (NULL);

When this data is read with Spark < 3.4, the query returns:

  SELECT foo, CONCAT ('"', foo, '"'), LENGTH(foo) FROM foobar;
        ""      0
0       "0"     1
1       "1"     1
 1      " 1"    2
22      "22"    2
444     "444"   3
NULL    NULL    NULL

With Spark 3.4, this changes to:

  SELECT foo, CONCAT ('"', foo, '"'), LENGTH(foo) FROM foobar;
        "   "   3
0       "0  "   3
1       "1  "   3
 1      " 1 "   3
22      "22 "   3
444     "444"   3
NULL    NULL    NULL
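The Spark 3.4 behaviour above can be mimicked with a small sketch (plain Python for illustration, not the actual spark-rapids or Spark code): read-side padding right-pads each non-null CHAR(n) value with spaces up to the declared width, and leaves values already at that width untouched. The helper name below is hypothetical.

```python
def read_side_pad(value, width):
    """Hypothetical sketch of Spark 3.4's read-side CHAR padding semantics
    (not the actual CharVarcharCodegenUtils.readSidePadding implementation).
    NULLs pass through unchanged; shorter strings are right-padded with spaces."""
    if value is None:
        return None
    return value if len(value) >= width else value.ljust(width)

# Values as a Spark < 3.4 read returns them (trailing spaces trimmed)
stored = ["", "0", "1", " 1", "22", "444", None]
padded = [read_side_pad(v, 3) for v in stored]
# padded == ["   ", "0  ", "1  ", " 1 ", "22 ", "444", None]
```

Note that `" 1"` becomes `" 1 "`: only trailing padding is added, so leading spaces survive both the trimmed read and the re-padding, matching the LENGTH(foo) = 3 results in the table above.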

It would be good to support the new behaviour with the Spark RAPIDS plugin.

(This is incidental fallout from #8321. This behaviour needs to be moved to Shims now.)

mythrocks, May 18 '23 21:05

This is interesting. Here is the "failing" read plan:

GpuColumnarToRow false
+- GpuProject [foo#24, gpuconcat(", foo#24, ") AS concat(", foo, ")#21, length(foo#24) AS length(foo)#22]
   +- GpuRowToColumnar targetsize(1073741824)
      +- *(1) Project [staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, readSidePadding, foo#20, 3, true, false, true) AS foo#24]
         +- GpuColumnarToRow false
            +- GpuFileGpuScan orc spark_catalog.default.foobar[foo#20] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/tmp/foobar], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<foo:string>

The read isn't strictly failing on the GPU. The GPU read returns right-trimmed strings (as in Spark 3.3); the CPU then adds the spaces back, padding to the declared width, via CharVarcharCodegenUtils.readSidePadding().

The worst part is that execution falls off the GPU (columnar -> row -> columnar) just to do the string padding on the CPU.

We don't produce bad reads. But we could go much faster if we simply presented what the cuDF reader returns. That would require intercepting the generated code, which might be a fair bit of work.

mythrocks, May 18 '23 23:05

I've set this to low priority: there is no data corruption or incorrect read involved.

mythrocks, May 18 '23 23:05

This issue is not limited to ORC; it also affects Parquet and any other supported storage format. The behaviour is controlled by https://github.com/apache/spark/blob/7a1608bbc3f1dfd7ffd1f9dc762cb369f47a8d43/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L4628-L4634
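For reference, the `spark.sql.readSideCharPadding` config linked above was added in Spark 3.4 and defaults to true. Assuming a session-level toggle is acceptable for testing, the pre-3.4 trimming behaviour can be restored like so:

```sql
-- Disable read-side CHAR padding for the current session (Spark 3.4+)
SET spark.sql.readSideCharPadding=false;
```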

gerashegalov, Jul 02 '24 03:07