[BUG] Support for wider types in read schemas for Parquet Reads
TL;DR:
When running the plugin with Spark 4+, reading a Parquet file with a read-schema that contains wider types than the Parquet file's schema should not fail.
Details:
This is with reference to https://github.com/apache/spark/pull/44368. Spark 4 has the ability to read Parquet files where the read-schema uses wider types than the write-schema in the file.
For instance, a Parquet file with an `Integer` column `a` should be readable with a read-schema that defines `a` as a `Long`.
Prior to Spark 4, this would yield a `SchemaColumnConvertNotSupportedException` on both Apache Spark and the plugin. After https://github.com/apache/spark/pull/44368, if the read-schema uses a wider, compatible type, the values are implicitly converted to the wider type during the read. An incompatible type continues to fail as before.
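For illustration, here is a minimal PySpark sketch of that behaviour; the path and session setup are illustrative only and are not taken from the plugin or its tests:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.getOrCreate()

# Write a Parquet file whose physical schema stores column 'a' as a 32-bit integer.
spark.range(10).selectExpr("CAST(id AS INT) AS a").write.parquet("/tmp/int_data")

# Read it back with a wider read-schema: 'a' declared as LongType.
wider_schema = StructType([StructField("a", LongType())])
df = spark.read.schema(wider_schema).parquet("/tmp/int_data")

# Prior to Spark 4 this read fails with SchemaColumnConvertNotSupportedException;
# on Spark 4+ the integer values are implicitly widened to longs.
df.show()
```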
spark-rapids's `parquet_test.py::test_parquet_check_schema_compatibility` integration test currently looks as follows:
def test_parquet_check_schema_compatibility(spark_tmp_path):
    data_path = spark_tmp_path + '/PARQUET_DATA'
    gen_list = [('int', int_gen), ('long', long_gen), ('dec32', decimal_gen_32bit)]
    with_cpu_session(lambda spark: gen_df(spark, gen_list).coalesce(1).write.parquet(data_path))

    read_int_as_long = StructType(
        [StructField('long', LongType()), StructField('int', LongType())])
    assert_gpu_and_cpu_error(
        lambda spark: spark.read.schema(read_int_as_long).parquet(data_path).collect(),
        conf={},
        error_message='Parquet column cannot be converted')
Spark 4's change in behaviour causes this test to fail as follows:

>       with pytest.raises(Exception) as excinfo:
E       Failed: DID NOT RAISE <class 'Exception'>

../../../../integration_tests/src/main/python/asserts.py:650: Failed
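Once the plugin supports the widening, one possible (hypothetical) way to adapt the test would be to keep the error assertion only for older Spark versions and assert CPU/GPU equality on Spark 4+. The split-out test name and the `is_before_spark_400` version helper are assumptions here, not existing code:

```python
from asserts import assert_gpu_and_cpu_error, assert_gpu_and_cpu_are_equal_collect
from data_gen import gen_df, int_gen, long_gen, decimal_gen_32bit
from spark_session import with_cpu_session, is_before_spark_400  # helper assumed to exist
from pyspark.sql.types import StructType, StructField, LongType

def test_parquet_int_read_as_long(spark_tmp_path):  # hypothetical test name
    data_path = spark_tmp_path + '/PARQUET_DATA'
    gen_list = [('int', int_gen), ('long', long_gen), ('dec32', decimal_gen_32bit)]
    with_cpu_session(lambda spark: gen_df(spark, gen_list).coalesce(1).write.parquet(data_path))

    read_int_as_long = StructType(
        [StructField('long', LongType()), StructField('int', LongType())])
    read_func = lambda spark: spark.read.schema(read_int_as_long).parquet(data_path)

    if is_before_spark_400():
        # Older Spark versions still reject the widened read-schema.
        assert_gpu_and_cpu_error(
            lambda spark: read_func(spark).collect(),
            conf={},
            error_message='Parquet column cannot be converted')
    else:
        # On Spark 4+ both CPU and GPU should widen int -> long implicitly.
        assert_gpu_and_cpu_are_equal_collect(read_func, conf={})
```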
It appears that apache/spark#44368 has found its way into Databricks 14.3, so this issue looks to be a problem there as well.
#11727 adds support for widening of decimal types. We still need to support widening of the other types mentioned in the original Spark commit referenced in this issue, such as int->long, int->double, and float->double.
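For reference, a minimal PySpark sketch of the remaining widening combinations; the path is illustrative, and the reads are only expected to succeed on Spark 4+ once the plugin supports them:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Physical schema: 'i' stored as INT32, 'f' stored as FLOAT.
spark.range(10).selectExpr(
    "CAST(id AS INT) AS i", "CAST(id AS FLOAT) AS f").write.parquet("/tmp/widen_data")

# Read-schemas exercising the widenings still to be supported by the plugin.
int_to_long     = StructType([StructField("i", LongType())])
int_to_double   = StructType([StructField("i", DoubleType())])
float_to_double = StructType([StructField("f", DoubleType())])

for schema in (int_to_long, int_to_double, float_to_double):
    spark.read.schema(schema).parquet("/tmp/widen_data").collect()
```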