
[BUG] Support for wider types in read schemas for Parquet Reads

Open · mythrocks opened this issue 1 year ago · 1 comment

TL;DR:

When running the plugin with Spark 4+, if a Parquet file is being read with a read-schema that contains wider types than the Parquet file's schema, the read should not fail.

Details:

This is with reference to https://github.com/apache/spark/pull/44368. Spark 4 has the ability to read Parquet files where the read-schema uses wider types than the write-schema in the file.

For instance, a Parquet file with an integer column `a` should be readable with a read-schema that defines `a` as type `long`.

Prior to Spark 4, this would raise a `SchemaColumnConvertNotSupportedException` on both Apache Spark and the plugin. After https://github.com/apache/spark/pull/44368, if the read-schema uses a wider, compatible type, the value is implicitly converted to the wider data type during the read. An incompatible type continues to fail as before.
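The supported widenings can be sketched as a small compatibility check in plain Python (a hypothetical helper for illustration, not actual Spark or spark-rapids code; the supported pairs are the ones named in the Spark PR above):

```python
# Hypothetical sketch of Spark 4's supported type-widening pairs for
# Parquet reads (per apache/spark#44368). Not actual Spark code.
SUPPORTED_WIDENINGS = {
    ("int", "long"),
    ("int", "double"),
    ("float", "double"),
}

def is_widening_supported(file_type: str, read_type: str) -> bool:
    """Return True if a Parquet column written as `file_type` can be read
    with a read-schema of `read_type` via implicit up-cast (Spark 4+)."""
    if file_type == read_type:
        return True  # no conversion needed
    return (file_type, read_type) in SUPPORTED_WIDENINGS
```

Note that the relation is one-directional: reading a `long` column as `int` would narrow the type and still fails.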

spark-rapids's parquet_test.py::test_parquet_check_schema_compatibility integration test currently looks as follows:

```python
def test_parquet_check_schema_compatibility(spark_tmp_path):
    data_path = spark_tmp_path + '/PARQUET_DATA'
    gen_list = [('int', int_gen), ('long', long_gen), ('dec32', decimal_gen_32bit)]
    with_cpu_session(lambda spark: gen_df(spark, gen_list).coalesce(1).write.parquet(data_path))

    read_int_as_long = StructType(
        [StructField('long', LongType()), StructField('int', LongType())])
    assert_gpu_and_cpu_error(
        lambda spark: spark.read.schema(read_int_as_long).parquet(data_path).collect(),
        conf={},
        error_message='Parquet column cannot be converted')
```

Spark 4's change in behaviour causes this test to fail as follows:

        """
>       with pytest.raises(Exception) as excinfo:
E       Failed: DID NOT RAISE <class 'Exception'>

../../../../integration_tests/src/main/python/asserts.py:650: Failed
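The expected outcome therefore now depends on the Spark version and on whether the requested widening is a supported one. A minimal sketch of that decision in plain Python (using a hypothetical version tuple rather than spark-rapids' actual version-check helpers):

```python
# Supported widening pairs from apache/spark#44368 (illustrative only).
SUPPORTED_WIDENINGS = {("int", "long"), ("int", "double"), ("float", "double")}

def expect_conversion_error(spark_version: tuple,
                            file_type: str,
                            read_type: str) -> bool:
    """True if the read should raise 'Parquet column cannot be converted'."""
    if file_type == read_type:
        return False  # exact match never fails
    if spark_version < (4, 0):
        return True  # before Spark 4.0, any widening read-schema fails
    # From Spark 4.0 on, only unsupported conversions fail.
    return (file_type, read_type) not in SUPPORTED_WIDENINGS
```

In the integration test this would translate into asserting an error on older Spark versions and asserting matching CPU/GPU results on Spark 4+.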

— mythrocks, Sep 26 '24

It appears that apache/spark#44368 has found its way into Databricks 14.3. This issue looks to be a problem there as well.

— mythrocks, Oct 04 '24

#11727 adds support for widening decimal types. We still need to support widening the other types, such as int->long, int->double, and float->double, mentioned in the original Spark commit referenced in this issue.

— nartal1, Jan 22 '25