incubator-gluten

Upgrade Spark33 version to Spark3.3.4

Open zwangsheng opened this issue 1 year ago • 6 comments

Description

As title

zwangsheng avatar Mar 08 '24 06:03 zwangsheng

@weiting-chen, do we have any users on Spark 3.3.1? If not, upgrading directly to 3.3.4 may be fine. cc @zhouyuan.

PHILO-HE avatar Mar 11 '24 02:03 PHILO-HE

@zwangsheng @PHILO-HE do you think we could support all minor releases of Spark 3.3?

thanks, -yuan

zhouyuan avatar Mar 11 '24 10:03 zhouyuan

Hi @zwangsheng, did you hit any issues in your spark-3.3.4 environment when directly using the Gluten jar built for spark-3.3.1? I think there should be some incompatibility issues, at least for Parquet write.

PHILO-HE avatar Mar 18 '24 06:03 PHILO-HE

> Hi @zwangsheng, did you hit any issues in your spark-3.3.4 environment when directly using the Gluten jar built for spark-3.3.1? I think there should be some incompatibility issues for Parquet write.

Yeah, when I tried to run the existing unit tests, I found that Gluten has its own built-in ParquetFileFormat.scala, and that file has changed in Spark 3.3.4. This may cause Parquet file read or write issues.

You can find the diff in the draft PR https://github.com/apache/incubator-gluten/pull/4897

zwangsheng avatar Mar 18 '24 06:03 zwangsheng

> > Hi @zwangsheng, did you hit any issues in your spark-3.3.4 environment when directly using the Gluten jar built for spark-3.3.1? I think there should be some incompatibility issues for Parquet write.
>
> Yeah, when I tried to run the existing unit tests, I found that Gluten has its own built-in ParquetFileFormat.scala, and that file has changed in Spark 3.3.4. This may cause Parquet file read or write issues.
>
> You can find the diff in the draft PR #4897

ParquetFileFormat.scala was ported from Spark code with a few changes that override Spark's logic. We may have to maintain two such files in the shims module, one for 3.3.1 and one for 3.3.4, to support both versions. Can we just keep using the 3.3.1 UTs in Gluten? I assume most UTs are shared by 3.3.1 and 3.3.4. Maybe we can just extract some important UTs (those not covered in 3.3.1) from spark-3.3.4 and put them into a new test module in Gluten.

PHILO-HE avatar Mar 18 '24 09:03 PHILO-HE

> ParquetFileFormat.scala was ported from Spark code with a few changes that override Spark's logic. We may have to maintain two such files in the shims module, one for 3.3.1 and one for 3.3.4, to support both versions. Can we just keep using the 3.3.1 UTs in Gluten? I assume most UTs are shared by 3.3.1 and 3.3.4. Maybe we can just extract some important UTs (those not covered in 3.3.1) from spark-3.3.4 and put them into a new test module in Gluten.

I see. I will try maintaining two files and dig deeper to find the differences.

zwangsheng avatar Mar 19 '24 07:03 zwangsheng
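
Editorial sketch: the "two files in the shims module" idea discussed above amounts to choosing a version-specific implementation by the exact Spark patch version. The snippet below is only a rough illustration of that dispatch, written in Java for brevity. The class name `ParquetShimSelector` and the shim identifiers are hypothetical and are not Gluten's actual shim API or module layout.

```java
import java.util.Map;

// Hypothetical illustration only: these names do not come from Gluten.
final class ParquetShimSelector {
    // Maps an exact Spark patch version to the (hypothetical) shim that would
    // carry the ParquetFileFormat.scala copy ported from that Spark release.
    private static final Map<String, String> SHIMS = Map.of(
        "3.3.1", "spark331-parquet-shim",
        "3.3.4", "spark334-parquet-shim");

    // Returns the shim identifier for the given Spark version, or fails loudly
    // when no ported copy exists for that patch release.
    static String select(String sparkVersion) {
        String shim = SHIMS.get(sparkVersion);
        if (shim == null) {
            throw new IllegalArgumentException(
                "No Parquet shim for Spark " + sparkVersion);
        }
        return shim;
    }

    public static void main(String[] args) {
        System.out.println(select("3.3.1"));
        System.out.println(select("3.3.4"));
    }
}
```

Failing on an unknown version, rather than silently falling back to the 3.3.1 copy, is a deliberate choice in this sketch: it mirrors the concern raised in the thread that a jar built for one patch release may misbehave for Parquet read or write on another.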