
[HUDI-7915] Spark 4 support

Open wombatu-kun opened this issue 9 months ago • 19 comments

Change Logs

As PR https://github.com/apache/hudi/pull/11539 had not been updated for more than half a year, I pulled the code from there, rebased it on the current master, resolved conflicts and compilation issues, and fixed the tests.

This PR adds basic Spark 4 support to Hudi (built with Scala 2.13 and Spark 4.0-preview1, tested with all Java/Scala UTs and FTs). There is still a lot of work to do, but I think it's better to continue that work in separate PRs. As follow-up tasks we need to:

  • configure the integration-tests pipeline (newer versions of the Docker images are needed for that);
  • test it on Azure.

In Spark 4 there are four main changes (compared to Spark 3) that caused most of the work in this implementation; a short code sketch follows the list:

  1. [SPARK-46832] UTF8String no longer supports compareTo and throws UnsupportedOperationException (binaryCompare should be used instead).
  2. A new data type, Variant, is introduced, so we have to implement public VariantVal getVariant(int ordinal) in InternalRow's subclasses. As HoodiePartitionValues extends InternalRow, HoodiePartitionCDCFileGroupMapping extends HoodiePartitionValues, and HoodiePartitionFileSliceMapping extends HoodiePartitionValues, we must provide different implementations of this class hierarchy (and of HoodieInternalRow, of course) for Spark 3 and Spark 4.
  3. The visibility of the simple constructor of AnalysisException is changed from protected[sql] to just protected, so we can no longer instantiate it via new AnalysisException("errmsg") and have to introduce our own HoodieAnalysisException, which extends AnalysisException.
  4. Spark's ParseException no longer has a common constructor that suits all Spark versions (3.3, 3.4, 3.5, 4.0); details: https://github.com/apache/hudi/pull/12772#discussion_r2090479911.
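To make changes (1) and (3) concrete, here is a minimal hedged sketch of how they surface in code; the helper names are illustrative, not Hudi's actual classes, and only the API behavior described above is assumed:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.unsafe.types.UTF8String

// Change (3): the String-only AnalysisException constructor is `protected`
// on Spark 4, but a subclass may still invoke it, hence a Hudi-owned subclass.
class HoodieAnalysisException(message: String) extends AnalysisException(message)

object Spark4CompatSketch {
  // Change (1): on Spark 4, UTF8String.compareTo throws
  // UnsupportedOperationException, so ordering logic must call binaryCompare
  // (Spark 3.x code could still call a.compareTo(b)).
  def compareKeys(a: UTF8String, b: UTF8String): Int = a.binaryCompare(b)
}
```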

Changes in the code of the hudi-spark4.0.x module compared to hudi-spark3.5.x:

  • HoodieSpark40CatalystPlanUtils: line 42, method unapplyMergeIntoTable (7 params in the pattern match instead of 6)
  • HoodieSpark4_0ExtendedSqlParser: line 116 (instantiation of ParseException differs)
  • HoodieSpark4_0ExtendedSqlAstBuilder: lines 517, 1786, 3320
  • Spark40LegacyHoodieParquetFileFormat: method buildReaderWithPartitionValues
  • Spark4_0Adapter: method getSchema; return types of other methods are version-specific

Other classes in the module are the same or nearly the same as in hudi-spark3.5.x, but we can't move them to hudi-spark-common because they differ from hudi-spark3.4.x or hudi-spark3.3.x.
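To illustrate the kind of per-version divergence listed above, here is a hedged sketch of the unapplyMergeIntoTable arity difference; the extracted tuple and the ignored fields are assumptions, and only the 6-vs-7-param fact comes from the list:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, MergeIntoTable}

object MergeIntoSketch {
  // Spark 4.0: MergeIntoTable carries 7 fields, so the Spark 3.5-era
  // 6-field pattern no longer compiles and each version needs its own match.
  def unapplyMergeIntoTable(plan: LogicalPlan): Option[(LogicalPlan, LogicalPlan, Expression)] =
    plan match {
      case MergeIntoTable(target, source, cond, _, _, _, _) => Some((target, source, cond))
      case _ => None
    }
}
```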

UPD: Changes in Spark 4.0.0 compared to Spark 4.0.0-preview1 (an import sketch follows the list):

  • SparkSession.close() throws IOException
  • SQLContext.setConf(key: String, value: String)
  • LogicalRelation.unapply has 5 args instead of 4
  • LogicalRDD.unapply has 6 args instead of 5
  • SparkSession, SQLContext, Dataset, DataFrame, etc. from the org.apache.spark.sql package are now abstract classes/interfaces; the old concrete classes extend/implement them and live in the org.apache.spark.sql.classic package
  • method parseRoutineParam is added to ParserInterface (not implemented yet)
  • new Column(expr) is not available anymore; we have to convert an Expression into a ColumnNode
  • The logical plan of the ALTER TABLE ... ALTER COLUMN command is now AlterColumns (instead of AlterColumn)
  • the WithWindowDefinition class constructor now has 3 args
  • TableSpec has a new arg, "collation"
  • PartitionedFileUtil.splitFiles takes Path as its second arg
  • file format library versions: Parquet 1.15.2, ORC 2.1.2, Avro 1.12.0
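As a small illustration of the package move in the list above, here is a hedged sketch; the package locations come from the list, but whether builder() is exposed on the classic companion object exactly like this is my assumption:

```scala
// Spark 4.0.0: the concrete session type now lives in the classic package.
// On Spark 3.x the equivalent import would be org.apache.spark.sql.SparkSession.
import org.apache.spark.sql.classic.SparkSession

object ClassicSessionSketch {
  // Builds a local concrete (classic) session; callers that only need the
  // abstract API can keep importing org.apache.spark.sql.SparkSession.
  def localSession(): SparkSession =
    SparkSession.builder().master("local[1]").appName("spark4-sketch").getOrCreate()
}
```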

When bumping the Parquet version to 1.15.2 (which is mandatory for Spark 4.0.0), we have to change the type of the first argument of HoodieAvroReadSupport.init() from org.apache.hadoop.conf.Configuration to org.apache.parquet.conf.ParquetConfiguration.
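A hedged sketch of that signature change: the parameter names and trailing parameters follow parquet-mr's read-support API as I understand it, and only the change of the first argument's type is taken from the paragraph above.

```scala
import java.util.{Map => JMap}
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.conf.ParquetConfiguration
import org.apache.parquet.hadoop.api.ReadSupport
import org.apache.parquet.schema.MessageType

// Illustrative shapes only, not Hudi's actual HoodieAvroReadSupport.
trait InitSignatureSketch {
  // Before the bump (Hadoop Configuration as the first argument):
  def initOld(conf: Configuration, meta: JMap[String, String], schema: MessageType): ReadSupport.ReadContext

  // With Parquet 1.15.2 (ParquetConfiguration as the first argument):
  def initNew(conf: ParquetConfiguration, meta: JMap[String, String], schema: MessageType): ReadSupport.ReadContext
}
```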

Impact

Spark 4 support

Risk level (write none, low, medium or high below)

low

Documentation Update

none

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

wombatu-kun avatar Feb 05 '25 08:02 wombatu-kun

Thanks for taking this up!

CTTY avatar Feb 08 '25 00:02 CTTY

Hi guys @yihua @nsivabalan @danny0405 @jonvex @CTTY. The PR is ready for review.
I kindly ask you to review it as soon as possible, for three reasons:

  • only after Spark 4 support is merged can we start development of Spark Variant data type support, for which a feature request has already been filed (https://github.com/apache/hudi/issues/12022);
  • for all tests to pass in the future, every further change in the hudi-spark3-common and hudi-spark3.x.x modules will require a corresponding implementation in hudi-spark4-common and hudi-spark4.0.x, so the author of those changes should be responsible for that, not me, please;
  • the PR is really huge, and further frequent rebases will be time-consuming for me (I'd rather start on the Variant data type support implementation).

wombatu-kun avatar Feb 13 '25 09:02 wombatu-kun

@wombatu-kun thanks for the contribution! I'll review this PR in the next few days.

yihua avatar Feb 25 '25 06:02 yihua

As I see it, the Hudi community is not in a hurry to support Spark 4 and the Variant data type, even though Iceberg/Delta have already implemented them.

wombatu-kun avatar Mar 24 '25 02:03 wombatu-kun

@yihua Hello Ethan. Is this PR needed or not? Could it be merged in the near future? I'm tired of rebasing and resolving conflicts every week.

wombatu-kun avatar Mar 27 '25 13:03 wombatu-kun

@hudi-bot run azure

wombatu-kun avatar Apr 01 '25 00:04 wombatu-kun

@wombatu-kun Please be patient for now; the community has recently been rushing to get the Hudi 1.0.2 release out. I will take some time to review this PR next week.

danny0405 avatar Apr 05 '25 10:04 danny0405

@yihua Hello Ethan. Is this PR needed or not? Could it be merged in the near future? I'm tired of rebasing and resolving conflicts every week.

Sorry for the delay. Spark 4 support is definitely needed. The community is wrapping up the work for the Hudi 1.0.2 release. I'll review this PR soon.

yihua avatar Apr 21 '25 18:04 yihua

@hudi-bot run azure

wombatu-kun avatar May 03 '25 14:05 wombatu-kun

@danny0405 @yihua hi guys! Thank you for the review! I've rebased the branch and answered/addressed your comments, and all tests pass! I hope you don't have any more comments and we can finally merge this huge PR while there are no conflicts. If you have more comments, maybe we could address them in separate Jiras?

wombatu-kun avatar May 04 '25 01:05 wombatu-kun

It looks like some tests are not being executed for this PR in Azure CI, judging by the number of tests passed. (Two screenshots of the Azure CI test counts were attached.)

yihua avatar May 06 '25 20:05 yihua

I finished my first pass of line-by-line reviews. A few major issues need to be addressed.

yihua avatar May 06 '25 21:05 yihua

@hudi-bot run azure

wombatu-kun avatar May 30 '25 06:05 wombatu-kun

@wombatu-kun I see a lot of complexity brought in by the InternalRow Variant data type and UTF8String. It would be great if we could limit the changes to just the hudi-spark-datasource/hudi-spark4.0.x module (by copying the referenced utility classes/methods, or maybe maintaining a separate module for these incompatible classes) so we have enough confidence to land it quickly. Some compatibility issues can be addressed by the Sparkx_xAdapter, I guess.

danny0405 avatar Jun 04 '25 02:06 danny0405

@wombatu-kun I see a lot of complexity brought in by the InternalRow Variant data type and UTF8String. It would be great if we could limit the changes to just the hudi-spark-datasource/hudi-spark4.0.x module (by copying the referenced utility classes/methods, or maybe maintaining a separate module for these incompatible classes) so we have enough confidence to land it quickly. Some compatibility issues can be addressed by the Sparkx_xAdapter, I guess.

And all these complexities arise with just the Spark 4.0.0-preview1 version; with the released Spark 4.0.0 the situation becomes even worse, because there are lots of breaking changes: many often-used classes were moved to a different package (e.g. SparkSession, SQLContext and Dataset, which are used in Hudi, now live in the org.apache.spark.sql.classic package), new args were added to some constructors or unapply methods (e.g. LogicalRDD, LogicalRelation), etc. These changed classes, which are the basic APIs for integration with Spark, are frequently used even in hudi-spark-client (the fundamental common module for all Spark versions, as you know).
So if we want to avoid the complexity brought by the changes in Spark 4.0.0, and avoid any risk of breaking compatibility or causing performance issues with Spark 3.x, we have to make the hudi-spark4.0.x module kinda self-contained: copy the code of hudi-spark-client and hudi-spark-common into hudi-spark4.0.x (and remove the dependencies on them from the hudi-spark4.0.x module), then make all classes in this 'super' module compatible with the Spark 4.0.0 release. There would be a lot of copy-pasta in hudi-spark4.0.x, but no Spark 3.x code would change at all, and we would have Spark 4 support working.

@yihua says it's unmaintainable to copy classes as suggested, but I don't see any better way to get Spark 4 support without complicating the existing Spark 3.x code.

@danny0405 @yihua let's make a decision here and now.
I can create this self-contained Spark 4.0.x module in a new PR if you decide that it's a convenient way forward.

Btw, Apache Iceberg organizes support for all Spark versions exactly like that: one Spark version = one iceberg-spark submodule, with no common Spark-related code shared between these modules, and they don't seem to have significant maintenance problems. And they already have Spark 4.0.0 support (by that I mean they have enough time both to maintain the old 'copy-pasted' Spark versions and to deliver support for a new version in time).

wombatu-kun avatar Jun 05 '25 02:06 wombatu-kun

And all these complexities arise with just the Spark 4.0.0-preview1 version; with the released Spark 4.0.0 the situation becomes even worse, because there are lots of breaking changes: many often-used classes were moved to a different package (e.g. SparkSession, SQLContext and Dataset, which are used in Hudi, now live in the org.apache.spark.sql.classic package), new args were added to some constructors or unapply methods (e.g. LogicalRDD, LogicalRelation), etc. These changed classes, which are the basic APIs for integration with Spark, are frequently used even in hudi-spark-client (the fundamental common module for all Spark versions).

A bit more detail about the changes we have to make in Hudi when switching Spark dependencies from 4.0.0-preview1 to 4.0.0: for hudi-spark-client to compile with Spark 4.0.0 dependencies (I'm not talking about tests passing, only successful compilation of this module), we need to change ~30 files in this module (mostly fixing imports of the SparkSession, SQLContext, Dataset and DataFrame classes from org.apache.spark.sql to org.apache.spark.sql.classic).
So, if we want these classes to compile with both Spark 3.x and Spark 4 (and don't want to make hudi-spark4.0.x separate and self-contained), we have to move them (without changes) to hudi-spark3-common, copy them (with changed imports) to hudi-spark4.0.x, and add ~30 methods to SparkAdapter to work with these classes depending on the Spark version; a sketch of that adapter pattern follows.
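A hedged sketch of what such a SparkAdapter-style indirection could look like; the trait and class names are illustrative, not Hudi's actual SparkAdapter, while SQLContext.setConf(key, value) and spark.conf.set(key, value) are existing Spark APIs:

```scala
import org.apache.spark.sql.SparkSession

// Callers in shared modules (e.g. hudi-spark-client) program against the
// trait; each Spark-version module supplies its own implementation.
trait SqlConfAdapterSketch {
  def setConf(spark: SparkSession, key: String, value: String): Unit
}

// Spark 3.x flavor (would live in hudi-spark3-common): uses the concrete
// SQLContext.setConf(key, value) mentioned in the change list above.
class Spark3SqlConfAdapter extends SqlConfAdapterSketch {
  override def setConf(spark: SparkSession, key: String, value: String): Unit =
    spark.sqlContext.setConf(key, value)
}

// Spark 4.x flavor (would live in hudi-spark4-common): sticks to the
// abstract API, avoiding imports from org.apache.spark.sql.classic.
class Spark4SqlConfAdapter extends SqlConfAdapterSketch {
  override def setConf(spark: SparkSession, key: String, value: String): Unit =
    spark.conf.set(key, value)
}
```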

wombatu-kun avatar Jun 06 '25 02:06 wombatu-kun

we have to make the hudi-spark4.0.x module kinda self-contained: copy the code of hudi-spark-client and hudi-spark-common into hudi-spark4.0.x

Another idea is to continue maintaining hudi-spark-client-4.0.x and hudi-spark-common-4.0.x, but the maintainability issue will indeed occur here, because each time we make a change we need to duplicate the logic in two files (for Spark 3.x and 4.x).

@yihua We need to reach consensus here first before @wombatu-kun can start the next step. @wombatu-kun I will try to drive the discussion around the module maintainability topic next week.

danny0405 avatar Jun 08 '25 00:06 danny0405

@yihua We need to reach consensus here first before @wombatu-kun can start the next step.
@wombatu-kun I will try to drive the discussion around the module maintainability topic next week.

Hi guys!
@yihua @danny0405 Have you reached a consensus about Spark 4 module maintainability?

I've adapted the Hudi code for Spark 4.0.0 (only for this version, not 3.x) and temporarily added a separate CI pipeline (test-spark4-java17-all-tests) with all Spark-related groups of UTs/FTs running with Scala 2.13 and Spark 4.0.0, and it passes successfully! You can take a look here: https://github.com/apache/hudi/actions/runs/15722368819/job/44305552368?pr=12772
I've also updated the description of this PR, adding info about the changes in Spark 4.0.0 compared to 4.0.0-preview1.

Now I'm ready to start integrating these changes with the other Spark versions, but I'm waiting for your decision.

wombatu-kun avatar Jun 18 '25 07:06 wombatu-kun

@yihua @danny0405 hi guys! I've adapted this PR for Spark 4.0.0 (the released version). Now all UTs/FTs pass with the new and old versions of Spark. Please review.

wombatu-kun avatar Jul 01 '25 13:07 wombatu-kun

@wombatu-kun I pushed one commit on fixing the Dockerfile which is required for building the image successfully.

yihua avatar Jul 24 '25 20:07 yihua

Hi @wombatu-kun, the new image for flink1200hive313spark400scala213 has been uploaded.

yihua avatar Jul 30 '25 22:07 yihua

@hudi-bot run azure

wombatu-kun avatar Aug 11 '25 03:08 wombatu-kun

@yihua Hi Ethan! All tests pass successfully except Hive sync in validate-bundles with Spark 4 (which I'm still trying to figure out). So you may start your second pass of reviews.

wombatu-kun avatar Aug 15 '25 10:08 wombatu-kun

@hudi-bot run azure

wombatu-kun avatar Aug 20 '25 04:08 wombatu-kun

@hudi-bot run azure

wombatu-kun avatar Aug 20 '25 05:08 wombatu-kun

I addressed all comments. There are a handful of comments to revisit, but they are not blocking this PR from merging. I still need to fix TestSparkSortAndSizeClustering on Spark 4.

yihua avatar Sep 21 '25 09:09 yihua

It looks like only local timestamp types are not supported on Spark 4. So I'm making that a limitation for now and disabling the validation of local timestamp types in TestSparkSortAndSizeClustering.

yihua avatar Sep 21 '25 09:09 yihua

The number of tests executed in Azure CI (excluding an intentionally disabled test) remains the same compared to master. (Two screenshots of the test counts were attached.)

yihua avatar Sep 21 '25 22:09 yihua

CI report:

  • 7a164d361c535612488c1f0685b4cf1348d07587 Azure: SUCCESS
Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure (re-runs the last Azure build)

hudi-bot avatar Sep 21 '25 23:09 hudi-bot

The sanity test of running the Spark Quick Start on Spark 4.0.1 with Scala version 2.13.16 (OpenJDK 64-Bit Server VM, Java 17.0.5) passes.

yihua avatar Sep 22 '25 00:09 yihua