[HUDI-7915] Spark 4 support
Change Logs
As PR https://github.com/apache/hudi/pull/11539 has not been updated for more than half a year, I pulled the code from there, rebased it onto the current master, resolved conflicts and compilation issues, and fixed the tests.
This PR adds basic Spark 4 support to Hudi (built with Scala 2.13 and Spark 4.0-preview1, and tested with all Java/Scala UTs and FTs); there is still a lot of work to do, but I think it's better to continue that work in separate PRs. As follow-up tasks we need to:
- configure the integration-tests pipeline (newer versions of the Docker images are needed for that);
- test it on Azure.
In Spark 4 there are four main changes (compared to Spark 3) that caused most of the work in this implementation:
- [SPARK-46832] `UTF8String` doesn't support `compareTo` anymore and throws `UnsupportedOperationException` (`binaryCompare` should be used instead).
- A new data type `Variant` is introduced, so we have to implement `public VariantVal getVariant(int ordinal)` in `InternalRow`'s subclasses. As `HoodiePartitionValues extends InternalRow`, `HoodiePartitionCDCFileGroupMapping extends HoodiePartitionValues` and `HoodiePartitionFileSliceMapping extends HoodiePartitionValues`, we must provide different implementations of this class hierarchy (and `HoodieInternalRow`, of course) for Spark 3 and Spark 4.
- Access to the simple constructor of `AnalysisException` is changed from `protected[sql]` to just `protected`, so we cannot instantiate it like `new AnalysisException("errmsg")` anymore and have to introduce our own `HoodieAnalysisException`, which extends `AnalysisException` (see the sketch after this list).
- Spark's `ParseException` does not have a common constructor that suits all (3.3, 3.4, 3.5, 4.0) Spark versions anymore (details: https://github.com/apache/hudi/pull/12772#discussion_r2090479911).
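A minimal sketch of the `AnalysisException` workaround, assuming the message-only protected constructor stays reachable from subclasses (the exact superclass constructor signature may differ per Spark version, and this is not necessarily the exact class added in this PR):

```scala
import org.apache.spark.sql.AnalysisException

// Thin wrapper that regains the "message-only" construction path which Spark 4
// restricts to `protected` on AnalysisException; Hudi code would throw this
// instead of instantiating AnalysisException directly.
class HoodieAnalysisException(message: String) extends AnalysisException(message)
```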
Changes in the code of the hudi-spark4.0.x module compared to hudi-spark3.5.x:
- `HoodieSpark40CatalystPlanUtils`: line 42, method `unapplyMergeIntoTable` (7 params in pattern matching instead of 6; illustrated in the sketch after this list)
- `HoodieSpark4_0ExtendedSqlParser`: line 116 (instantiation of `ParseException` is different)
- `HoodieSpark4_0ExtendedSqlAstBuilder`: lines 517, 1786, 3320
- `Spark40LegacyHoodieParquetFileFormat`: method `buildReaderWithPartitionValues`
- `Spark4_0Adapter`: method `getSchema`; return types of other methods are version specific
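As an illustration of the first item above, here is a hedged sketch (not the exact Hudi code; the method name and wildcard bindings are illustrative) of why the Spark 4 plan-utils class needs its own pattern match: Spark 4's `MergeIntoTable` carries seven constructor parameters where Spark 3.5's carries six.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, MergeIntoTable}

// Spark 4.0 flavor: the pattern needs seven bindings; the Spark 3.5 variant of
// this match destructures only six.
def unapplyMergeIntoTableSketch(plan: LogicalPlan): Option[MergeIntoTable] = plan match {
  case m @ MergeIntoTable(_, _, _, _, _, _, _) => Some(m)
  case _ => None
}
```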
Other classes in the module are the same or nearly the same as in hudi-spark3.5.x, but we can't move them to hudi-spark-common because they differ from hudi-spark3.4.x or hudi-spark3.3.x.
UPD: Changes in Spark 4.0.0 compared to Spark 4.0.0-preview1:
- `SparkSession.close()` throws `IOException` (see the sketch after this list)
- `SQLContext.setConf(key: String, value: String)`
- `LogicalRelation.unapply` has 5 args instead of 4
- `LogicalRDD.unapply` has 6 args instead of 5
- `SparkSession`, `SQLContext`, `Dataset`, `DataFrame` etc. from package `org.apache.spark.sql` are now abstract/interfaces; the old concrete classes extend/implement them and live in the `org.apache.spark.sql.classic` package
- method `parseRoutineParam` is added to `ParserInterface` (not implemented yet)
- `new Column(expr)` is not available anymore, we have to convert `Expression` into `ColumnNode`
- the logical plan of the ALTER TABLE ... ALTER COLUMN command is now `AlterColumns` (instead of `AlterColumn`)
- `WithWindowDefinition` class constructor has 3 args now
- `TableSpec` has a new arg "collation"
- `PartitionedFileUtil.splitFiles` has `Path` as a second arg
- file format versions: parquet - 1.15.2, orc - 2.1.2, avro - 1.12.0
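As a small hedged example of the first item in the list above (the helper name is illustrative, not from this PR): `SparkSession.close()` now declares `java.io.IOException`, so call sites that close the session, Java ones in particular, have to handle it.

```scala
import java.io.IOException
import org.apache.spark.sql.SparkSession

// Close the session without letting a shutdown-time IOException fail the job.
def closeSessionQuietly(spark: SparkSession): Unit =
  try spark.close()
  catch {
    case e: IOException =>
      System.err.println(s"Failed to close SparkSession cleanly: ${e.getMessage}")
  }
```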
When bumping the parquet version to 1.15.2 (which is mandatory for Spark 4.0.0), we have to change the type of the first argument of `HoodieAvroReadSupport.init()` from `org.apache.hadoop.conf.Configuration` to `org.apache.parquet.conf.ParquetConfiguration`, as sketched below.
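A hedged sketch of that signature change (simplified, not the exact Hudi class; it assumes the `init` overload shown here is the one parquet 1.15.x exposes):

```scala
import java.util.{Map => JMap}

import org.apache.parquet.avro.AvroReadSupport
import org.apache.parquet.conf.ParquetConfiguration
import org.apache.parquet.hadoop.api.ReadSupport
import org.apache.parquet.schema.MessageType

class HoodieAvroReadSupportSketch[T] extends AvroReadSupport[T] {
  // First parameter is now org.apache.parquet.conf.ParquetConfiguration
  // rather than org.apache.hadoop.conf.Configuration.
  override def init(configuration: ParquetConfiguration,
                    keyValueMetaData: JMap[String, String],
                    fileSchema: MessageType): ReadSupport.ReadContext = {
    // Hudi-specific read-schema handling would go here before delegating.
    super.init(configuration, keyValueMetaData, fileSchema)
  }
}
```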
Impact
Spark 4 support
Risk level (write none, low medium or high below)
low
Documentation Update
none
- The config description must be updated if new configs are added or the default value of the configs are changed
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Thanks for taking this up!
Hi guys @yihua @nsivabalan @danny0405 @jonvex @CTTY. The PR is ready for review.
I kindly ask you to review it as soon as possible for 3 reasons:
- only after Spark 4 support is merged can we start development of Spark Variant data type support, for which a feature request has already been filed (https://github.com/apache/hudi/issues/12022);
- for all tests to keep passing, every further change in the `hudi-spark3-common` and `hudi-spark3.x.x` modules requires a corresponding implementation in `hudi-spark4-common` and `hudi-spark4.0.x`, so the author of those changes should be responsible for that, not me, please;
- the PR is really huge, and further frequent rebases will be time-consuming for me (I'd rather start the Variant data type support implementation).
@wombatu-kun thanks for the contribution! I'll review this PR in the next few days.
As I see it, the Hudi community is not in a hurry to support Spark 4 and the Variant data type, even though Iceberg/Delta have already implemented them.
@yihua Hello Ethan. Is this PR needed or not? Could it be merged in the near future? I'm tired of rebasing and resolving conflicts every week.
@hudi-bot run azure
@wombatu-kun Please be patient for now; the community has been rushing to get the Hudi 1.0.2 release out recently. I will take some time to review this PR next week.
> @yihua Hello Ethan. Is this PR needed or not? Could it be merged in the near future? I'm tired of rebasing and resolving conflicts every week.
Sorry for the delay. Spark 4 support is definitely needed. The community is wrapping up the work for Hudi 1.0.2 release. I'll review this PR soon.
@hudi-bot run azure
@danny0405 @yihua hi guys! Thank you for the review! I've rebased the branch and answered/addressed your comments, and all tests pass! I hope you don't have any more comments and we can finally merge this huge PR while there are no conflicts. If you do have more comments, maybe we could address them in separate Jiras?
Looking at the number of tests passed, it seems some tests are not being executed for this PR in Azure CI.
I finished my first pass of line-by-line reviews. A few major issues need to be addressed.
@hudi-bot run azure
@wombatu-kun I see a lot of complexities are brought by the InternalRow variant data type and Utf8String; it would be great if we can limit the changes to just the hudi-spark-datasource/hudi-spark4.0.x module (by copying the referenced utility class/method, or maybe maintaining a separate module for these incompatible classes) so we have enough confidence to land it quickly. Some compatibility issues can be addressed by the Sparkx_xAdapter, I guess.
> @wombatu-kun I see a lot of complexities are brought by the `InternalRow` variant data type and `Utf8String`; it would be great if we can limit the changes to just the `hudi-spark-datasource/hudi-spark4.0.x` module (by copying the referenced utility class/method, or maybe maintaining a separate module for these incompatible classes) so we have enough confidence to land it quickly. Some compatibility issues can be addressed by the `Sparkx_xAdapter`, I guess.
And all these complexities come from just the Spark 4.0.0-preview1 version; with the released Spark 4.0.0 the situation becomes even worse because there are lots of breaking changes: many often-used classes were moved to a different package (e.g. SparkSession, SQLContext and Dataset, which are used in Hudi, now live in the org.apache.spark.sql.classic package), new args were added to some constructors or unapply methods (e.g. LogicalRDD, LogicalRelation), etc. These changed classes, which are the basic APIs for integration with Spark, are frequently used even in hudi-spark-client (the fundamental common module for all Spark versions, as you know).
So if we want to avoid a lot of the complexity brought by the changes in Spark 4.0.0 and avoid any risk of breaking compatibility or causing performance issues with Spark 3.x, we have to make the hudi-spark4.0.x module kinda self-contained: copy the code of hudi-spark-client and hudi-spark-common to hudi-spark4.0.x (and remove the dependencies on them from the hudi-spark4.0.x module), and make all classes in this 'super' module compatible with the Spark 4.0.0 release. There would be a lot of copy-pasta in hudi-spark4.0.x, but no Spark 3.x code would change at all and we would have Spark 4 support working.
@yihua says it's unmaintainable to copy classes as suggested, but I don't see any better way to have Spark 4 support without complicating the existing Spark 3.x code.
@danny0405 @yihua let's make a decision here and now.
I can create this self-contained Spark4.0.x module in a new PR if you decide that it's the convenient way.
Btw, Apache Iceberg organizes support for all Spark versions just like that: one version of Spark = one iceberg-spark submodule, no common spark-related code is shared between these modules, and it doesn't seem they have significant problems with maintenance. And they have Spark 4.0.0 support already (by that I mean they have enough time to maintain the old 'copy-pasted' Spark versions and to deliver support for the new version in time).
> And all these complexities come from just the Spark 4.0.0-preview1 version; with the released Spark 4.0.0 the situation becomes even worse because there are lots of breaking changes: many often-used classes were moved to a different package (e.g. `SparkSession`, `SQLContext` and `Dataset`, which are used in Hudi, now live in the `org.apache.spark.sql.classic` package), new args were added to some constructors or unapply methods (e.g. `LogicalRDD`, `LogicalRelation`), etc. These changed classes, which are the basic APIs for integration with Spark, are frequently used even in `hudi-spark-client` (the fundamental common module for all Spark versions).
A bit more details about the changes we have to make in Hudi while switching the Spark dependencies from 4.0.0-preview1 to 4.0.0: for hudi-spark-client to compile (I'm not talking about tests passing, only successful compilation of this module) with Spark 4.0.0 dependencies we need to change ~30 files in this module (mostly fixing imports of the SparkSession, SQLContext, Dataset and DataFrame classes from org.apache.spark.sql to org.apache.spark.sql.classic).
So, if we want these classes to compile with both Spark 3.x and Spark 4 (and don't want to make hudi-spark4.0.x separate and self-contained), we have to move them (without changes) to hudi-spark3-common, copy them (with changed imports) to hudi-spark4.0.x, and add ~30 methods to SparkAdapter to work with these classes depending on the Spark version (a sketch of this adapter pattern is below).
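To make that trade-off concrete, here is an illustrative sketch of the adapter pattern (trait and method names are hypothetical, not from this PR): shared modules only call the adapter, and each version-specific module supplies its own implementation, so classes that moved packages are never referenced directly from common code.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical adapter surface called from shared Hudi code.
trait HoodieSparkVersionAdapterSketch {
  def setSessionConf(spark: SparkSession, key: String, value: String): Unit
}

// Spark 3.x flavor (lives in a Spark 3-specific module): uses the pre-4.0 SQLContext API.
class Spark3AdapterSketch extends HoodieSparkVersionAdapterSketch {
  override def setSessionConf(spark: SparkSession, key: String, value: String): Unit =
    spark.sqlContext.setConf(key, value)
}

// Spark 4.x flavor (lives in a Spark 4-specific module): uses the RuntimeConfig API.
class Spark4AdapterSketch extends HoodieSparkVersionAdapterSketch {
  override def setSessionConf(spark: SparkSession, key: String, value: String): Unit =
    spark.conf.set(key, value)
}
```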
> we have to make the hudi-spark4.0.x module kinda self-contained: copy the code of hudi-spark-client and hudi-spark-common to hudi-spark4.0.x
Another idea is that we continue to maintain hudi-spark-client-4.0.x and hudi-spark-common-4.0.x, but indeed the maintainability issue will occur here, because each time we make some changes we need to duplicate the logic in two files (for Spark 3.x and 4.x).
@yihua We need to reach consensus here first before @wombatu-kun can start the next step.
@wombatu-kun I will try to drive the discussion around the module maintainability topic next week.
hi guys!
@yihua @danny0405 Have you reached a consensus about Spark 4 module maintainability?
I've adapted the Hudi code for Spark 4.0.0 (only for this version, not 3.x) and temporarily added a separate CI pipeline (test-spark4-java17-all-tests) with all Spark-related groups of UTs/FTs running with Scala 2.13 and Spark 4.0.0, and it passes successfully! You can take a look here: https://github.com/apache/hudi/actions/runs/15722368819/job/44305552368?pr=12772
I've also updated the description of this PR by adding info about the changes in Spark 4.0.0 compared to 4.0.0-preview1.
Now I'm ready to start integrating these changes with the other Spark versions, but I'm waiting for your decision.
@yihua @danny0405 hi guys! I've adapted this PR for Spark 4.0.0 (the released version). Now all UTs/FTs pass with both the new and old versions of Spark. Please review.
@wombatu-kun I pushed one commit on fixing the Dockerfile which is required for building the image successfully.
Hi @wombatu-kun the new image is uploaded for flink1200hive313spark400scala213.
@hudi-bot run azure
@yihua Hi Ethan! All tests pass successfully except Hive sync in validate-bundles with Spark 4 (which I'm trying to figure out). So you may start your second pass of reviews.
@hudi-bot run azure
@hudi-bot run azure
I addressed all comments. There are a handful of comments to revisit but they are not blocking this PR from merging. Still need to fix TestSparkSortAndSizeClustering on Spark 4.
It looks like only local timestamp types are not supported on Spark 4, so we are marking that as a limitation for now and disabling the validation of local timestamp types in TestSparkSortAndSizeClustering.
The number of tests executed in Azure CI (excluding an intentionally disabled test) remains the same compared to master.
CI report:
- 7a164d361c535612488c1f0685b4cf1348d07587 Azure: SUCCESS
The sanity test of running Spark Quick Start on Spark 4.0.1 with Scala version 2.13.16 (OpenJDK 64-Bit Server VM, Java 17.0.5) passes.