[SPARK-44811][BUILD] Upgrade Guava to 33.1.0-jre
What changes were proposed in this pull request?
This PR demonstrates the possibility and steps of upgrading Spark's built-in Guava from 14 to 32+.
Currently, Spark uses Guava 14 because the built-in Hive 2.3.9 is incompatible with newer Guava versions. HIVE-27560 (https://github.com/apache/hive/pull/4542) makes Hive 2.3.10 compatible with Guava 14+ (thanks to @LuciferYang).
Why are the changes needed?
It's a long-standing issue, see prior discussions at https://github.com/apache/spark/pull/35584, https://github.com/apache/spark/pull/36231, and https://github.com/apache/spark/pull/33989
Does this PR introduce any user-facing change?
Yes, some user-facing error messages changed.
How was this patch tested?
GA passed.
Thank you for your efforts on this. @pan3793
cc @sunchao @JoshRosen @HyukjinKwon @dongjoon-hyun @srowen @mridulm, what do you think about this approach?
We should probably do something like this for 4.0, yes. Why does it require removing the isolated classloader? That's the only part I'm concerned about breaking something else.
> We should probably do something like this for 4.0, yes.

@srowen yes, it targets 4.0.
> Why does it require removing the isolated classloader?
Guava is marked as shared, which means Guava is not fully isolated. (I don't fully understand the background, but it is in fact required: excluding it from the shared classes breaks the tests.)
https://github.com/apache/spark/blob/afcccb42c96cc7785a85c9463523f34d7a900a1d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L205C3-L219
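For readers following along, here is a simplified, paraphrased sketch of that shared-class check (the prefix list is abbreviated; see the linked source for the actual logic):

```scala
// Paraphrased sketch of IsolatedClientLoader's shared-class check (abbreviated,
// not the real prefix list): classes matching these prefixes are loaded from
// Spark's own classloader rather than the isolated Hive classloader, so both
// sides must agree on a single Guava version.
object SharedClassCheck {
  // extra prefixes can be supplied via spark.sql.hive.metastore.sharedPrefixes
  val sharedPrefixes: Seq[String] = Seq.empty

  def isSharedClass(name: String): Boolean =
    name.startsWith("org.slf4j") ||
    name.startsWith("org.apache.log4j") ||
    name.startsWith("scala.") ||
    name.startsWith("com.google") || // Guava is shared, which blocks the upgrade
    sharedPrefixes.exists(prefix => name.startsWith(prefix))
}
```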
So technically, if we want to upgrade Guava while keeping `IsolatedClientLoader`, we would have to make all supported Hive versions compatible with the new Guava, which is impossible. As I commented in https://www.mail-archive.com/[email protected]/msg30708.html, I think `IsolatedClientLoader` is no longer required after dropping Java 8.
Dropping support for Java 8 means dropping support for Hive versions below 2.0 [1].
`IsolatedClientLoader` aims to allow using different Hive jars to communicate with different versions of HMS. AFAIK, the current built-in Hive 2.3.9 client works well for communicating with Hive Metastore servers from 2.1 through 3.1 (maybe 2.0 too, not sure). This raises a new question: is `IsolatedClientLoader` still required?
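As a reminder of what that feature enables, here is a minimal sketch of how users point a session at a different HMS version today (illustrative values; the `spark.sql.hive.metastore.*` keys are the standard options):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: IsolatedClientLoader loads Hive client jars matching the
// remote metastore version, in a classloader separate from Spark's own.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "2.1.1") // remote HMS version
  .config("spark.sql.hive.metastore.jars", "maven")    // fetch matching client jars
  .getOrCreate()
```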
I think we should drop `IsolatedClientLoader` because:
- As explained above, we can use the built-in Hive 2.3.9 client to communicate with HMS 2.1+
- Since SPARK-42539 [2], the default Hive 2.3.9 client does not use `IsolatedClientLoader`, and as explained in SPARK-42539, `IsolatedClientLoader` causes some inconsistent behaviors.
- It blocks the Guava upgrade. HIVE-27560 [3] aims to make Hive 2.3.10 (unreleased) compatible with all Guava 14+ versions, but unfortunately, Guava is marked as `isSharedClass` [4] in `IsolatedClientLoader`, so technically, if we want to upgrade Guava we would need to make all supported Hive versions (2.0.x through 3.1.x) support a high version of Guava, which I think is impossible.

[1] sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientVersions.scala
[2] https://issues.apache.org/jira/browse/SPARK-42539
[3] https://issues.apache.org/jira/browse/HIVE-27560
[4] https://github.com/apache/spark/pull/33989#issuecomment-926277286
Dropping `IsolatedClientLoader` is also my main concern here - I'm not sure whether anyone in the community still relies on the feature for some reason (cc @xkrogen). I do agree that Hive 2.3.9/10 should be compatible with most Hive versions.
Perhaps it's worth checking out in the dev mailing list, if it's not done already.
> Perhaps it's worth checking out in the dev mailing list, if it's not done already.
@sunchao I posted the proposal in the mailing list thread "What else could be removed in Spark 4?".
> I do agree that Hive 2.3.9/10 should be compatible with most Hive versions.
Yes, I can contribute some use cases.
- We have used Hive 2.3.9 to access HMS 2.1/2.3 internally for several years.
- The default setup of AliCloud E-MapReduce uses Spark 3.3.1 (Hive 2.3.9) to access HMS 3.1.
> Dropping `IsolatedClientLoader` is also my main concern here - I'm not sure whether anyone in the community still relies on the feature for some reason (cc @xkrogen). I do agree that Hive 2.3.9/10 should be compatible with most Hive versions.
Thanks for the ping @sunchao! We're still using it in Spark 3.1.1 to access Hive 1.1, but we're working to get rid of it after the compatibility work you did on the HMS side (thanks for that!). Our only blocker has been some rollout concerns, nothing from the compatibility angle. So, good from my perspective.
Sharing a real case from my company: we use Spark 3.2 in our production environment with the built-in Hive client 2.3.9 plus two patches, communicating with Hive Metastore 1.1.0, and it has been working fine for three years now.
- HIVE-25500: Switch back to alter_partition(s) in HMS client for Hive 2.3.x
- [SPARK-44454][SQL][HIVE] HiveShim getTablesByType support fallback
Regarding the proposed removal of `IsolatedClientLoader`, I am concerned that doing so might break some users' ability to interface with very old Hive metastores:
Although newer Hive client versions are compatible with a wider range of metastore versions, I am unsure whether they cover all old versions. If users are relying on Hive 0.13, for example, then I am concerned that removing `IsolatedClientLoader` might prevent those users from being able to upgrade Spark without first upgrading their metastore.
As was pointed out upthread in https://github.com/apache/spark/pull/42493#issuecomment-1678857243, though, those old Hive versions require Java 8. If Spark 4.0 drops JDK8 support then we'd already lose support for Hive < 2.0 using out-of-the-box Hive metastore clients.
However, some users run forked versions of old Hive which do support newer JDKs, and for them the `IsolatedClientLoader` removal would create pains for their Spark 4.0 upgrade. Since those users are already forking Hive, though, I suppose they could simply recompile their Hive fork against Spark's newer Guava version.
That said, I wonder whether we can decouple the `IsolatedClientLoader` removal decision from the Guava upgrade by removing Guava from `sharedClasses` (previously discussed at https://github.com/apache/spark/pull/33989#issuecomment-928616327). This could also prove useful in case we uncover some other reason why we might want to continue using `IsolatedClientLoader`.
I think Guava would only need to be shared if we were passing instances of Guava classes across the classloader boundary, but I don't think we're doing that.
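To illustrate the point (a minimal sketch with a hypothetical jar path): the "same" class loaded by two independent classloaders is two distinct runtime classes, so instances can only cross the boundary safely when the class comes from a shared loader.

```scala
import java.net.{URL, URLClassLoader}

object ClassLoaderBoundary {
  def main(args: Array[String]): Unit = {
    val jar = new URL("file:/tmp/guava-14.0.1.jar") // hypothetical local path
    // Two loaders with no common parent for this jar
    val a = new URLClassLoader(Array(jar), null)
    val b = new URLClassLoader(Array(jar), null)
    val ca = a.loadClass("com.google.common.collect.ImmutableList")
    val cb = b.loadClass("com.google.common.collect.ImmutableList")
    // Same fully-qualified name, different runtime classes: casting an instance
    // of ca to cb throws ClassCastException, hence the need for sharedClasses
    // when instances cross the classloader boundary.
    println(ca == cb) // false
  }
}
```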
I tried removing Guava from `sharedClasses` and used SBT to run `org.apache.spark.sql.hive.client.HiveClientSuites`, and the tests seem to pass. Given this, maybe we could try kicking off a full CI run to see if we get any failures there? I think it may be worth re-testing here in order to verify our assumptions about Guava and the `IsolatedClientLoader`.
@JoshRosen thanks for the comments. https://github.com/apache/spark/pull/42599 has been opened to try removing Guava from `sharedClasses`; let's wait for the CI results.
> If Spark 4.0 drops JDK8 support then we'd already lose support for Hive < 2.0 using out-of-the-box Hive metastore clients.
Yes, definitely, from the community perspective, that means Spark 4.0 drops support for HMS < 2.0.
> However, some users run forked versions of old Hive which do support newer JDKs, and for them the `IsolatedClientLoader` removal would create pains for their Spark 4.0 upgrade. Since those users are already forking Hive, though, I suppose they could simply recompile their Hive fork against Spark's newer Guava version.
I think that's outside the scope of the Spark community's responsibility. It's also very odd to combine Hive 0.13 (released in 2014) with Spark 4.0 (to be released in 2024); considering that Spark 4.0 has decided to drop support for Java 8 and Hadoop 2, I don't think we should insist on supporting such an old Hive version.
I definitely see lots of users fork and maintain their own internal Hadoop/Hive/Spark versions, but as you said, they could simply recompile their Hive fork against Spark's newer Guava version, and I suppose they could also recompile their Spark fork against their Hive fork.
In short, I don't think it makes sense for a user to expect Spark 4.0.0 to work out of the box with a forked Hive released a decade ago. They should expect some code changes in their Spark fork to make such a combination work.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
UPDATE
Dropping `IsolatedClientLoader` is not necessary after we dropped support for Hive prior to 2.0.0 in SPARK-45328; see the technical details at https://github.com/apache/spark/pull/42599.
I think there are no known technical issues blocking us from upgrading Guava while keeping `IsolatedClientLoader` in Spark 4.0.0.
Summary of the Guava upgrade steps:
- SPARK-45292: Remove Guava from the shared classes of `IsolatedClientLoader` (https://github.com/apache/spark/pull/42599)
- Wait for the Hive 2.3.10 release containing HIVE-27560
- Upgrade Spark's built-in Guava to 32+
Thank you for updating this, @pan3793 .
@JoshRosen @srowen @sunchao @dongjoon-hyun @LuciferYang @HyukjinKwon I think we are ready to upgrade Guava to the modern version, could you please take a look?
should we test 33.2.0?
> should we test 33.2.0?

`33.1.0-jre` is chosen because Spark uses `33.1.0-jre` in other places, and after this change, we can unify all Guava versions.
OK
Is there a possibility that upgrading Guava could carry risks similar to those described in https://github.com/apache/spark/pull/46528? Could there be legacy Hive UDF jars that are bound to Guava 14.0.1?
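For example (a hypothetical class, though the API removal is real): `Objects.toStringHelper` existed in Guava 14 but was removed in Guava 21 in favor of `MoreObjects.toStringHelper`, so a legacy UDF jar compiled against 14 could fail at runtime on a Guava 33 classpath.

```scala
import com.google.common.base.Objects

// Hypothetical legacy UDF helper compiled against Guava 14. On a Guava 33
// classpath, the call below throws NoSuchMethodError because
// Objects.toStringHelper was removed in Guava 21 (use MoreObjects instead).
class LegacyUdfHelper(name: String) {
  override def toString: String =
    Objects.toStringHelper(this).add("name", name).toString
}
```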
WDYT @dongjoon-hyun ?
Should we continue to push this one forward?
@dongjoon-hyun I rebased this PR and updated the description. The Guava upgrade has been requested many times by various users; it would be great if we could deliver it in the coming Spark 4.0.0-preview2.
I agree with you, @pan3793 . I hope we can try in preview2.
Merged to master for Apache Spark 4.0.0-preview2.
Thank you, @pan3793 and @LuciferYang .
I want to ping you once more as a heads-up, @JoshRosen, @cloud-fan, @HyukjinKwon, @yaooqinn, @mridulm, @viirya, @huaxingao.