
[SPARK-44811][BUILD] Upgrade Guava to 33.1.0-jre

pan3793 opened this pull request 1 year ago • 14 comments

What changes were proposed in this pull request?

This PR demonstrates the possibility and steps of upgrading Spark's built-in Guava from 14 to 32+.

Currently, Spark uses Guava 14 because the built-in Hive 2.3.9 is incompatible with newer Guava versions. HIVE-27560 (https://github.com/apache/hive/pull/4542) makes Hive 2.3.10 compatible with Guava 14+ (thanks to @LuciferYang).

Why are the changes needed?

It's a long-standing issue, see prior discussions at https://github.com/apache/spark/pull/35584, https://github.com/apache/spark/pull/36231, and https://github.com/apache/spark/pull/33989

Does this PR introduce any user-facing change?

Yes, some user-facing error messages changed.

How was this patch tested?

GA passed.

pan3793 avatar Aug 15 '23 02:08 pan3793

Thank you for your efforts on this. @pan3793

LuciferYang avatar Aug 15 '23 03:08 LuciferYang

cc @sunchao @JoshRosen @HyukjinKwon @dongjoon-hyun @srowen @mridulm, what do you think about this approach?

pan3793 avatar Aug 15 '23 10:08 pan3793

We should probably do something like this for 4.0, yes. Why does it require removing the isolated classloader? That's the only part I'm concerned about breaking something else.

srowen avatar Aug 15 '23 12:08 srowen

We should probably do something like this for 4.0, yes.

@srowen yes, it targets 4.0

Why does it require removing the isolated classloader?

Guava is marked as shared, which means that Guava is not fully isolated. (I don't fully understand the background, but the fact is that it's required: excluding it from the shared classes breaks the tests.)

https://github.com/apache/spark/blob/afcccb42c96cc7785a85c9463523f34d7a900a1d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L205C3-L219
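For readers unfamiliar with the mechanism, the shared-class check linked above can be sketched roughly as follows. This is an illustrative approximation, not Spark's actual code, and the prefix list is hypothetical rather than Spark's exact list:

```java
import java.util.List;

// Rough sketch of the isSharedClass idea in IsolatedClientLoader: class names
// matching "shared" prefixes are delegated to the application classloader,
// while everything else would be loaded from the isolated Hive jars.
public class SharedClassCheck {
    // Illustrative prefixes only; Spark's real list lives in IsolatedClientLoader.
    static final List<String> SHARED_PREFIXES = List.of(
        "java.", "scala.", "org.slf4j.", "org.apache.log4j.",
        "com.google.common.");  // Guava being listed here is what blocks the upgrade

    static boolean isSharedClass(String name) {
        return SHARED_PREFIXES.stream().anyMatch(name::startsWith);
    }

    public static void main(String[] args) {
        System.out.println(isSharedClass("com.google.common.collect.ImmutableList")); // true
        System.out.println(isSharedClass("org.apache.hadoop.hive.ql.exec.Utilities")); // false
    }
}
```

Because Guava matches a shared prefix, every Hive client version loaded through the isolated loader sees Spark's single built-in Guava, which is why all of them would need to be compatible with it.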

So technically, if we want to upgrade Guava while keeping IsolatedClientLoader, we would have to make all supported Hive versions compatible with the new Guava, which is impossible.

And as I commented in https://www.mail-archive.com/[email protected]/msg30708.html, I think IsolatedClientLoader is no longer required after dropping Java 8.

Dropping support for Java 8 means dropping support for Hive versions lower than 2.0 (exclusive) [1].

IsolatedClientLoader aims to allow using different Hive jars to communicate with different versions of HMS. AFAIK, the current built-in Hive 2.3.9 client works well for communicating with Hive Metastore servers from 2.1 through 3.1 (maybe 2.0 too, not sure). This raises a new question: is IsolatedClientLoader still required?

I think we should drop IsolatedClientLoader because

  1. As explained above, we can use the built-in Hive 2.3.9 client to communicate with HMS 2.1+
  2. Since SPARK-42539[2], the default Hive 2.3.9 client does not use IsolatedClientLoader, and as explained in SPARK-42539, IsolatedClientLoader causes some inconsistent behaviors.
  3. It blocks the Guava upgrade. HIVE-27560 [3] aims to make Hive 2.3.10 (unreleased) compatible with all Guava 14+ versions, but unfortunately, Guava is marked as a shared class [4] in IsolatedClientLoader, so technically, upgrading Guava would require making all supported Hive versions (2.0.x through 3.1.x) support a high version of Guava, which I think is impossible.

[1] sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientVersions.scala
[2] https://issues.apache.org/jira/browse/SPARK-42539
[3] https://issues.apache.org/jira/browse/HIVE-27560
[4] https://github.com/apache/spark/pull/33989#issuecomment-926277286

pan3793 avatar Aug 15 '23 12:08 pan3793

Dropping IsolatedClientLoader is also my main concern here - I'm not sure whether anyone in the community still relies on the feature for some reason (cc @xkrogen). I do agree that Hive 2.3.9/10 should be compatible with most Hive versions.

Perhaps it's worth raising on the dev mailing list, if that hasn't been done already.

sunchao avatar Aug 16 '23 01:08 sunchao

Perhaps it's worth checking out in the dev mailing list, if it's not done already.

@sunchao I posted the proposal in the mailing list thread What else could be removed in Spark 4?.

pan3793 avatar Aug 16 '23 04:08 pan3793

I do agree that Hive 2.3.9/10 should be compatible with most Hive versions.

Yes, I can contribute some use cases.

  • We have used Hive 2.3.9 to access HMS 2.1/2.3 internally for several years.
  • The default setup of AliCloud E-MapReduce uses Spark 3.3.1 (Hive 2.3.9) to access HMS 3.1.

pan3793 avatar Aug 16 '23 04:08 pan3793

Dropping IsolatedClientLoader is also my main concern here - I'm not sure whether anyone in the community still rely on the feature for some reason (cc @xkrogen). I do agree that Hive 2.3.9/10 should be compatible with most Hive versions.

Thanks for the ping @sunchao! We're still using it in Spark 3.1.1 to access Hive 1.1, but we're working to get rid of it after the compatibility work you did on the HMS side (thanks for that!). Our only blocker has been some rollout concerns, nothing from the compatibility angle. So, good from my perspective.

xkrogen avatar Aug 16 '23 17:08 xkrogen

Sharing a real case from my company: we run Spark 3.2 in a production environment, using the built-in Hive client 2.3.9 with two patches to communicate with Hive Metastore 1.1.0, and it has been working fine for three years now.

  • HIVE-25500: Switch back to alter_partition(s) in HMS client for Hive 2.3.x
  • [SPARK-44454][SQL][HIVE] HiveShim getTablesByType support fallback

cxzl25 avatar Aug 21 '23 08:08 cxzl25

Regarding the proposed removal of IsolatedClientLoader, I am concerned that doing so might break some users' ability to interface with very old Hive metastores:

Although newer Hive client versions are compatible with a wider range of metastore versions, I am unsure whether they cover all old versions. If users are relying on Hive 0.13, for example, then I am concerned that removing IsolatedClientLoader might prevent those users from being able to upgrade Spark without first upgrading their metastore.

As was pointed out upthread in https://github.com/apache/spark/pull/42493#issuecomment-1678857243, though, those old Hive versions require Java 8. If Spark 4.0 drops JDK8 support then we'd already lose support for Hive < 2.0 using out-of-the-box Hive metastore clients.

However, some users run forked versions of old Hive which do support newer JDKs, and for them the IsolatedClientLoader removal would create pain in their Spark 4.0 upgrade. Since those users are already forking Hive, though, I suppose they could simply recompile their Hive fork against Spark's newer Guava version.


That said, I wonder whether we can decouple the IsolatedClientLoader removal decision from the Guava upgrade by removing Guava from sharedClasses (previously discussed at https://github.com/apache/spark/pull/33989#issuecomment-928616327). This could also prove useful in case we uncover some other reason why we might want to continue using IsolatedClientLoader.

I think Guava would only need to be shared if we were passing instances of Guava classes across the classloader boundary, but I don't think we're doing that.
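To illustrate why crossing a classloader boundary matters at all: the same bytecode loaded by two different loaders produces two distinct Class objects, so instances are not interchangeable across the boundary. A minimal standalone demo (hypothetical, not Spark code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Shows that the "same" class defined by two classloaders has two distinct
// Class identities, so casting an instance across the boundary would fail
// with ClassCastException. This is the reason a type would need to be
// "shared" - and why Guava need not be, if no instances cross the boundary.
public class Demo {
    // A child-first loader that re-defines Demo from its .class bytes
    // instead of delegating to the parent.
    static class ChildFirstLoader extends ClassLoader {
        ChildFirstLoader(ClassLoader parent) { super(parent); }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (name.equals("Demo")) {
                try (InputStream in = getParent().getResourceAsStream("Demo.class");
                     ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                    in.transferTo(out);
                    byte[] bytes = out.toByteArray();
                    return defineClass(name, bytes, 0, bytes.length);
                } catch (IOException e) {
                    throw new ClassNotFoundException(name, e);
                }
            }
            return super.loadClass(name, resolve);
        }
    }

    public static void main(String[] args) throws Exception {
        ClassLoader isolated = new ChildFirstLoader(Demo.class.getClassLoader());
        Class<?> reloaded = isolated.loadClass("Demo");
        Object instance = reloaded.getDeclaredConstructor().newInstance();
        // Same bytecode, different Class identity.
        System.out.println(reloaded == Demo.class);        // false
        System.out.println(Demo.class.isInstance(instance)); // false
    }
}
```

So as long as no Guava objects are handed from Spark's side to the isolated Hive client (or back), each side can safely carry its own Guava.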

I tried removing Guava from sharedClasses and used SBT to run org.apache.spark.sql.hive.client.HiveClientSuites and tests seem to pass. Given this, maybe we could try kicking off a full CI run to see if we get any failures there? I think it may be worth re-testing here in order to verify our assumptions about Guava and the IsolatedClientLoader.

JoshRosen avatar Aug 21 '23 21:08 JoshRosen

@JoshRosen thanks for the comments, https://github.com/apache/spark/pull/42599 is opened to try removing Guava from sharedClasses, let's wait for CI results.

If Spark 4.0 drops JDK8 support then we'd already lose support for Hive < 2.0 using out-of-the-box Hive metastore clients.

Yes, definitely, from the community perspective, that means Spark 4.0 drops support for HMS < 2.0.

However, some users run forked versions of old Hive which do support newer JDKs and for them the IsolatedClientLoader removal would create pains for their Spark 4.0 upgrade. Since those users are already forking Hive, though, I suppose they could simply recompile their Hive fork against Spark's newer Guava version.

I think it's outside the scope of the Spark community's responsibility. And it's very weird to combine Hive 0.13 (released in 2014) with Spark 4.0 (which will be released in 2024); considering that Spark 4.0 has decided to drop support for Java 8 and Hadoop 2, I don't think we should insist on supporting such an old Hive version.

I definitely see lots of users fork and maintain their internal Hadoop/Hive/Spark versions, but as you said, they could simply recompile their Hive fork against Spark's newer Guava version, I suppose they also could recompile their Spark fork against their Hive fork.

In short, I don't think it makes sense for a user to expect Spark 4.0.0 to work out-of-the-box with a forked Hive released a decade ago. They should expect some code changes in their Spark fork to make such a combination work.

pan3793 avatar Aug 22 '23 06:08 pan3793

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Dec 01 '23 00:12 github-actions[bot]

UPDATE

Dropping IsolatedClientLoader is no longer necessary after we dropped support for Hive prior to 2.0.0 in SPARK-45328; see technical details at https://github.com/apache/spark/pull/42599.

I think there are no known technical issues blocking us from upgrading Guava while keeping IsolatedClientLoader in Spark 4.0.0.

Summary of upgrading Guava steps:

  1. SPARK-45292: Remove Guava from the shared classes of IsolatedClientLoader (https://github.com/apache/spark/pull/42599)
  2. Wait for the Hive 2.3.10 release containing HIVE-27560
  3. Upgrade Spark built-in Guava to 32+

pan3793 avatar Dec 18 '23 03:12 pan3793

Thank you for updating this, @pan3793 .

dongjoon-hyun avatar May 10 '24 14:05 dongjoon-hyun

@JoshRosen @srowen @sunchao @dongjoon-hyun @LuciferYang @HyukjinKwon I think we are ready to upgrade Guava to the modern version, could you please take a look?

pan3793 avatar May 15 '24 07:05 pan3793

should we test 33.2.0?

LuciferYang avatar May 15 '24 07:05 LuciferYang

should we test 33.2.0?

33.1.0-jre was chosen because Spark already uses 33.1.0-jre in other places, and after this change, we can unify all Guava versions.

pan3793 avatar May 15 '24 08:05 pan3793

should we test 33.2.0?

33.1.0-jre was chosen because Spark already uses 33.1.0-jre in other places, and after this change, we can unify all Guava versions.

OK

LuciferYang avatar May 16 '24 03:05 LuciferYang

Is there a possibility that upgrading Guava could carry risks similar to those described in https://github.com/apache/spark/pull/46528? Could there be legacy Hive UDF jars that are bound to Guava 14.0.1?
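One way to gauge that risk would be to probe legacy UDF jars for references to APIs removed after Guava 14.0.1 (for example, `com.google.common.base.Objects#toStringHelper`, which later moved to `MoreObjects`). A hypothetical reflection-based probe, run here against JDK classes so the sketch stays self-contained:

```java
import java.util.Arrays;

// Hypothetical compatibility probe: checks whether a class on the current
// classpath exposes a method of a given name. Against a Guava jar, this kind
// of check could reveal whether an old UDF jar's expectations still hold.
public class ApiProbe {
    static boolean hasMethod(String className, String methodName) {
        try {
            return Arrays.stream(Class.forName(className).getMethods())
                         .anyMatch(m -> m.getName().equals(methodName));
        } catch (ClassNotFoundException e) {
            return false;  // class itself is gone, which is also a finding
        }
    }

    public static void main(String[] args) {
        // Probing JDK classes here purely for illustration; in practice you
        // would point it at the Guava classes a legacy UDF jar references.
        System.out.println(hasMethod("java.util.List", "of"));             // true
        System.out.println(hasMethod("java.util.List", "toStringHelper")); // false
    }
}
```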

WDYT @dongjoon-hyun ?

LuciferYang avatar May 16 '24 06:05 LuciferYang

Should we continue to push this one forward?

LuciferYang avatar May 23 '24 02:05 LuciferYang

@dongjoon-hyun I rebased this PR and updated the description. The Guava upgrade has been requested many times by various users; it would be great if we could deliver it in the coming Spark 4.0.0-preview2.

pan3793 avatar Sep 11 '24 18:09 pan3793

I agree with you, @pan3793 . I hope we can try in preview2.

dongjoon-hyun avatar Sep 11 '24 18:09 dongjoon-hyun

Merged to master for Apache Spark 4.0.0-preview2.

dongjoon-hyun avatar Sep 12 '24 16:09 dongjoon-hyun

Thank you, @pan3793 and @LuciferYang .

I want to ping you guys for a heads-up once more, @JoshRosen , @cloud-fan , @HyukjinKwon , @yaooqinn , @mridulm , @viirya , @huaxingao .

dongjoon-hyun avatar Sep 12 '24 16:09 dongjoon-hyun