celeborn [CELEBORN-1402] SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar

What changes were proposed in this pull request?

SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar via --jar or spark.jars.

Why are the changes needed?

When spark.executor.userClassPathFirst is enabled with ShuffleManager defined in user jar, the ClassLoader of handle is ChildFirstURLClassLoader, which is different from CelebornShuffleHandle of which the ClassLoader is AppClassLoader in SparkShuffleManager#getWriter/getReader. The local test log is as follows:

./bin/spark-sql --master yarn --deploy-mode client \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.jars=/tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false

./bin/spark-sql --master yarn --deploy-mode client --jars /tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false

24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: sun.misc.Launcher$AppClassLoader

It causes that SparkShuffleManager fallback to vanilla Spark SortShuffleManager for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar before https://github.com/apache/spark/pull/43627. After SPARK-45762, the ClassLoader of handle and CelebornShuffleHandle are both ChildFirstURLClassLoader.

24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: org.apache.spark.util.ChildFirstURLClassLoader

Therefore, SparkShuffleManager should print warning log to remind for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test.

Apr 28 '24 18:04 SteNicholas

Ping @mridulm, @lifulong, @FMX.

Apr 28 '24 18:04 SteNicholas

We ask users to copy celeborn client side jar to spark jars directory - that should ensure this problem does not occur, right ?

So this would apply only when users are leveraging --jars/spark.jars in order to specify celeborn jar ? If yes, detect and add warning only when that difference exists ?

Apr 29 '24 05:04 mridulm

@mridulm, I left some points according to your reply:

We could not ask user to copy celeborn client side jar to spark jars directory. It's required that the celeborn client jar is placed in HDFS directory instead of spark jars directory in some production environment like bilibili. We could allow user to specify the celeborn client jar and provide the warning log for the cases that fallback to SortManager.
This problem only occurs in the case that spark.executor.userClassPathFirst=true and use --jars/spark.jars to specify the celeborn client jar, which has been given reproduce command in description. The company Zhihu @lifulong has met this problem in their production practice. The solution of this problem has two ways: disable spark.executor.userClassPathFirst or cherry pick https://github.com/apache/spark/pull/43627.
Meanwhile, I have mentioned this point in warning log that fallback to SortManager when setting spark.executor.userClassPathFirst to true and use ShuffleManager defined in user jar for kindly reminder.
BTW, I have also mentioned that this problem does not occur after SPARK-45762.

Do you have any suggestion to the warning log?

Apr 29 '24 06:04 SteNicholas

Ping @mridulm, @lifulong, @FMX.

LGTM

Apr 29 '24 07:04 lifulong

@SteNicholas That makes sense, there are definitely scenarios where we cannot update the spark distribution with additional jars.

The reason why I brought that up is, currently the warning will always be emitted ... even if celeborn jars are part of spark distribution and so not susceptible to the issue.

Given this, can be add to the if condition to check if this is a problematic case ?

Apr 30 '24 05:04 mridulm

@mridulm, I got your main point of concern. I have addressed above comment and reduced the scope of printing warning logs. PTAL.

May 05 '24 12:05 SteNicholas

@cxzl25, I have followed the check checkUserClassPathFirst in spark-2 module. PTAL.

May 06 '24 12:05 SteNicholas

@mridulm, PTAL.

May 07 '24 05:05 SteNicholas

@mridulm, @pan3793, @cfmcgrady, @cxzl25, I have address above comments. PTAL.

May 07 '24 13:05 SteNicholas

Hi @SteNicholas, which modes did you test this in ? I did a quick check against local cluster, spark standalone and local yarn - and keep getting this and variants of this with 3.1 (which is what I would expect - since spark does not support it).

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1956)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:402)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.shuffle.celeborn.SparkShuffleManager
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:207)
	at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:275)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:338)
	at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:205)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:451)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	... 4 more

This should cause the executor to fail, and so wont get to this PR. Am I missing something here ?

I tested this with: ./bin/spark-submit --conf spark.executor.userClassPathFirst=true --conf spark.jars=./celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar --master 'yarn' --deploy-mode cluster --num-executors 2 --executor-memory 1g --executor-cores 1 --class com.mridul.test.Test ./test.jar

(basic celeborn config is there in my defaults config)

May 08 '24 06:05 mridulm

~Let me recheck that - I think the celeborn diff did not apply in my local build~

Works fine with celeborn jar in spark jar directory, not with --jars for 3.1: which is what I would expect actually (given SPARK-45762).

How were you able to test/validate this @SteNicholas ? It is possible I am missing some steps here ?

May 08 '24 06:05 mridulm

@mridulm, I just tested with Spark SQL via above command. Did you ever try to test Spark SQL to verify? Meanwhile, why does the above application fail with Caused by: java.lang.ClassNotFoundException: org.apache.spark.shuffle.celeborn.SparkShuffleManager? The celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar should be in classpath.

May 08 '24 07:05 SteNicholas

Hi @SteNicholas , I was able to reproduce this behavior when celeborn jar is there in both spark platform jars and is specified by users via --jars. Does this match your deployment experience ?

If yes, in this case, it might simply be easier to ask users not to add celeborn jar to their --jars ? (sc.parallelize(1 to 40).repartition(20).flatMap(_ => (1 to 1500000).iterator.map(num => num)).repartition(25).count() - though the specific query should not matter in this case)

Please let me know if I am missing anything.

May 08 '24 22:05 mridulm

@mridulm, my deployment experience is that the celeborn client jar is not placed in jars directory but placed in HDFS which path is set to spark.jars for using SparkShuffleManager. The reason why put the celeborn client jar to HDFS is that we need to decouple the celebrity client release from the spark release. Meanwhile, we could not ask the user not to do this behavior when user need to use --jar way. This is just internal deployment experience. Please let me know if anything wrong.

May 09 '24 07:05 SteNicholas

Before 4.0, Apache Spark does not support shuffle manager from user jars - so I am not sure how this is working for you ... unless the fix was cherry picked/independently fixed in the Spark fork you are running perhaps ?

Once 4.0 is released, this patch will become relevant: so the PR is useful and I am in favor of merging it ... I want to make sure we have the right expectations: given spark does not support this feature before 4.0, trying this will fail for users 😃

May 09 '24 11:05 mridulm

@mridulm, I uses the internal Spark 3.1 version to verify this. Anyway, this pull request could help user to confirm whether to use Celeborn in SparkShuffleManager. @mridulm, @pan3793, @cfmcgrady, @cxzl25, could this pull request be merged?

May 16 '24 06:05 SteNicholas

Merged to main(v0.5.0).

May 17 '24 03:05 SteNicholas