[CELEBORN-1402] SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar
What changes were proposed in this pull request?
SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar via --jar or spark.jars.
Why are the changes needed?
When spark.executor.userClassPathFirst is enabled with ShuffleManager defined in user jar, the ClassLoader of handle is ChildFirstURLClassLoader, which is different from CelebornShuffleHandle of which the ClassLoader is AppClassLoader in SparkShuffleManager#getWriter/getReader. The local test log is as follows:
./bin/spark-sql --master yarn --deploy-mode client \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.jars=/tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false
./bin/spark-sql --master yarn --deploy-mode client --jars /tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false
24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: sun.misc.Launcher$AppClassLoader
It causes that SparkShuffleManager fallback to vanilla Spark SortShuffleManager for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar before https://github.com/apache/spark/pull/43627. After SPARK-45762, the ClassLoader of handle and CelebornShuffleHandle are both ChildFirstURLClassLoader.
24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: org.apache.spark.util.ChildFirstURLClassLoader
Therefore, SparkShuffleManager should print warning log to remind for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manual test.
Ping @mridulm, @lifulong, @FMX.
We ask users to copy celeborn client side jar to spark jars directory - that should ensure this problem does not occur, right ?
So this would apply only when users are leveraging --jars/spark.jars in order to specify celeborn jar ?
If yes, detect and add warning only when that difference exists ?
@mridulm, I left some points according to your reply:
- We could not ask user to copy celeborn client side jar to spark jars directory. It's required that the celeborn client jar is placed in HDFS directory instead of spark jars directory in some production environment like bilibili. We could allow user to specify the celeborn client jar and provide the warning log for the cases that fallback to
SortManager. - This problem only occurs in the case that
spark.executor.userClassPathFirst=trueand use--jars/spark.jarsto specify the celeborn client jar, which has been given reproduce command in description. The company Zhihu @lifulong has met this problem in their production practice. The solution of this problem has two ways: disablespark.executor.userClassPathFirstor cherry pick https://github.com/apache/spark/pull/43627. - Meanwhile, I have mentioned this point in warning log that fallback to
SortManagerwhen settingspark.executor.userClassPathFirstto true and use ShuffleManager defined in user jar for kindly reminder. - BTW, I have also mentioned that this problem does not occur after SPARK-45762.
Do you have any suggestion to the warning log?
Ping @mridulm, @lifulong, @FMX.
LGTM
@SteNicholas That makes sense, there are definitely scenarios where we cannot update the spark distribution with additional jars.
The reason why I brought that up is, currently the warning will always be emitted ... even if celeborn jars are part of spark distribution and so not susceptible to the issue.
Given this, can be add to the if condition to check if this is a problematic case ?
@mridulm, I got your main point of concern. I have addressed above comment and reduced the scope of printing warning logs. PTAL.
@cxzl25, I have followed the check checkUserClassPathFirst in spark-2 module. PTAL.
@mridulm, PTAL.
@mridulm, @pan3793, @cfmcgrady, @cxzl25, I have address above comments. PTAL.
Hi @SteNicholas, which modes did you test this in ? I did a quick check against local cluster, spark standalone and local yarn - and keep getting this and variants of this with 3.1 (which is what I would expect - since spark does not support it).
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1956)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:402)
at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.shuffle.celeborn.SparkShuffleManager
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:207)
at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:275)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:338)
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:205)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:451)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
... 4 more
This should cause the executor to fail, and so wont get to this PR. Am I missing something here ?
I tested this with:
./bin/spark-submit --conf spark.executor.userClassPathFirst=true --conf spark.jars=./celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar --master 'yarn' --deploy-mode cluster --num-executors 2 --executor-memory 1g --executor-cores 1 --class com.mridul.test.Test ./test.jar
(basic celeborn config is there in my defaults config)
~Let me recheck that - I think the celeborn diff did not apply in my local build~
Works fine with celeborn jar in spark jar directory, not with --jars for 3.1: which is what I would expect actually (given SPARK-45762).
How were you able to test/validate this @SteNicholas ? It is possible I am missing some steps here ?
@mridulm, I just tested with Spark SQL via above command. Did you ever try to test Spark SQL to verify?
Meanwhile, why does the above application fail with Caused by: java.lang.ClassNotFoundException: org.apache.spark.shuffle.celeborn.SparkShuffleManager? The celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar should be in classpath.
Hi @SteNicholas , I was able to reproduce this behavior when celeborn jar is there in both spark platform jars and is specified by users via --jars.
Does this match your deployment experience ?
If yes, in this case, it might simply be easier to ask users not to add celeborn jar to their --jars ?
(sc.parallelize(1 to 40).repartition(20).flatMap(_ => (1 to 1500000).iterator.map(num => num)).repartition(25).count() - though the specific query should not matter in this case)
Please let me know if I am missing anything.
@mridulm, my deployment experience is that the celeborn client jar is not placed in jars directory but placed in HDFS which path is set to spark.jars for using SparkShuffleManager. The reason why put the celeborn client jar to HDFS is that we need to decouple the celebrity client release from the spark release. Meanwhile, we could not ask the user not to do this behavior when user need to use --jar way. This is just internal deployment experience. Please let me know if anything wrong.
Before 4.0, Apache Spark does not support shuffle manager from user jars - so I am not sure how this is working for you ... unless the fix was cherry picked/independently fixed in the Spark fork you are running perhaps ?
Once 4.0 is released, this patch will become relevant: so the PR is useful and I am in favor of merging it ... I want to make sure we have the right expectations: given spark does not support this feature before 4.0, trying this will fail for users 😃
@mridulm, I uses the internal Spark 3.1 version to verify this. Anyway, this pull request could help user to confirm whether to use Celeborn in SparkShuffleManager.
@mridulm, @pan3793, @cfmcgrady, @cxzl25, could this pull request be merged?
Merged to main(v0.5.0).