[jvm-packages] XGBoost4J-Spark 2.0.0-RC1 fails for Spark 3.4.0 on EMR
Hello everybody,
I am trying to implement XGBoost4J-Spark in a Scala project. Everything works fine locally (on an Intel MacBook), but when deploying to EMR I receive the following error (running on EMR 6.12.0 and Spark 3.4.0 with Scala 2.12.17):
java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
In my build.sbt I added the following lines to libraryDependencies, as suggested by the tutorial (running with sbt 1.9.2):
"ml.dmlc" %% "xgboost4j" % "1.7.6",
"ml.dmlc" %% "xgboost4j-spark" % "1.7.6"
I packaged everything up into a single JAR via the sbt-assembly plugin, which I expected to bundle all the dependencies needed to run the Spark application on EMR, so I am really out of ideas about this error. I am not sure whether this is a mistake on my end or an actual bug. Help is appreciated!
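For context, the relevant part of my build.sbt looked roughly like this (a sketch rather than my exact file; marking Spark as provided and the merge strategy shown are the usual sbt-assembly setup, not necessarily what I had):

// build.sbt sketch -- Spark itself is provided by EMR, so it is excluded from the fat JAR
ThisBuild / scalaVersion := "2.12.17"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"       % "3.4.0" % "provided",
  "org.apache.spark" %% "spark-mllib"     % "3.4.0" % "provided",
  "ml.dmlc"          %% "xgboost4j"       % "1.7.6",
  "ml.dmlc"          %% "xgboost4j-spark" % "1.7.6"
)

// sbt-assembly merge strategy: discard only the usual conflicting META-INF metadata so that
// resources such as the bundled native library (libxgboost4j.so) survive assembly
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}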
Can you install XGBoost4J-Spark from Maven Central? Building the JARs locally is more complex, as you might run into issues with bundling the native library (libxgboost4j.so).
I assume you mean to build the packages from source?
I tried that on the master node of the EMR cluster, but ran into errors. I ran the following steps:
Installing Maven:
wget https://dlcdn.apache.org/maven/maven-3/3.9.4/binaries/apache-maven-3.9.4-bin.tar.gz
tar xzvf apache-maven-3.9.4-bin.tar.gz
PATH=$PATH:/home/hadoop/apache-maven-3.9.4/bin
Cloning the repo, switching to the 1.7.6 commit, and then packaging it up (following the steps in the tutorial):
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
git checkout 36eb41c
cd jvm-packages
mvn package
This ended with the following error:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for XGBoost JVM Package 1.7.6:
[INFO]
[INFO] XGBoost JVM Package ................................ SUCCESS [ 2.298 s]
[INFO] xgboost4j_2.12 ..................................... FAILURE [ 1.973 s]
[INFO] xgboost4j-spark_2.12 ............................... SKIPPED
[INFO] xgboost4j-flink_2.12 ............................... SKIPPED
[INFO] xgboost4j-example_2.12 ............................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4.651 s
[INFO] Finished at: 2023-08-22T23:09:24Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:exec (native) on project xgboost4j_2.12: Command execution failed.: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <args> -rf :xgboost4j_2.12
which seems to be caused by this:
File "create_jni.py", line 125
run(f'"{sys.executable}" mapfeat.py')
^
SyntaxError: invalid syntax
[ERROR] Command execution failed.
Similarly, I tried to install it by running ./xgboost/jvm-packages/dev/build-linux.sh, as suggested by the README in jvm-packages. This too fails somewhere down the line with:
docker: Error response from daemon: pull access denied for dmlc/xgboost4j-build, repository does not exist or may require 'docker login'
I feel like I am well off the beaten path here and am probably missing something quite obvious...
Do you have a working Python 3 installation?
I didn't realize you have to build from source when using EMR. Do you need an uber-JAR with all dependencies included? I found it hard to build such a JAR.
Yes, I do have a Python 3 installation, but it seems like this error is caused by an older interpreter running create_jni.py: the f-string syntax used on line 125 requires Python 3.6+. Invoking python from the command line opens a Python 3.7.16 shell, so I am not sure how or why Python 2 is being invoked.
I just want something that reliably works in production; building these uber-JARs hasn't failed me so far.
@nawidsayed, I guess you can hack the python path from here https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/pom.xml#L88
Thanks for your help so far, everybody! I noticed that I was running on EMR with Graviton 2 processors (r6gd instances), which are ARM-based and, I believe, might not be well supported by XGBoost4J. I switched to r5d instances (with Intel Xeon) and all the dependencies now seem to be present. However, I am still encountering a very generic error (from the master node):
23/08/23 08:55:07 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
23/08/23 08:55:07 ERROR RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 1
23/08/23 08:55:08 ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:418) ~[test.jar:1.0.0]
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:202) ~[test.jar:1.0.0]
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:34) ~[test.jar:1.0.0]
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114) ~[spark-mllib_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:65) ~[test.jar:1.0.0]
at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:18) ~[test.jar:1.0.0]
at com.jobs.SparkJobWithJson.main(SparkJobWithJson.scala:34) ~[test.jar:1.0.0]
at com.jobs.PrototypeSparkJob.main(PrototypeSparkJob.scala) ~[test.jar:1.0.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_382]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_382]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_382]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_382]
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
Exception in thread "main" ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:418)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:202)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:34)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:65)
at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:18)
at com.jobs.SparkJobWithJson.main(SparkJobWithJson.scala:34)
at com.jobs.PrototypeSparkJob.main(PrototypeSparkJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
I haven't found a remedy for this. Almost all references to this error involve the GPU implementation, which is not the case for me here. It's confusing because if I check stderr on the executor, it's clear that the training is actually happening and there is no indication of an error there:
[09:34:54] [0] train-mlogloss:0.98398036223191476
[09:34:54] [0] train-mlogloss:0.97309246502424540
[09:34:54] [1] train-mlogloss:0.88586563941759944
[09:34:54] [1] train-mlogloss:0.86604528834945282
[09:34:54] [2] train-mlogloss:0.80109514334262943
[09:34:54] [2] train-mlogloss:0.77383518846411459
[09:34:54] [3] train-mlogloss:0.72730388396825540
[09:34:54] [4] train-mlogloss:0.66267788104521919
[09:34:54] [3] train-mlogloss:0.69377712826979787
[09:34:54] [5] train-mlogloss:0.60579290756812465
[09:34:54] [4] train-mlogloss:0.62382606613008595
[09:34:54] [6] train-mlogloss:0.55597261434946310
....
23/08/23 09:34:56 INFO Executor: 1 block locks were not released by task 0.0 in stage 4.0 (TID 4)
[rdd_25_0]
23/08/23 09:34:56 INFO MemoryStore: Block taskresult_4 stored as bytes in memory (estimated size 6.1 MiB, free 6.1 GiB)
23/08/23 09:34:56 INFO Executor: Finished task 0.0 in stage 4.0 (TID 4). 6442557 bytes result sent via BlockManager)
23/08/23 09:34:56 INFO YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
23/08/23 09:34:56 INFO MemoryStore: MemoryStore cleared
23/08/23 09:34:56 INFO BlockManager: BlockManager stopped
23/08/23 09:34:56 INFO ShutdownHookManager: Shutdown hook called
23/08/23 09:34:56 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1692778153848_0006/spark-d1d9df8a-dde2-47ce-8ca7-fc09fde80055
So it seems like it's related to Spark and XGBoost versioning. Using Spark 3.4.0 on Scala 2.12 with the XGBoost packages at version 1.7.6, I get the aforementioned error, which is probably related to the Rabit tracker: stdout prints Tracker started, with env={} just before erroring out.
However, I don't have any issues when running Spark 2.4.8 on Scala 2.11 with xgboost4j and xgboost4j-spark version 1.1.2. In that case, just before the training routine, stdout reads: Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=172.31.89.29, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=36}.
Is there any way to make it work properly with Spark 3.4?
XGBoost 1.7.6 supports Spark 3.0.1: https://github.com/dmlc/xgboost/blob/36eb41c960483c8b52b44082663c99e6a0de440a/jvm-packages/pom.xml#L37
You can use XGBoost 2.0.0 to use Spark 3.4.0: https://github.com/dmlc/xgboost/blob/4301558a5711e63bbf004d2b6fca003906fb743c/jvm-packages/pom.xml#L38
Thanks for pointing this out. Unfortunately, adding the library according to the instructions here fails in the following way when running sbt compile (note the unresolved ${scala.binary.version} placeholder in the artifact path):
[error] (update) java.net.URISyntaxException: Illegal character in path at index 106: https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/release/ml/dmlc/xgboost4j_2.12/2.0.0-RC1/xgboost4j_${scala.binary.version}-2.0.0-RC1.jar
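For reference, what I added based on those instructions was roughly the following (a sketch; the resolver name is an arbitrary label of mine, and the URL is the repository root from the error above):

resolvers += "XGBoost4J Release Repo" at "https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/release/"

libraryDependencies ++= Seq(
  "ml.dmlc" %% "xgboost4j"       % "2.0.0-RC1",
  "ml.dmlc" %% "xgboost4j-spark" % "2.0.0-RC1"
)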
Even when manually adding the 2.0.0-RC1 packages to the JAR, we run into the Rabit tracker error:
Tracker started, with env={}
23/08/23 16:53:11 ERROR RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 1
23/08/23 16:53:12 ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
Even after this error, the executors still proceed with training, according to their logs:
[16:59:23] [97] train-mlogloss:0.63060436905694506
[16:59:23] [97] train-mlogloss:0.63249886897005347
[16:59:24] [98] train-mlogloss:0.63300104089375020
...
I think we should prioritize the refactoring of the tracker; otherwise, JVM-related issues like this are quite difficult to resolve.
Is it possible the tracker is also running with Python 2?
I don't know; isn't it written in C? The default python command resolves to Python 3.7.16 on EMR, though. Anyway, I was able to run XGBoost4J-Spark 1.1.2 on EMR 5.36.1 (Spark 2.4.8) successfully, and I didn't change anything besides the EMR and XGBoost versions to get it running.
If it helps, I could write out a minimal example that leads to the aforementioned success and failure, respectively.
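Roughly, the training part of the job is nothing more exotic than this (a trimmed-down sketch; the column names, parameters, and object names are illustrative placeholders rather than my actual code):

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MinimalXGBoostJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xgboost-minimal").getOrCreate()
    import spark.implicits._

    // tiny toy dataset; the real job reads its features from S3
    val raw = Seq((1.0, 2.0, 0), (2.0, 1.0, 1), (3.0, 0.5, 2), (0.5, 3.0, 0))
      .toDF("f1", "f2", "label")

    // assemble the feature columns into the single vector column XGBoost expects
    val assembled = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
      .transform(raw)

    val classifier = new XGBoostClassifier(Map(
      "objective"   -> "multi:softprob",
      "num_class"   -> 3,
      "num_round"   -> 10,
      "num_workers" -> 2
    )).setFeaturesCol("features").setLabelCol("label")

    // this is the call that aborts with the tracker error on EMR
    val model = classifier.fit(assembled)
    model.transform(assembled).show()

    spark.stop()
  }
}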
I bumped into the exact same generic error reported by the OP, using a very similar setup (EMR 6.5.0, Spark 3.1.2). Even though I am using Scala Spark, there is a Python dependency through RabitTracker, which requires Python >= 3.8. But EMR 6.5.0 provides Python 3.7. Setting up a virtual environment that allows the cluster to use a higher Python version solved the problem for me.
Coming back again, since the solution I suggested in my Nov 28 post didn't seem to work out on a second attempt. For me it was important to activate the virtual environment with the right Python version before starting my spark-shell session on the master node.
So on the master node I would run
source pyspark_venv_python_3.9.9/bin/activate
and then I would launch my spark-shell session with:
MASTER=yarn-client /usr/bin/spark-shell \
--name my_static_shell \
--queue default \
--driver-memory 20G \
--executor-memory 16G \
--executor-cores 1 \
--num-executors 90 \
--archives s3://mypath/pyspark_venv_python_3.9.9.tar.gz#environment \
--conf spark.yarn.maxAppAttempts=0 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.task.cpus=1 \
--conf spark.kryoserializer.buffer.max=2047mb \
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
--jars s3://path_to_one.jar
Only then is the tracker able to start with a sensible environment:
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.x.x.x, DMLC_TRACKER_PORT=xxxxx, DMLC_NUM_WORKER=80}
If I am not in the virtual environment before launching the shell, the tracker fails.
That's caused by the Python dependency. We have removed the use of Python in the master branch.
Thanks @trivialfis. I am bound to use version 1.7.3, but it's great to hear the Python dependency has been removed in recent versions. It was really a pain to deal with.