[jvm-packages] XGBoost4J-Spark 2.0.0-RC1 fails for Spark 3.4.0 on EMR
Hello everybody,
I am trying to implement XGBoost4J-Spark in a Scala project. Everything works fine locally (on an Intel MacBook), but when deploying to EMR I receive the following error (running on EMR 6.12.0 and Spark 3.4.0 with Scala 2.12.17):
java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
In my build.sbt I added the following lines to libraryDependencies, as suggested by the tutorial (running with sbt 1.9.2):
"ml.dmlc" %% "xgboost4j" % "1.7.6",
"ml.dmlc" %% "xgboost4j-spark" % "1.7.6"
I packaged everything up into a single JAR via the sbt-assembly plugin, which I expected to bundle all the dependencies needed to run the Spark application on EMR, so I am really out of ideas about this error. I am not sure whether this is a mistake on my end or an actual bug. Help is appreciated!
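For context, the relevant part of my build.sbt looked roughly like this (a sketch rather than my exact file; marking Spark as provided and the merge strategy shown are the usual sbt-assembly setup, not necessarily what I had):

// build.sbt sketch -- Spark itself is provided by EMR, so it is excluded from the fat JAR
ThisBuild / scalaVersion := "2.12.17"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"       % "3.4.0" % "provided",
  "org.apache.spark" %% "spark-mllib"     % "3.4.0" % "provided",
  "ml.dmlc"          %% "xgboost4j"       % "1.7.6",
  "ml.dmlc"          %% "xgboost4j-spark" % "1.7.6"
)

// sbt-assembly merge strategy: discard only the usual conflicting META-INF metadata so that
// resources such as the bundled native library (libxgboost4j.so) survive assembly
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}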
Can you install XGBoost4J-Spark from Maven Central? Building the JARs locally is more complex, as you might run into issues with bundling the native library (libxgboost4j.so).
I assume you mean to build the packages from source?
I tried that on the master node of the EMR cluster, but ran into errors. I ran the following steps:
Installing Maven:
wget https://dlcdn.apache.org/maven/maven-3/3.9.4/binaries/apache-maven-3.9.4-bin.tar.gz
tar xzvf apache-maven-3.9.4-bin.tar.gz
PATH=$PATH:/home/hadoop/apache-maven-3.9.4/bin
Cloning the repo, switching to the 1.7.6 commit, and then packaging it up (following the steps in the tutorial):
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
git checkout 36eb41c
cd jvm-packages
mvn package
This ended with the following error:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for XGBoost JVM Package 1.7.6:
[INFO]
[INFO] XGBoost JVM Package ................................ SUCCESS [ 2.298 s]
[INFO] xgboost4j_2.12 ..................................... FAILURE [ 1.973 s]
[INFO] xgboost4j-spark_2.12 ............................... SKIPPED
[INFO] xgboost4j-flink_2.12 ............................... SKIPPED
[INFO] xgboost4j-example_2.12 ............................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4.651 s
[INFO] Finished at: 2023-08-22T23:09:24Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:exec (native) on project xgboost4j_2.12: Command execution failed.: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <args> -rf :xgboost4j_2.12
which seems to be caused by this:
File "create_jni.py", line 125
run(f'"{sys.executable}" mapfeat.py')
^
SyntaxError: invalid syntax
[ERROR] Command execution failed.
Similarly, I tried to install it by running ./xgboost/jvm-packages/dev/build-linux.sh, as suggested by the README in jvm-packages. This too fails somewhere down the line with:
docker: Error response from daemon: pull access denied for dmlc/xgboost4j-build, repository does not exist or may require 'docker login'
I feel like I am well off the beaten path here and am probably missing something quite obvious...
Do you have a working Python 3 installation?
I didn't realize you have to build from source when using EMR. Do you need an uber-JAR with all dependencies included? I found it hard to build such a JAR.
Yes, I do have a Python 3 installation, but it seems like this error is caused by an older interpreter running create_jni.py: the f-string syntax used on line 125 requires Python 3.6+. Invoking python from the command line opens a Python 3.7.16 shell, so I am not sure how or why Python 2 is being invoked.
I just want something that reliably works in production; building these uber-JARs hasn't failed me so far.
@nawidsayed, I guess you can hack the python path from here https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/pom.xml#L88
Thanks for your help so far, everybody! I noticed that I was running on EMR with Graviton 2 processors (r6gd instances), which are ARM-based and, I believe, might not be well supported by XGBoost4J. I switched to r5d instances (with Intel Xeon) and all the dependencies now seem to be present. However, I am still encountering a very generic error (from the master node):
23/08/23 08:55:07 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
23/08/23 08:55:07 ERROR RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 1
23/08/23 08:55:08 ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:418) ~[test.jar:1.0.0]
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:202) ~[test.jar:1.0.0]
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:34) ~[test.jar:1.0.0]
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114) ~[spark-mllib_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:65) ~[test.jar:1.0.0]
at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:18) ~[test.jar:1.0.0]
at com.jobs.SparkJobWithJson.main(SparkJobWithJson.scala:34) ~[test.jar:1.0.0]
at com.jobs.PrototypeSparkJob.main(PrototypeSparkJob.scala) ~[test.jar:1.0.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_382]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_382]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_382]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_382]
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
Exception in thread "main" ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:418)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:202)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:34)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:65)
at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:18)
at com.jobs.SparkJobWithJson.main(SparkJobWithJson.scala:34)
at com.jobs.PrototypeSparkJob.main(PrototypeSparkJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
I haven't found a remedy for this. Almost all references to this error involve the GPU implementation, which is not the case for me here. It's confusing because if I check stderr on the executor, it's clear that the training is actually happening and there is no indication of an error there:
[09:34:54] [0] train-mlogloss:0.98398036223191476
[09:34:54] [0] train-mlogloss:0.97309246502424540
[09:34:54] [1] train-mlogloss:0.88586563941759944
[09:34:54] [1] train-mlogloss:0.86604528834945282
[09:34:54] [2] train-mlogloss:0.80109514334262943
[09:34:54] [2] train-mlogloss:0.77383518846411459
[09:34:54] [3] train-mlogloss:0.72730388396825540
[09:34:54] [4] train-mlogloss:0.66267788104521919
[09:34:54] [3] train-mlogloss:0.69377712826979787
[09:34:54] [5] train-mlogloss:0.60579290756812465
[09:34:54] [4] train-mlogloss:0.62382606613008595
[09:34:54] [6] train-mlogloss:0.55597261434946310
....
23/08/23 09:34:56 INFO Executor: 1 block locks were not released by task 0.0 in stage 4.0 (TID 4)
[rdd_25_0]
23/08/23 09:34:56 INFO MemoryStore: Block taskresult_4 stored as bytes in memory (estimated size 6.1 MiB, free 6.1 GiB)
23/08/23 09:34:56 INFO Executor: Finished task 0.0 in stage 4.0 (TID 4). 6442557 bytes result sent via BlockManager)
23/08/23 09:34:56 INFO YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
23/08/23 09:34:56 INFO MemoryStore: MemoryStore cleared
23/08/23 09:34:56 INFO BlockManager: BlockManager stopped
23/08/23 09:34:56 INFO ShutdownHookManager: Shutdown hook called
23/08/23 09:34:56 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1692778153848_0006/spark-d1d9df8a-dde2-47ce-8ca7-fc09fde80055
So it seems like it's related to Spark and XGBoost versioning. Using Spark 3.4.0 on Scala 2.12 with the XGBoost packages at version 1.7.6, I get the aforementioned error, which is probably related to the Rabit tracker: stdout prints Tracker started, with env={} just before erroring out.
However, I don't have any issues when running Spark 2.4.8 on Scala 2.11 with xgboost4j and xgboost4j-spark version 1.1.2. In that case, just before the training routine, stdout reads: Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=172.31.89.29, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=36}.
Is there any way to make it work properly with Spark 3.4?
XGBoost 1.7.6 supports Spark 3.0.1: https://github.com/dmlc/xgboost/blob/36eb41c960483c8b52b44082663c99e6a0de440a/jvm-packages/pom.xml#L37
You can use XGBoost 2.0.0 to use Spark 3.4.0: https://github.com/dmlc/xgboost/blob/4301558a5711e63bbf004d2b6fca003906fb743c/jvm-packages/pom.xml#L38
Thanks for pointing this out. Unfortunately, adding the library according to the instructions here fails in the following way when running sbt compile (note the unresolved ${scala.binary.version} placeholder in the artifact path):
[error] (update) java.net.URISyntaxException: Illegal character in path at index 106: https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/release/ml/dmlc/xgboost4j_2.12/2.0.0-RC1/xgboost4j_${scala.binary.version}-2.0.0-RC1.jar
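For reference, what I added based on those instructions was roughly the following (a sketch; the resolver name is an arbitrary label of mine, and the URL is the repository root from the error above):

resolvers += "XGBoost4J Release Repo" at "https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/release/"

libraryDependencies ++= Seq(
  "ml.dmlc" %% "xgboost4j"       % "2.0.0-RC1",
  "ml.dmlc" %% "xgboost4j-spark" % "2.0.0-RC1"
)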
Even when manually adding the 2.0.0-RC1 packages to the JAR, we run into the Rabit tracker error:
Tracker started, with env={}
23/08/23 16:53:11 ERROR RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 1
23/08/23 16:53:12 ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
Even after this error, the executors still proceed with training, according to their logs:
[16:59:23] [97] train-mlogloss:0.63060436905694506
[16:59:23] [97] train-mlogloss:0.63249886897005347
[16:59:24] [98] train-mlogloss:0.63300104089375020
...
I think we should prioritize the refactoring of the tracker; otherwise, JVM-related issues like this are quite difficult to resolve.
Is it possible the tracker is also running with Python 2?
I don't know; isn't it written in C? The default python command resolves to Python 3.7.16 on EMR, though. Anyway, I was able to run XGBoost4J-Spark 1.1.2 on EMR 5.36.1 (Spark 2.4.8) successfully, and I didn't change anything besides the EMR and XGBoost versions to get it running.
If it helps, I could write out a minimal example that leads to the aforementioned success and failure, respectively.
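Roughly, the training part of the job is nothing more exotic than this (a trimmed-down sketch; the column names, parameters, and object names are illustrative placeholders rather than my actual code):

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MinimalXGBoostJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xgboost-minimal").getOrCreate()
    import spark.implicits._

    // tiny toy dataset; the real job reads its features from S3
    val raw = Seq((1.0, 2.0, 0), (2.0, 1.0, 1), (3.0, 0.5, 2), (0.5, 3.0, 0))
      .toDF("f1", "f2", "label")

    // assemble the feature columns into the single vector column XGBoost expects
    val assembled = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
      .transform(raw)

    val classifier = new XGBoostClassifier(Map(
      "objective"   -> "multi:softprob",
      "num_class"   -> 3,
      "num_round"   -> 10,
      "num_workers" -> 2
    )).setFeaturesCol("features").setLabelCol("label")

    // this is the call that aborts with the tracker error on EMR
    val model = classifier.fit(assembled)
    model.transform(assembled).show()

    spark.stop()
  }
}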
I bumped into the exact same generic error reported by the OP, using a very similar setup (EMR 6.5.0, Spark 3.1.2). Even though I am using Scala Spark, there is a Python dependency through RabitTracker, which requires Python >= 3.8. But EMR 6.5.0 provides Python 3.7. Setting up a virtual environment that allows the cluster to use a higher Python version solved the problem for me.
Coming back again, since the solution I suggested in my Nov 28 post didn't seem to work out on a second attempt. For me it was important to activate the virtual environment with the right Python version before starting my spark-shell session on the master node.
So on the master node I would run
source pyspark_venv_python_3.9.9/bin/activate
and then I would launch my spark-shell session with:
MASTER=yarn-client /usr/bin/spark-shell \
--name my_static_shell \
--queue default \
--driver-memory 20G \
--executor-memory 16G \
--executor-cores 1 \
--num-executors 90 \
--archives s3://mypath/pyspark_venv_python_3.9.9.tar.gz#environment \
--conf spark.yarn.maxAppAttempts=0 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.task.cpus=1 \
--conf spark.kryoserializer.buffer.max=2047mb \
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
--jars s3://path_to_one.jar
Only then is the tracker able to start with a sensible environment:
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.x.x.x, DMLC_TRACKER_PORT=xxxxx, DMLC_NUM_WORKER=80}
If I am not in the virtual environment before launching the shell, the tracker fails.
That's caused by the Python dependency. We have removed the use of Python in the master branch.
Thanks @trivialfis. I am bound to use version 1.7.3, but it's great to hear the Python dependency has been removed in recent versions. It was really a pain to deal with.