XGBoost 4J spark giving XGBoostError: std::bad_alloc on databricks
I am using XGBoost4J-Spark to train a distributed XGBoost model on my data. I am developing the model in Databricks.
The Spark version is 3.1.1, Scala is 2.12, and XGBoost4J is 1.4.1.
My cluster setup is as follows:
- 5 worker nodes, each with 32GB of memory and 16 cores
- Driver node with 14GB of memory and 4 cores
My Spark configuration is: spark.executor.memory 9g, spark.executor.cores 5
So basically I have 10 executors with 4.6GB of memory each and 1 driver with 3.3GB of memory.
I imported the package as below,
import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressionModel, XGBoostRegressor}
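For reference, a minimal sketch of how the xgbRegressor, evaluator and pipeline referenced below could be defined (the column names and the numRound/numWorkers values are placeholders, not my actual settings):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
// Assemble raw feature columns into a single vector column (placeholder column names).
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")
// Distributed regressor; numWorkers should roughly match the number of executors.
val xgbRegressor = new XGBoostRegressor()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setNumRound(100)
  .setNumWorkers(10)
// RMSE evaluator used by the train-validation split below.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val pipeline = new Pipeline().setStages(Array(assembler, xgbRegressor))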
To find the best parameters for my model, I created a parameter grid to use with a train-validation split, as shown below.
//Parameter tuning
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.PipelineModel
import Array._
//Create parameter grid
val paramGrid = new ParamGridBuilder()
.addGrid(xgbRegressor.maxDepth, range(6, 10, 2))
.addGrid(xgbRegressor.eta, Array(0.01))
.addGrid(xgbRegressor.minChildWeight, Array(8.0, 10.0, 12.0, 14.0))
.addGrid(xgbRegressor.gamma, Array(0, 0.25, 0.5, 1))
.build()
I then fit it to my data and saved it,
val trainValidationSplit = new TrainValidationSplit()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setTrainRatio(0.75)
val tvmodel = trainValidationSplit.fit(train)
tvmodel.write.overwrite().save("spark-train-validation-split-28072021")
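As a side note, here is a sketch of how the winning model could be inspected or saved on its own after the fit (this assumes the XGBoost stage is the last stage of the pipeline, which is not shown above, and the save path is a placeholder):
import org.apache.spark.ml.PipelineModel
// bestModel is a PipelineModel because the estimator passed to TrainValidationSplit is a Pipeline.
val bestPipeline = tvmodel.bestModel.asInstanceOf[PipelineModel]
// Assumes the XGBoost stage is the last stage of the pipeline.
val bestXgb = bestPipeline.stages.last.asInstanceOf[XGBoostRegressionModel]
println(bestXgb.extractParamMap())
bestPipeline.write.overwrite().save("best-pipeline-28072021")  // placeholder path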
The error comes up when I try to load the model again,
import org.apache.spark.ml.tuning._
val tvmodel = TrainValidationSplitModel.load("spark-train-validation-split-28072021")
The error message is XGBoostError: std::bad_alloc
I checked the executor and driver logs. The executor logs looked fine. I found the same error in the driver logs (both the stderr and log4j files). Both log files are attached here: stderr.txt, log4j-2021-08-03-13.log
Since the error message was mainly found in the driver logs, I tried the following solutions,
- Increased the driver memory to 28GB
- Increased the driver cores to 8
- Made the driver node the same size as a worker node.
- Set spark.driver.maxResultSize to 0.
- Set spark.driver.memory to 20g.
But all of the above failed. The log files clearly indicate that the driver memory is not overloaded, so I am struggling to work out what the error actually is.
It would be great if one of you could point me in the right direction.
On further investigation, I increased the driver size to 128GB of memory and 32 cores and got the same error.
I continuously monitored the driver metrics and none of them were overloaded.
(Attached driver metric screenshots: number of CPUs used, memory usage, CPU usage, and network. At least 85% of the CPU capacity is idle.)
The log4j file has two different error messages:
- ERROR ScalaDriverLocal: User Code Stack Trace: ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc
- ERROR Instrumentation: ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc
The stderr has one error message:
- ERROR Uncaught throwable from user code: ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc
I am starting to think this is a bug in the package and not in my cluster configuration. Maybe the package is not compatible with the Scala save and load methods.
@wbo4958 Have you seen something similar before?
@jon-targaryen1995 I'm doing the exact same thing (same versions of Scala, Spark and XGBoost4J) and I see the same error when I try to load my model. Any luck with this? I'm using EMR on AWS.
Could you please try the nightly build: https://xgboost.readthedocs.io/en/latest/install.html#id4 ?
No, I haven't seen this before; it seems the issue may be related to https://github.com/dmlc/xgboost/pull/7067
Is this issue resolved? I am getting the exact same error.
@jon-targaryen1995 What did you do to resolve the bad_alloc issues (if you ever did)? Nodes with more memory? A change to a setting in your Spark or XGBoost params?
@wsurles-vamp, which version were you using to repro this issue? Could you file a new issue with detailed steps to repro it? With that, I will check it. Thanks.
Thanks so much @wbo4958 (I am a coworker of @wsurles-vamp). This is sensitive data, so I will need to create a fake dataset that reproduces the issue to give to you, and check what other details are OK to share in public. This is not on Databricks.
I'll open a new issue soon with more details but here are some quick basics:
versions:
- xgboost 1.5.1 (xgboost4j_2.12-1.5.1.jar / xgboost4j-spark_2.12-1.5.1.jar)
- xgboost 1.6 with the fix from https://github.com/dmlc/xgboost/pull/7844
- seen with both Spark 3.1.2 and Spark 3.2.1
- we are using the PySpark wrapper from https://github.com/dmlc/xgboost/pull/4656#issuecomment-510693296, with the only edit being the addition of the killSparkContextOnWorkerFailure param
some dataset info:
- rows: 109 million (51 thousand with label=1)
- cols: 254
- 'objective': 'binary:logistic', 'treeMethod': 'hist' (a rough Scala equivalent is sketched after this list)
- there are no cols of all 0s, but some are very sparse
- we have a few datasets generating this error; this is just one of them, and the row/col counts differ between them. We have some datasets of similar or bigger sizes that do not get the error, and it can be intermittent, with some rounds of hyperparameter tuning hitting it and others not, though we haven't noticed a correlation between params like maxDepth and the error.
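A rough Scala equivalent of those settings (we use the PySpark wrapper in practice; num_round and num_workers here are placeholders, not our real values):
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
// objective and tree_method match the settings above; the other values are placeholders.
val xgbParams = Map(
  "objective" -> "binary:logistic",
  "tree_method" -> "hist",
  "num_round" -> 100,
  "num_workers" -> 26
)
val classifier = new XGBoostClassifier(xgbParams)
  .setFeaturesCol("features")
  .setLabelCol("label")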
here is our specific stack trace:
ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc
at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:54)
at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:43)
at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:574)
at ml.dmlc.xgboost4j.scala.spark.PreXGBoost$.$anonfun$trainForNonRanking$1(PreXGBoost.scala:480)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1481)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1481)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
A couple of other quick points of note: we use dense vectors (to ensure 0s are treated as 0). We have a 26-node Spark cluster with 120 GiB of memory, though we have seen the error on nodes with 180 GiB, and well before memory has run out, as many others have mentioned. We have seen it on CentOS 7 and Rocky Linux 8.
We have a better idea of what is going on now; we'll create a proper issue soon, but it might be a few days. Here is another related question: https://discuss.xgboost.ai/t/useexternalmemory-no-longer-does-anything/2761
FYI @kstock
From xgboost4j-spark 1.6.1, you don't need the killSparkContextOnWorkerFailure parameter any more: 1.6.1 introduced Spark barrier execution mode, which ensures every training task is launched at the same time, and Spark will kill the failed training job if some training tasks fail. In a word, XGBoost will not stop the SparkContext any more.
Looks like the error happened while building the DMatrix. From the error log "ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc", it looks like an OOM issue?
@trivialfis any idea about this error log?
BTW, dense vectors may blow up the memory very quickly.
BTW, have you had a chance to try training with xgboost4j-spark-gpu?
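To put a rough number on the dense-vector point, a back-of-the-envelope sketch (the 10% non-zero figure is an assumption; the real sparsity is not stated above):
// Rough sizing of the Spark-side feature vectors, before any DMatrix copies.
// A DenseVector stores every feature as an 8-byte Double; a SparseVector stores only
// the non-zero entries, roughly 12 bytes each (4-byte index + 8-byte value).
val numRows = 109e6                              // ~109 million rows, from the dataset info above
val numCols = 254
val denseBytes  = numRows * numCols * 8
val sparseBytes = numRows * numCols * 0.1 * 12   // assumes ~10% of entries are non-zero
println(f"dense ~ ${denseBytes / 1e9}%.0f GB, sparse ~ ${sparseBytes / 1e9}%.0f GB")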
@wbo4958, we don't run out of memory per se, but XGBoost does seem to increase the size of the memory it requests by a factor of 1.5, and eventually this is a very big jump that is too much for the system to provide; this is when we get the bad_alloc error.
In this graph of system memory (top) and committed memory (bottom), you can see the committed memory increasing in stages as system memory is used. There are two fits on this chart. The first fit fails (in the middle of the chart) when the big committed memory jump happens (note: there is still ~40 GiB available when it fails; the bottom of the chart is not 0). The second fit finishes a little sooner, never makes that last large memory commit, and succeeds.
With a debugger attached to a custom build, we saw "Resizing ParallelGroupBuilder.data_ to 4505075712 elements" at the time of the failure. The element count was steadily increasing, but I think it's the memory request that's causing the failure.
We are looking into getting more memory on our cluster and spreading out the XGBoost load as much as possible, but is there something we could do to make the committed memory grow more evenly? Or to have XGBoost not request more than is available?
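A toy illustration of the growth pattern described above (this is not the actual XGBoost code; it just shows why a 1.5x growth factor ends in one very large request):
// Toy sketch of geometric (1.5x) buffer growth, not the actual XGBoost allocator.
// Resizing a contiguous buffer typically needs the whole new buffer allocated before the
// old one is freed, so the transient peak is roughly old + new, about 2.5x the old size.
val bytesPerElement = 8L                      // assumed entry size, for illustration only
var capacity = 1L << 20
while (capacity < 4505075712L) {              // the element count seen in the debugger above
  capacity = capacity * 3 / 2
  println(f"resize to $capacity%,d elements, ~ ${capacity * bytesPerElement / 1e9}%.1f GB")
}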
Hi @trivialfis, any idea about https://github.com/dmlc/xgboost/issues/7155#issuecomment-1113898292?
The QuantileDMatrix is coming to the CPU as well if the refactoring goes well. The ParallelGroupBuilder is used during construction of the DMatrix.
Hi guys, do you have any update on this? I am getting this error when reading a Spark pipeline from MLflow. I get the error when I run the line loaded_model = mlflow.spark.load_model(run_path). It looks like XGBoost suddenly starts creating more and more jobs (a bit strange, since I am just loading the model and not doing anything with it yet) and eventually crashes.
@olbapjose, loaded_model = mlflow.spark.load_model(run_path) is Python code? So you were trying to load the model directly, and it crashed?
@wbo4958 Exactly, I was trying to load a trained ML pipeline that I had saved previously, where one of the stages is a Spark XGBoost classifier, and it crashes when it tries to load the XGBoost stage. The stack trace looks like the one pasted in an aforementioned message. I'm working in a Databricks notebook.
EDIT: interestingly, the issue arises inside a Databricks notebook (see the attached file for the complete stack trace) but does not arise when using the databricks-connect package to load the model from a local PyCharm IDE connected to the Databricks cluster.
@olbapjose Would you share the model in zip format so we can repro it in Databricks?
@wbo4958 I suspect it is closely related to #8074, where I have uploaded the trained model I am trying to read. That issue arises both in a Databricks notebook and when using databricks-connect.
Regarding https://github.com/dmlc/xgboost/issues/7155#issuecomment-1113898292: I think the best way forward is to improve support for QuantileDMatrix in the JVM packages.