
XGBoost4J-Spark giving XGBoostError: std::bad_alloc on Databricks

Open jon-targaryen1995 opened this issue 3 years ago • 20 comments

I am using XGBoost4J-Spark to build a distributed XGBoost model on my data. I am developing the model in Databricks.

The Spark version is 3.1.1, with Scala 2.12 and XGBoost4J 1.4.1.

My cluster setup looks as follows:

  • 5 worker nodes, each with 32GB of memory and 16 cores
  • A driver node with 14GB of memory and 4 cores

My cluster configuration is: spark.executor.memory 9g, spark.executor.cores 5

So basically I have 10 executors with 4.6GB of memory each and 1 driver with 3.3GB of memory.
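For reference, those settings map onto the Spark session roughly as in the sketch below (a minimal sketch; on Databricks the values would normally be set in the cluster's Spark config UI rather than in code, and the app name is made up):

    import org.apache.spark.sql.SparkSession

    // Sketch only: applies the executor settings quoted above when building a session.
    // On Databricks the cluster normally provides a pre-configured `spark` session instead.
    val spark = SparkSession.builder()
      .appName("xgboost4j-spark-bad-alloc")   // hypothetical app name
      .config("spark.executor.memory", "9g")
      .config("spark.executor.cores", "5")
      .getOrCreate()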


I imported the package as below: import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressionModel, XGBoostRegressor}
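The xgbRegressor, evaluator and pipeline used further down are not shown here; a minimal sketch of how they might be constructed (the feature columns, numRound and numWorkers values are assumptions, not my actual setup):

    import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.feature.VectorAssembler

    // Hypothetical feature columns; the real ones are not part of this issue.
    val assembler = new VectorAssembler()
      .setInputCols(Array("feature1", "feature2"))
      .setOutputCol("features")

    val xgbRegressor = new XGBoostRegressor()
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setNumRound(100)   // assumed value
      .setNumWorkers(10)  // matches the 10 executors described above

    val evaluator = new RegressionEvaluator()
      .setLabelCol("label")
      .setMetricName("rmse")

    val pipeline = new Pipeline().setStages(Array(assembler, xgbRegressor))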

In order to find the best parameters for my model, I created a parameter grid with train-validation split as shown below,

    //Parameter tuning
    import org.apache.spark.ml.tuning._
    import org.apache.spark.ml.PipelineModel
    import Array._

    //Create parameter grid
    val paramGrid = new ParamGridBuilder()
      .addGrid(xgbRegressor.maxDepth, range(6, 10, 2))
      .addGrid(xgbRegressor.eta, Array(0.01))
      .addGrid(xgbRegressor.minChildWeight, Array(8.0, 10.0, 12.0, 14.0))
      .addGrid(xgbRegressor.gamma, Array(0, 0.25, 0.5, 1))
      .build()

I then fit it to my data and saved it,

    val trainValidationSplit = new TrainValidationSplit()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setTrainRatio(0.75)

    val tvmodel = trainValidationSplit.fit(train)

    tvmodel.write.overwrite().save("spark-train-validation-split-28072021")

The error comes up when I try to load the model again,

    import org.apache.spark.ml.tuning._
    val tvmodel = TrainValidationSplitModel.load("spark-train-validation-split-28072021")

The error message is XGBoostError: std::bad_alloc

I checked the executor and driver logs. The executor logs looked fine. I found the same error in the driver logs (stderr and log4j files). Both log files are attached here: stderr.txt log4j-2021-08-03-13.log

Since the error message was mainly found in the driver logs, I tried the following solutions,

  • Increased the driver memory to 28GB
  • Increased the driver cores to 8
  • Made the driver the same size as a worker
  • Set spark.driver.maxResultSize to 0
  • Set spark.driver.memory to 20g

But all the above failed. The log files clearly indicate that driver memory is not overloaded. Hence I am struggling to find out what the error actually is.

It would be great if one of you could point me in the right direction.

jon-targaryen1995 · Aug 04 '21 07:08

On further investigation, I increased the driver size to 128GB and 32 cores, and got the same error.

I continuously monitored the driver metrics and none of them were overloaded.

Number of CPUs used (screenshot)

Memory usage (screenshot)

CPU usage as a share of capacity: at least 85% of the CPU capacity is idle (screenshot)

Network (screenshot)

The log4j file has two different error messages:

  • ERROR ScalaDriverLocal: User Code Stack Trace: ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc

  • ERROR Instrumentation: ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc

The stderr log has one error message, shown below:

  • ERROR Uncaught throwable from user code: ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc

I am starting to think this is a bug in the package and not my cluster configuration. Maybe the package is not compatible with the Scala save and load methods.

jon-targaryen1995 · Aug 05 '21 10:08

@wbo4958 Have you seen something similar before?

trivialfis · Aug 05 '21 14:08

@jon-targaryen1995 I'm doing the exact same thing (same versions of Scala, Spark and XGBoost4J) and I see the same error when I try to load my model. Any luck with this? I'm using EMR on AWS.

kumarprabhu1988 · Aug 19 '21 01:08

Could you please try the nightly build: https://xgboost.readthedocs.io/en/latest/install.html#id4 ?

trivialfis · Aug 19 '21 05:08

No, I haven't seen this; the issue may be related to https://github.com/dmlc/xgboost/pull/7067

wbo4958 · Aug 19 '21 05:08

Is this issue resolved? I am getting the exact same error.

prashantprakash · Nov 01 '21 20:11

@jon-targaryen1995 What did you do to resolve the bad_alloc issues (if you ever did)? Nodes with more memory? A change to a setting in your Spark or XGBoost params?

wsurles-vamp · Apr 28 '22 22:04

@wsurles-vamp, which version were you using to reproduce this issue? Could you file a new issue with detailed steps to reproduce it? With that, I will check it. Thx

wbo4958 · Apr 29 '22 07:04

Thanks so much @wbo4958 (I am a coworker of @wsurles-vamp). This is sensitive data, so I will need to create a fake dataset to reproduce it for you and check what other details are OK to share in public. This is not on Databricks.

I'll open a new issue soon with more details but here are some quick basics:

Versions:

  • xgboost 1.5.1 (xgboost4j_2.12-1.5.1.jar / xgboost4j-spark_2.12-1.5.1.jar)
  • xgboost 1.6 with the fix from https://github.com/dmlc/xgboost/pull/7844

We have seen this with Spark 3.1.2 and Spark 3.2.1.

We are using the pyspark wrapper from https://github.com/dmlc/xgboost/pull/4656#issuecomment-510693296, with the only edit being the addition of the killSparkContextOnWorkerFailure param.


Some dataset info:

  • rows: 109 million (51 thousand with label=1)
  • cols: 254
  • 'objective': 'binary:logistic', 'treeMethod': 'hist'

There are no columns that are all 0s, but some are very sparse.

We have a few datasets generating this error; this is just one of them, and the row/column counts differ between them. We have some datasets of similar or bigger sizes that do not hit the error, and it can be intermittent, with some rounds of hyperparameter tuning hitting it and others not, though we haven't noticed a correlation between params like maxDepth and the error.
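To make that concrete, here is a hedged Scala sketch of an estimator with those settings (we actually run the linked pyspark wrapper; everything except the objective and tree method below is an assumption):

    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

    // Sketch of the training setup described above. Only "objective" and
    // "tree_method" come from this comment; the other values are assumptions.
    val xgbClassifier = new XGBoostClassifier(Map(
        "objective"   -> "binary:logistic",
        "tree_method" -> "hist"))
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setNumWorkers(26)                         // assumed: one worker per node
      .setKillSparkContextOnWorkerFailure(false) // the param we patched into the pyspark wrapper (setter name as we recall it)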


here is our specific stack trace:

ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc
	at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
	at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:54)
	at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:43)
	at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:574)
	at ml.dmlc.xgboost4j.scala.spark.PreXGBoost$.$anonfun$trainForNonRanking$1(PreXGBoost.scala:480)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1481)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1481)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

kstock · Apr 29 '22 10:04

A couple of other quick points of note: we use dense vectors (to ensure 0s are treated as 0). We have a 26-node Spark cluster with 120GiB of memory, though we have also seen it on nodes with 180GiB, and well before memory has run out, as many others have mentioned. We have seen it on CentOS 7 and Rocky Linux 8.
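To illustrate the dense-vector point, a small sketch (not our actual code) of the same row built both ways, and of the missing-value knob that makes the distinction matter:

    import org.apache.spark.ml.linalg.Vectors
    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

    // The same three-feature row as a dense and as a sparse vector. Sparse vectors
    // simply do not store the zero entries, which is why, depending on how missing
    // values are configured, explicit zeros can end up treated as missing.
    val dense  = Vectors.dense(0.0, 1.5, 0.0)
    val sparse = Vectors.sparse(3, Array(1), Array(1.5))

    // Hedged: setting the missing value explicitly (NaN here) is one way to keep
    // zeros meaningful; we chose dense vectors instead to sidestep the question.
    val clf = new XGBoostClassifier().setMissing(Float.NaN)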

wsurles-vamp · Apr 29 '22 14:04

We have a better idea of what is going on now. We'll create a proper issue soon, but it might be a few days. Here is another related question: https://discuss.xgboost.ai/t/useexternalmemory-no-longer-does-anything/2761

kstock · Apr 29 '22 19:04

FYI @kstock

From xgboost4j-spark 1.6.1, you don't need the killSparkContextOnWorkerFailure parameter any more, since 1.6.1 introduced Spark barrier execution mode to ensure every training task is launched at the same time, and Spark will kill the failed training job if some training tasks fail. In a word, XGBoost will not stop the SparkContext any more.

we are using the pyspark wrapper from https://github.com/dmlc/xgboost/pull/4656#issuecomment-510693296 with only edit being adding in killSparkContextOnWorkerFailure param

Looks like the error happened while building the DMatrix. From the error log "ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc", it looks like an OOM issue?

@trivialfis any idea about this error log?

BTW, the dense vector may blow up the memory very quickly.
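As a rough illustration of that with the dataset size mentioned above (a back-of-the-envelope sketch; it assumes 4 bytes per stored float value and ignores per-row and JVM object overhead):

    // Rough size of the dense feature values alone for 109M rows x 254 cols.
    val rows          = 109000000L
    val cols          = 254L
    val bytesPerValue = 4L   // assumed: single-precision floats
    val denseGiB      = rows * cols * bytesPerValue / math.pow(2, 30)
    println(f"~$denseGiB%.0f GiB of raw dense feature values")  // roughly 103 GiB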

BTW, have you had a chance to use xgboost4j-spark-gpu to train?

wbo4958 · Apr 30 '22 01:04

@wbo4958, we don't run out of memory per se, but XGBoost does seem to increase the size of the memory it requests by a factor of 1.5, and eventually this is a very big jump and too much for the system to provide; that is when we get the bad_alloc error.

In this graph of system memory (top) and committed memory (bottom), you can see the committed memory increase in stages as system memory is used. There are two fits on this chart. The first fit fails (in the middle of the chart) when the big committed-memory jump happens. (Note: there is still ~40GiB available when it fails; the bottom of the chart is not 0.) The second fit finishes a little sooner, never makes that last large memory commit, and it succeeded.

(screenshot of the memory graph described above)

With a debugger attached to a custom build, we saw "Resizing ParallelGroupBuilder.data_ to 4505075712 elements" at the time of the failure. The element count was steadily increasing, but I think it's the size of the memory request that's causing the failure.
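For a rough sense of scale (a sketch that assumes each element of ParallelGroupBuilder.data_ is an 8-byte index/value entry, which is our assumption, not something we verified):

    // Back-of-the-envelope size of the failed resize, assuming 8 bytes per element.
    val elements      = 4505075712L
    val bytesPerEntry = 8L   // assumed entry size
    val resizeGiB     = elements * bytesPerEntry / math.pow(2, 30)
    println(f"~$resizeGiB%.1f GiB requested in a single resize")  // roughly 33.6 GiB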

We are looking into getting more memory on our cluster and spreading out the xgboost load as much as possible, but is there something we could do to make the committed memory grow more evenly? Or have xgboost not request more than is available?

wsurles-vamp · Apr 30 '22 02:04

Hi @trivialfis, any idea about https://github.com/dmlc/xgboost/issues/7155#issuecomment-1113898292?

wbo4958 · Apr 30 '22 07:04

QuantileDMatrix is coming to the CPU as well, if the refactoring goes well. The ParallelGroupBuilder is used during the construction of DMatrix.

trivialfis · Apr 30 '22 08:04

Hi guys, do you have any update on this? I am getting this error when reading a Spark pipeline from mlflow. I get the error when I try to run the line loaded_model = mlflow.spark.load_model(run_path). It looks like XGBoost suddenly starts creating more and more jobs (a bit strange, since I am only loading the model and not doing anything with it yet) and eventually crashes.

olbapjose · Jun 13 '22 10:06

Hi guys, do you have any update on this? I am getting this error when reading a Spark pipeline from mlflow. I get the error when I try to run the line loaded_model = mlflow.spark.load_model(run_path). Looks like XGBoost suddenly starts creating more and more jobs (a bit strange if I just want to load the model because I am doing nothing with it yet) and eventually crashes.

@olbapjose, is loaded_model = mlflow.spark.load_model(run_path) the Python code? So you were trying to load the model directly, and it crashed?

wbo4958 · Jun 13 '22 22:06

@wbo4958 Exactly, I was trying to load a trained ML pipeline that I had saved previously, where one of the stages is a Spark XGBoost classifier, and it crashes when it tries to load the XGBoost stage. The stack trace looks like the one pasted in an earlier message. I'm working in a Databricks notebook.

EDIT: interestingly, the issue arises inside a Databricks notebook (see the attached file for the complete stack trace) but does not arise when using the databricks-connect package to load the model from a local PyCharm IDE connected to the Databricks cluster.

stacktrace.txt

olbapjose · Jun 14 '22 05:06

@olbapjose Would you share the model in zip format, so we can reproduce it in Databricks?

wbo4958 · Jun 15 '22 02:06

@wbo4958 I suspect it is closely related to #8074, where I have uploaded the trained model I am trying to read. That issue arises both in a Databricks notebook and when using databricks-connect.

olbapjose · Jul 18 '22 08:07

Regarding https://github.com/dmlc/xgboost/issues/7155#issuecomment-1113898292, I think the best way forward is to improve support for QuantileDMatrix in the JVM packages.

trivialfis · Sep 18 '23 20:09