
XGBoost-Spark training fails sometimes due to "java.lang.NumberFormatException: For input string: "inf""

Open firestarman opened this issue 5 years ago • 11 comments

We are hitting this exception when repeatedly running a regression training job with the XGBoost JVM jars built from the latest master branch using Scala 2.11, along with our Spark example app jar.

This issue doesn't occur every time, but at roughly a 50% rate in our test environment, described below:

  • OS: Ubuntu 18.04
  • CPU: 12 cores
  • Memory: 64G (~40G available)
  • Spark: 2.4.3 with Hadoop 2.7, standalone mode with one worker on the same node
  • Data: 1.5G of parquet files
  • Tree method: hist
  • numRound: 100

The command is:

spark-submit --class ai.rapids.spark.examples.taxi.CPUMain --master spark://10.19.183.78:7077 --executor-memory 32G --jars /data/github/xgboost4j_2.11-1.0.0-SNAPSHOT.jar,/data/github/xgboost4j-spark_2.11-1.0.0-SNAPSHOT.jar /data/github/sample_xgboost_apps-0.1.4.jar -trainDataPath=/data/taxi/parquet/tr/ -evalDataPath=/data/taxi/parquet/test -format=parquet -showFeatures=0 -numWorkers=11 -treeMethod=hist -numRound=100

and the error log follows:

19/09/11 17:29:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_0 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_3 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_1 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_5 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_4 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_6 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_10 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_9 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_2 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_7 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_8 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_2 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_3 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_1 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_9 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_5 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_4 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_10 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_6 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_0 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_8 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_7 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 ERROR Executor: Exception in task 9.0 in stage 3.0 (TID 24)
java.lang.NumberFormatException: For input string: "inf"
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
	at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
	at java.lang.Float.parseFloat(Float.java:451)
	at java.lang.Float.valueOf(Float.java:416)
	at ml.dmlc.xgboost4j.java.Booster.evalSet(Booster.java:246)
	at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:202)
	at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:64)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$buildDistributedBooster(XGBoost.scala:330)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:413)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:409)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
19/09/11 17:31:50 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 22)
java.lang.NumberFormatException: For input string: "inf"
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
	at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
	at java.lang.Float.parseFloat(Float.java:451)
	at java.lang.Float.valueOf(Float.java:416)
	at ml.dmlc.xgboost4j.java.Booster.evalSet(Booster.java:246)
	at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:202)

Concerning the jars and data files, please refer to https://drive.google.com/open?id=1WKXyll4rmhwLU2qJTrLb6YYh-wsB8XQ3 .

For more on our example app, please refer to https://github.com/rapidsai/spark-examples .

firestarman avatar Sep 11 '19 13:09 firestarman

are you using a customized eval func?

CodingCat avatar Sep 11 '19 13:09 CodingCat

No. If you mean the 'custom_eval' parameter, I didn't set it.

firestarman avatar Sep 11 '19 13:09 firestarman

So the reason is that, when evaluating the model, XGBoost produces an "inf" for, e.g., rmse.

You may need to track down why that is the case.
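A detail worth noting while tracking this down (my own observation, not from the thread): the XGBoost native code formats infinity in C style as "inf", but java.lang.Float only accepts the Java spellings "Infinity" and "NaN". So as soon as an eval metric becomes infinite, the parse in Booster.evalSet throws exactly this NumberFormatException. A minimal Java sketch:

```java
// Demonstrates why an infinite eval metric breaks the JVM wrapper:
// Float.parseFloat accepts "Infinity" but rejects the C-style "inf"
// that the native library prints, raising NumberFormatException.
public class InfParseDemo {
    public static void main(String[] args) {
        // Java's own spelling of infinity parses fine:
        float ok = Float.parseFloat("Infinity");
        System.out.println(Float.isInfinite(ok)); // true

        // The C-style spelling does not:
        try {
            Float.parseFloat("inf");
            System.out.println("parsed");
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

So the underlying question is why the metric (rmse here) goes to infinity during training; the parse failure is only the symptom that surfaces in the JVM layer.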

CodingCat avatar Sep 11 '19 13:09 CodingCat

So the reason is that, when evaluating the model, XGBoost produces an "inf" for, e.g., rmse.

You may need to track down why that is the case.

Thanks a lot, but I'm not familiar with the XGBoost native code. Can anyone help with this?

firestarman avatar Sep 11 '19 14:09 firestarman

Could you reproduce it in a non-JVM environment?

trivialfis avatar Sep 11 '19 16:09 trivialfis

Could you reproduce it in a non-JVM environment?

will try

firestarman avatar Sep 12 '19 01:09 firestarman

@firestarman I met the same problem as you. Have you resolved it?

huaxz1986 avatar Apr 22 '20 13:04 huaxz1986

I'm facing the same issue as well! And it only occurs when I use the objective count:poisson, and not necessarily for others. (@firestarman and @trivialfis, maybe this hint is helpful; sorry, I'm not able to debug further as I'm not good with Scala or Java.) I'm running this on Spark 3.0 with xgboost4j_spark_2_12_1_3_1.jar and xgboost4j_2_12_1_3_1.jar.

21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_4 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_0 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_6 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_2 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_4 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_6 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_0 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_2 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 ERROR Executor: Exception in task 4.0 in stage 8.0 (TID 27)
java.lang.NumberFormatException: For input string: "inf"
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
	at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
	at java.lang.Float.parseFloat(Float.java:451)
	at java.lang.Float.valueOf(Float.java:416)
	at ml.dmlc.xgboost4j.java.Booster.evalSet(Booster.java:251)
	at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:215)
	at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:284)
	at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
	at scala.Option.getOrElse(Option.scala:189)
	at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
	at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:416)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainForNonRanking$1(XGBoost.scala:499)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:844)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:844)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:369)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1376)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1303)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1367)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1187)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:318)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
	at org.apache.spark.scheduler.Task.run(Task.scala:117)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:640)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

saikiranvadhi avatar Feb 13 '21 15:02 saikiranvadhi

I am facing the same issue with the objective "count:poisson", but it works fine with other objective functions. I am using the Databricks environment, importing XGBoost as import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressor}.

Can anybody help?

java.lang.NumberFormatException: For input string: "inf"
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
	at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
	at java.lang.Float.parseFloat(Float.java:451)
	at java.lang.Float.valueOf(Float.java:416)
	at ml.dmlc.xgboost4j.java.Booster.evalSet(Booster.java:243)
	at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:231)
	at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:304)
	at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
	at scala.Option.getOrElse(Option.scala:189)
	at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
	at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:416)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainForNonRanking$1(XGBoost.scala:499)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:868)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:868)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:393)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1486)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1413)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1477)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1296)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:391)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:342)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:93)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

musram avatar Feb 05 '22 08:02 musram

I'm facing the same issue using count:poisson. In my case, using a CustomEval makes training succeed.

(I use CustomEval because I'd like the poisson-nloglik evaluation metric, but it cannot be used.)

colticol avatar Aug 05 '22 02:08 colticol

I am training a GeneralizedLinearRegression model on Spark and want to print the model summary with model.summary. It throws an error, and I could not find any helpful resources online. Here is the error message. Any suggestions on how I can solve it? Thank you, Basazin

java.lang.NumberFormatException

Py4JJavaError                             Traceback (most recent call last)
in
      8 # print(glr_model.summary)
      9 summary = glr_model.summary
---> 10 print(glr_model.summary)

/databricks/spark/python/pyspark/ml/regression.py in __repr__(self)
   2472
   2473     def __repr__(self):
-> 2474         return self._call_java("toString")
   2475
   2476

/databricks/spark/python/pyspark/ml/wrapper.py in _call_java(self, name, *args)
     52         sc = SparkContext._active_spark_context
     53         java_args = [_py2java(sc, arg) for arg in args]
---> 54         return _java2py(sc, m(*java_args))
     55
     56     @staticmethod

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o10522.toString.
: java.lang.NumberFormatException
	at java.math.BigDecimal.<init>(BigDecimal.java:497)
	at java.math.BigDecimal.<init>(BigDecimal.java:827)
	at scala.math.BigDecimal$.decimal(BigDecimal.scala:53)
	at scala.math.BigDecimal$.apply(BigDecimal.scala:250)
	at org.apache.spark.ml.regression.GeneralizedLinearRegressionTrainingSummary.round$1(GeneralizedLinearRegression.scala:1542)
	at org.apache.spark.ml.regression.GeneralizedLinearRegressionTrainingSummary.$anonfun$toString$2(GeneralizedLinearRegression.scala:1551)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
	at scala.collection.AbstractIterator.to(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
	at org.apache.spark.ml.regression.GeneralizedLinearRegressionTrainingSummary.$anonfun$toString$1(GeneralizedLinearRegression.scala:1560)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at org.apache.spark.ml.regression.GeneralizedLinearRegressionTrainingSummary.toString(GeneralizedLinearRegression.scala:1547)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)
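Note that this last trace is a different code path from the XGBoost one earlier in the thread, though the symptom is similar (my reading, not confirmed by the thread): the GLR summary apparently contains a NaN or infinite statistic, and the rounding in GeneralizedLinearRegressionTrainingSummary.toString goes through java.math.BigDecimal, whose double constructor rejects NaN and infinity. A minimal Java sketch of the failing conversion:

```java
import java.math.BigDecimal;

// BigDecimal(double) throws NumberFormatException for NaN or infinity,
// which matches the BigDecimal.<init> frame in the trace above when a
// summary statistic is degenerate.
public class NanBigDecimalDemo {
    public static void main(String[] args) {
        // A finite value converts fine:
        System.out.println(new BigDecimal(1.5));

        // A NaN statistic (hypothetical example value) does not:
        try {
            new BigDecimal(Double.NaN);
            System.out.println("converted");
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}
```

If that is what is happening here, inspecting the raw coefficients and standard errors (rather than the formatted summary string) should reveal which statistic is NaN or infinite.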

Bassa1 avatar Aug 09 '22 22:08 Bassa1