xgboost
XGBoost-Spark training fails sometimes due to "java.lang.NumberFormatException: For input string: "inf""
We hit this exception when repeatedly running a regression training job with the xgboost JVM jars built from the latest master branch using Scala 2.11, along with our Spark example app jar.
The issue does not occur on every run, but it reproduces in nearly 50% of runs in our test environment, described below:
- OS: Ubuntu 18.04
- CPU: 12 cores
- Memory: 64G (Available ~40G)
- Spark: 2.4.3 with Hadoop 2.7, standalone mode with one worker on the same node.
- Data: 1.5G parquet files
- Tree Method: hist
- numRound: 100
The command is:
spark-submit --class ai.rapids.spark.examples.taxi.CPUMain --master spark://10.19.183.78:7077 --executor-memory 32G --jars /data/github/xgboost4j_2.11-1.0.0-SNAPSHOT.jar,/data/github/xgboost4j-spark_2.11-1.0.0-SNAPSHOT.jar /data/github/sample_xgboost_apps-0.1.4.jar -trainDataPath=/data/taxi/parquet/tr/ -evalDataPath=/data/taxi/parquet/test -format=parquet -showFeatures=0 -numWorkers=11 -treeMethod=hist -numRound=100
The error log follows:
19/09/11 17:29:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_0 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_3 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_1 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_5 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_4 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_6 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_10 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_9 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_2 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_7 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Putting block rdd_13_8 failed due to exception java.lang.NumberFormatException: For input string: "inf".
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_2 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_3 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_1 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_9 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_5 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_4 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_10 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_6 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_0 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_8 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 WARN BlockManager: Block rdd_13_7 could not be removed as it was not found on disk or in memory
19/09/11 17:31:50 ERROR Executor: Exception in task 9.0 in stage 3.0 (TID 24)
java.lang.NumberFormatException: For input string: "inf"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
at java.lang.Float.parseFloat(Float.java:451)
at java.lang.Float.valueOf(Float.java:416)
at ml.dmlc.xgboost4j.java.Booster.evalSet(Booster.java:246)
at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:202)
at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:64)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$buildDistributedBooster(XGBoost.scala:330)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:413)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForNonRanking$1.apply(XGBoost.scala:409)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/09/11 17:31:50 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 22)
java.lang.NumberFormatException: For input string: "inf"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
at java.lang.Float.parseFloat(Float.java:451)
at java.lang.Float.valueOf(Float.java:416)
at ml.dmlc.xgboost4j.java.Booster.evalSet(Booster.java:246)
at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:202)
Concerning the jars and data files, please refer to https://drive.google.com/open?id=1WKXyll4rmhwLU2qJTrLb6YYh-wsB8XQ3 .
For more on our example app, please refer to https://github.com/rapidsai/spark-examples .
Are you using a customized eval function?
No, if you mean the parameter 'custom_eval'; I didn't set it.
So the reason is that, when evaluating the model, xgboost produces "inf" for a metric (e.g. rmse), and the JVM side then fails to parse that string as a float.
You may need to track down why the metric becomes infinite.
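To illustrate the parsing failure seen in the stack traces: Java's `Float.valueOf` accepts "Infinity" and "NaN" but rejects the C-style token "inf". The following is a minimal sketch of a lenient parser, not the actual xgboost4j code or its fix. The class `LenientFloat`, its methods, and the assumed evaluation-line layout (`[iter]<TAB>name-metric:value...`) are hypothetical and used only for illustration.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of xgboost4j. */
final class LenientFloat {
    private LenientFloat() {}

    /** Returns true when the strict JDK parser rejects the token. */
    static boolean strictParseFails(String raw) {
        try {
            Float.valueOf(raw);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    /** Parses a metric value, also accepting C-style "inf"/"nan" tokens. */
    static float parse(String raw) {
        String trimmed = raw.trim();
        switch (trimmed.toLowerCase(Locale.ROOT)) {
            case "inf":
            case "+inf":
                return Float.POSITIVE_INFINITY;
            case "-inf":
                return Float.NEGATIVE_INFINITY;
            case "nan":
            case "-nan":
                return Float.NaN;
            default:
                // Everything else goes through the normal JDK parser.
                return Float.parseFloat(trimmed);
        }
    }

    /**
     * Parses an evaluation line, assuming a tab-separated layout such as
     * "[3]\ttrain-rmse:inf\ttest-rmse:0.5" (the layout is an assumption).
     */
    static float[] parseEvalLine(String line) {
        String[] pairs = line.split("\t");
        float[] out = new float[pairs.length - 1];
        for (int i = 1; i < pairs.length; i++) {
            String[] kv = pairs[i].split(":");
            out[i - 1] = parse(kv[kv.length - 1]);
        }
        return out;
    }
}
```

A lenient parser like this only stops the crash. An infinite metric still indicates a problem in the data or the model that needs to be tracked down.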
Thanks a lot, but I'm not familiar with the xgboost native code. Can anyone help with this?
Could you reproduce it in a non-JVM environment?
Will try.
@firestarman I've hit the same problem as you. Have you resolved it?
I'm facing the same issue as well! It only occurs when I use count:poisson as the objective, and not necessarily for others. (@firestarman and @trivialfis, maybe this hint is helpful; sorry I'm not able to debug further, as I'm not good with Scala or Java.)
I'm running this on Spark 3.0 with xgboost4j_spark_2_12_1_3_1.jar and xgboost4j_2_12_1_3_1.jar.
21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_4 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_0 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_6 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_2 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_4 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_6 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_0 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_2 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 ERROR Executor: Exception in task 4.0 in stage 8.0 (TID 27)
java.lang.NumberFormatException: For input string: "inf"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
at java.lang.Float.parseFloat(Float.java:451)
at java.lang.Float.valueOf(Float.java:416)
at ml.dmlc.xgboost4j.java.Booster.evalSet(Booster.java:251)
at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:215)
at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:284)
at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
at scala.Option.getOrElse(Option.scala:189)
at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:416)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainForNonRanking$1(XGBoost.scala:499)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:844)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:844)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:369)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1376)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1303)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1367)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1187)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:318)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
at org.apache.spark.scheduler.Task.run(Task.scala:117)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:640)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I am facing the same issue as above with the objective "count:poisson", but it works fine for the objective function. I am using a Databricks environment and importing xgboost with import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressor}.
Can anybody help?
java.lang.NumberFormatException: For input string: "inf"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
at java.lang.Float.parseFloat(Float.java:451)
at java.lang.Float.valueOf(Float.java:416)
at ml.dmlc.xgboost4j.java.Booster.evalSet(Booster.java:243)
at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:231)
at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:304)
at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
at scala.Option.getOrElse(Option.scala:189)
at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:416)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainForNonRanking$1(XGBoost.scala:499)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:868)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:868)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:393)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1486)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1413)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1477)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1296)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:391)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:342)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:93)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'm facing the same issue using count:poisson.
In my case, using a CustomEval makes training succeed.
(I use a CustomEval because I'd like poisson-nloglik evaluation, but it cannot be used.)
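The stack traces above fail while `Booster.evalSet` parses the metric string, so a custom evaluation that computes the metric on the JVM side presumably never goes through that parsing step. As a rough sketch of what such a metric could compute, here is a standalone mean Poisson negative log-likelihood. The class `PoissonNll` and its methods are hypothetical, the inputs are assumed to be predicted means (not raw margins) and integer-valued counts, and wiring it into xgboost4j's custom evaluation interface is not shown.

```java
/** Hypothetical standalone helper for illustration; not part of xgboost4j. */
final class PoissonNll {
    private PoissonNll() {}

    // Lower bound on the predicted mean so that log(mu) stays finite.
    private static final double EPS = 1e-16;

    /** log(y!) for a non-negative integer count, via a running sum of logs. */
    static double logFactorial(long y) {
        double acc = 0.0;
        for (long k = 2; k <= y; k++) {
            acc += Math.log(k);
        }
        return acc;
    }

    /**
     * Mean Poisson negative log-likelihood:
     * log(y!) + mu - y * log(mu), averaged over all rows.
     * Assumes predictedMeans holds predicted means and labels holds counts.
     */
    static float meanNll(float[] predictedMeans, float[] labels) {
        if (predictedMeans.length != labels.length || labels.length == 0) {
            throw new IllegalArgumentException("inputs must be non-empty and equal in length");
        }
        double sum = 0.0;
        for (int i = 0; i < labels.length; i++) {
            double mu = Math.max(predictedMeans[i], EPS);
            long y = Math.round(labels[i]);
            sum += logFactorial(y) + mu - y * Math.log(mu);
        }
        return (float) (sum / labels.length);
    }
}
```

Accumulating in double and clamping the predicted mean keeps the result finite for finite, non-negative inputs; whether that matches the built-in poisson-nloglik metric exactly is not something this sketch claims.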
I am training a GeneralizedLinearRegression model on Spark. I want to print the summary of the model with model.summary. It gave an error, and I could not find any helpful resources online. Here is the error message. Any suggestions on how I can solve it? Thank you, Basazin
java.lang.NumberFormatException
Py4JJavaError                             Traceback (most recent call last)
/databricks/spark/python/pyspark/ml/regression.py in __repr__(self)
   2472
   2473     def __repr__(self):
-> 2474         return self._call_java("toString")
   2475
   2476
/databricks/spark/python/pyspark/ml/wrapper.py in _call_java(self, name, *args)
     52         sc = SparkContext._active_spark_context
     53         java_args = [_py2java(sc, arg) for arg in args]
---> 54         return _java2py(sc, m(*java_args))
     55
     56     @staticmethod
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o10522.toString.
: java.lang.NumberFormatException
at java.math.BigDecimal.