
[Spark 3.5.0] Rabit Tracker Connection Failure During Distributed XGBoost Training

Open lcx517 opened this issue 9 months ago • 11 comments

Environment Details

  • XGBoost Version: tried 2.1.0, 2.1.1, 2.1.3, 2.1.4
  • Spark: 3.5.0 (Cluster Mode: YARN)
  • Scala: 2.12.18
  • Java: OpenJDK 8
  • Cluster: YARN/Hadoop 3.2.2

Background

Our pipeline ran successfully with Spark 3.1.1 + XGBoost 1.1.1 in production. After upgrading to Spark 3.5.0, we tested multiple XGBoost versions (2.1.0-2.1.4) and consistently encountered the same Rabit tracker connection error during distributed training.

Error Description

Failure occurs when initializing distributed training:

ERROR XGBoostSpark: the job was aborted due to ml.dmlc.xgboost4j.java.XGBoostError: [12:58:58] /workspace/src/collective/result.cc:78:

  • [tracker.cc:286|12:58:58]: Failed to accept connection.
  • [socket.h:89|12:58:58]: Invalid polling request.

Full stack trace shows the error originates from RabitTracker.stop() after connection rejection.

Reproduction Steps

  1. Code:
        val assembler = new VectorAssembler()
          .setInputCols(Array("f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9", "f10", "f11", "f12", "f13", "f14", "f15", "f16", "f17", "f18", "f19", "f20", "f21", "f22", "f23"))
          .setOutputCol("features")
        val labelIndexer = new StringIndexer()
          .setInputCol("y")
          .setOutputCol("indexedLabel")
          .setHandleInvalid("skip")
          .fit(training)
        val booster = new XGBoostClassifier(
            Map(
                "eta" -> 0.1f,
                "max_depth" -> 5,
                "objective" -> "multi:softprob",
                "num_class" -> 2,
                "device" -> "cpu"
            )
        ).setNumRound(10).setNumWorkers(2)
        booster.setFeaturesCol("features")
        booster.setLabelCol("indexedLabel")

        val converter = new IndexToString()
          .setInputCol("prediction")
          .setOutputCol("convertedPrediction")
          .setLabels(labelIndexer.labelsArray(0))

        val pipeline = new Pipeline()
          .setStages(Array(assembler, labelIndexer, booster, converter))
        println("ready to train...")
        val model: PipelineModel = pipeline.fit(training)      // stopped here
  2. Submit Command:

spark-submit --master yarn --deploy-mode cluster ...

Attempted Fixes

✅ Verified compatibility between Spark 3.5.0 and XGBoost 2.1.x
✅ Tested all minor versions of the XGBoost 2.1.x series
❌ Adjusting tracker ports (tracker_conf) had no effect
❌ Increasing the timeout (timeout parameter) did not help

Key Questions

  1. Is this a known issue with Spark 3.5.0’s network layer and XGBoost 2.1.x?
  2. Are there specific configurations required for XGBoost 2.1.x + Spark 3.5.0?
  3. Should we downgrade to Spark 3.1.x or wait for an XGBoost patch?


attaching log:

25/03/31 12:58:58 ERROR RabitTracker: ml.dmlc.xgboost4j.java.XGBoostError: [12:58:58] /workspace/src/collective/result.cc:78:

  • [tracker.cc:286|12:58:58]: Failed to accept connection.
  • [socket.h:89|12:58:58]: Invalid polling request. Stack trace: [bt] (0) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f9111d241ee] [bt] (1) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(xgboost::collective::SafeColl(xgboost::collective::Result const&)+0x7e) [0x7f9111db9f7e] [bt] (2) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(+0x2b435c) [0x7f9111d5235c] [bt] (3) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(XGTrackerWaitFor+0x1ba) [0x7f9111d5384a] [bt] (4) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_TrackerWaitFor+0x196) [0x7f911244e856] [bt] (5) [0x7f91450186c7]

25/03/31 12:58:58 ERROR XGBoostSpark: the job was aborted due to ml.dmlc.xgboost4j.java.XGBoostError: [12:58:58] /workspace/src/collective/result.cc:78:

  • [tracker.cc:286|12:58:58]: Failed to accept connection.

  • [socket.h:89|12:58:58]: Invalid polling request. Stack trace: [bt] (0) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f9111d241ee] [bt] (1) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(xgboost::collective::SafeColl(xgboost::collective::Result const&)+0x7e) [0x7f9111db9f7e] [bt] (2) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(+0x2b435c) [0x7f9111d5235c] [bt] (3) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(XGTrackerFree+0x15d) [0x7f9111d529bd] [bt] (4) [0x7f91450186c7]

    at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48) at ml.dmlc.xgboost4j.java.RabitTracker.stop(RabitTracker.java:84) at ml.dmlc.xgboost4j.scala.spark.XGBoost$.withTracker(XGBoost.scala:467) at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:501) at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:210) at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:34) at org.apache.spark.ml.Predictor.fit(Predictor.scala:114) at org.apache.spark.ml.Predictor.fit(Predictor.scala:78) at org.apache.spark.ml.Pipeline.$anonfun$fit$5(Pipeline.scala:151) at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130) at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123) at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42) at org.apache.spark.ml.Pipeline.$anonfun$fit$4(Pipeline.scala:151) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.ml.Pipeline.$anonfun$fit$2(Pipeline.scala:147) at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130) at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123) at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42) at org.apache.spark.ml.Pipeline.$anonfun$fit$1(Pipeline.scala:133) at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133) at Test$.main(Test.scala:59) at Test.main(Test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:738)

lcx517 avatar Mar 31 '25 07:03 lcx517

We have been running tests with Spark 3.5 but haven't observed a similar error yet. The errors come from polling a UNIX TCP socket. I can't guess the cause from the available information. Is there a way we can reproduce your networking environment?

trivialfis avatar Mar 31 '25 12:03 trivialfis

Or, can you test on your environment using the latest xgboost4j 3.0 from https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/list.html?prefix=release/ml/dmlc/xgboost4j-spark_2.12/3.0.0/

wbo4958 avatar Mar 31 '25 12:03 wbo4958

During debugging a Spark-XGBoost pipeline, I encountered a version-specific warning that never appeared in previous environments:

Warning Log:

WARN DAGScheduler: Creating new stage failed due to exception - job: 2 org.apache.spark.scheduler.BarrierJobRunWithDynamicAllocationException: [SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for now. You can disable dynamic resource allocation by setting Spark conf "spark.dynamicAllocation.enabled" to "false".

Resolution: Disabling dynamic resource allocation resolved this conflict:

val sparkSession = SparkSession
          .builder()
          .appName("xgboostTest")
          .enableHiveSupport()
          .config("spark.dynamicAllocation.enabled", "false")
          .getOrCreate()

Subsequently, a stricter data validation error emerged during training:

Error Log:

ERROR DataBatch: java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format. If you didn't use Spark's VectorAssembler class to build your feature vector but instead did so in a way that preserves zeros in your feature vector you can avoid this check by using the 'allow_non_zero_for_missing parameter' (only use if you know what you are doing)

Resolution: Bypassing the validation check enabled successful training: Map("allow_non_zero_for_missing" -> true)
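A sketch of the workaround in context (plain Scala; the classifier line is commented out since it needs the xgboost4j-spark dependency, and the surrounding parameters follow the reproduction code above):

```scala
// Booster parameters from the reproduction code, plus the bypass flag.
val params: Map[String, Any] = Map(
  "eta" -> 0.1f,
  "max_depth" -> 5,
  "objective" -> "multi:softprob",
  "num_class" -> 2,
  "device" -> "cpu",
  // Only safe when zeros in the feature vectors are real values,
  // not stand-ins for "missing".
  "allow_non_zero_for_missing" -> true
)
// val booster = new XGBoostClassifier(params).setNumRound(10).setNumWorkers(2)
println(params("allow_non_zero_for_missing"))   // true
```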

Btw, I confirm my training set uses dense vectors (DenseVector) and contains no NaN values, yet allow_non_zero_for_missing must still be set.

lcx517 avatar Apr 02 '25 07:04 lcx517

Btw, I confirm my training set uses dense vectors (DenseVector) and contains no NaN values, yet allow_non_zero_for_missing must still be set.

@wbo4958 could you please help take a look when you are available?

trivialfis avatar Apr 07 '25 18:04 trivialfis

Hi @lcx517, Looks like you're not using the latest xgboost, since allow_non_zero_for_missing has been totally removed.

wbo4958 avatar Apr 07 '25 23:04 wbo4958

Hi @lcx517, Looks like you're not using the latest xgboost, since allow_non_zero_for_missing has been totally removed.

Understood. I'm using the latest 2.x version, 2.1.4, under Java 1.8. Since my cluster environment cannot support JRE 54+, I'm unable to validate the allow_non_zero_for_missing parameter in 3.x.

lcx517 avatar Apr 08 '25 02:04 lcx517

Looks like XGBoost 3.x can still run with Java 8.

wbo4958 avatar Apr 09 '25 02:04 wbo4958

Looks like XGBoost 3.x can still run with Java 8.

Sorry, I remembered incorrectly: it's other third-party libraries in 3.x that require Java 11.
XGBoost 3.0 can indeed run under Java 8. With the allow_non_zero_for_missing parameter no longer available, execution throws an error that does not occur in 2.1.4:

ERROR DataBatch: java.lang.IllegalArgumentException: requirement failed: indices and values must have the same number of elements 25/04/11 16:13:49 WARN BlockManager: Putting block rdd_87_1 failed due to exception ml.dmlc.xgboost4j.java.XGBoostError: [16:13:49] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:186: [16:13:49] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:89: Check failed: jenv->ExceptionOccurred(): Stack trace: [bt] (0) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f88b144d52e] [bt] (1) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(XGBoost4jCallbackDataIterNext+0xb5b) [0x7f88b14465cb] [bt] (2) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(xgboost::data::IteratorAdapter<void*, int (void*, int ()(void, XGBoostBatchCSR), void*), XGBoostBatchCSR>::Next()+0x1b) [0x7f88b173343b] [bt] (3) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter<void*, int (void*, int ()(void, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int ()(void, XGBoostBatchCSR), void*), XGBoostBatchCSR>, float, int, xgboost::DataSplitMode)+0x2c3) [0x7f88b18aad13] [bt] (4) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(xgboost::DMatrix 
xgboost::DMatrix::Create<xgboost::data::IteratorAdapter<void*, int (void*, int ()(void, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int ()(void, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int, std::string const&, xgboost::DataSplitMode)+0x3e) [0x7f88b18151ce] [bt] (5) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(XGDMatrixCreateFromDataIter+0x145) [0x7f88b145e155] [bt] (6) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x229) [0x7f88b14427c9] [bt] (7) [0x7f88d0be3427]

Stack trace: [bt] (0) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f88b144d52e] [bt] (1) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(+0x19f1db) [0x7f88b140d1db] [bt] (2) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(xgboost::data::IteratorAdapter<void*, int (void*, int ()(void, XGBoostBatchCSR), void*), XGBoostBatchCSR>::Next()+0x1b) [0x7f88b173343b] [bt] (3) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter<void*, int (void*, int ()(void, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int ()(void, XGBoostBatchCSR), void*), XGBoostBatchCSR>, float, int, xgboost::DataSplitMode)+0x2c3) [0x7f88b18aad13] [bt] (4) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(xgboost::DMatrix xgboost::DMatrix::Create<xgboost::data::IteratorAdapter<void*, int (void*, int ()(void, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int ()(void, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int, std::string const&, xgboost::DataSplitMode)+0x3e) [0x7f88b18151ce] [bt] (5) 
/data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(XGDMatrixCreateFromDataIter+0x145) [0x7f88b145e155] [bt] (6) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_114198086/container_e47_1710209318993_114198086_01_000004/tmp/libxgboost4j1617439200711600424.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x229) [0x7f88b14427c9] [bt] (7) [0x7f88d0be3427]

lcx517 avatar Apr 11 '25 09:04 lcx517

"ERROR DataBatch: java.lang.IllegalArgumentException: requirement failed: indices and values must have the same number of elements": it looks like something is going wrong when converting the data from Spark into the XGBoost DataBatch.
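For context, that requirement is the invariant a Spark SparseVector enforces: its indices and values arrays must have the same length, so a converter that drops or duplicates an entry on one side trips exactly this exception. A minimal plain-Scala illustration (the `mkSparse` helper is hypothetical, standing in for `org.apache.spark.ml.linalg.SparseVector`, which performs the same check):

```scala
// Hypothetical stand-in for SparseVector construction, with no Spark dependency.
def mkSparse(size: Int, indices: Array[Int], values: Array[Double]): (Int, Array[Int], Array[Double]) = {
  require(indices.length == values.length,
    "requirement failed: indices and values must have the same number of elements")
  (size, indices, values)
}

// Well-formed: 3 active entries out of 10 features.
val ok = mkSparse(10, Array(1, 4, 7), Array(0.5, 1.0, 2.0))

// Malformed: 3 indices but only 2 values throws IllegalArgumentException,
// matching the error seen during the Spark-to-DataBatch conversion.
val failed =
  try { mkSparse(10, Array(1, 4, 7), Array(0.5, 1.0)); false }
  catch { case _: IllegalArgumentException => true }
```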

Could you share the dataset you are using, or a synthetic one, so we can triage it? Thanks very much @lcx517

wbo4958 avatar Apr 12 '25 02:04 wbo4958

"ERROR DataBatch: java.lang.IllegalArgumentException: requirement failed: indices and values must have the same number of elements", looks like something wrong about converting the data from spark into xgboost databatch.

Could you help share the dataset you are using or a synthetic dataset so we can triage it? Thx very much @lcx517

demo_creditcard.csv

test.scala.zip

Thank you very much

lcx517 avatar Apr 14 '25 07:04 lcx517

Hi everyone,

I’m using sparse vectors with about 10 features out of a possible 50 million. However, the conversion to dense vectors is causing heap exhaustion. Is there a way to disable the sparse-to-dense conversion?

Right now, I can’t even train on a small batch of vectors without running into memory issues — but I ultimately need to train on 200 million rows.

Any help would be greatly appreciated. I'm using version 3.0.0 with the Java package.

Thanks!

stepanov1997 avatar May 18 '25 00:05 stepanov1997