
Revamp the rabit implementation.

Open trivialfis opened this issue 11 months ago • 6 comments

motivation

With the networking module growing more complex due to support for vertical and horizontal federated learning, as well as GPU-based training at scale, the existing rabit module is no longer sufficient. We have multiple feature dimensions to support:

  • data split.
  • federated learning.
  • GPU acceleration.

Among the above features, GPU acceleration and federated learning require loading optional external libraries. Lastly, we are trying to support resilience.
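These feature dimensions effectively form a backend-selection matrix, with some backends loaded as optional plugins. A minimal sketch of how a communicator backend might be chosen; the names `Backend` and `select_backend` are illustrative, not XGBoost's actual API:

```python
from enum import Enum


class Backend(Enum):
    """Hypothetical communicator backends (names are illustrative)."""
    RABIT = "rabit"          # in-tree CPU collective
    FEDERATED = "federated"  # optional plugin, loaded on demand
    NCCL = "nccl"            # optional GPU plugin, loaded on demand


def select_backend(federated: bool, use_gpu: bool) -> Backend:
    """Pick a backend from the feature dimensions listed above."""
    if federated:
        # Federated learning handles both CPU and GPU internally.
        return Backend.FEDERATED
    if use_gpu:
        return Backend.NCCL
    return Backend.RABIT
```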

features

This PR replaces the original RABIT implementation with a new one, parts of which have already been merged into XGBoost. The new implementation features:

  • Federated learning for both CPU and GPU.
  • NCCL.
  • More data types.
  • A unified interface for all the underlying implementations.
  • Improved timeout handling for both tracker and workers.
  • Exhaustive tests with metrics (fixed a couple of bugs along the way).
  • A reusable tracker for Python and JVM packages.
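To illustrate what a unified interface over multiple underlying implementations can look like, here is a sketch in Python; the class and method names are hypothetical and do not mirror the actual C++ interface:

```python
from abc import ABC, abstractmethod
from typing import List


class Communicator(ABC):
    """Illustrative unified interface; method names are hypothetical."""

    @abstractmethod
    def allreduce(self, values: List[float]) -> List[float]:
        """Element-wise sum across all workers."""

    @abstractmethod
    def rank(self) -> int:
        """This worker's rank in the group."""


class SingleNodeComm(Communicator):
    """Trivial single-process implementation, useful for testing."""

    def allreduce(self, values: List[float]) -> List[float]:
        # With only one worker there is nothing to reduce across.
        return list(values)

    def rank(self) -> int:
        return 0
```

The point of the abstraction is that callers depend only on `Communicator`, so CPU, NCCL, and federated implementations can be swapped without touching training code.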

todos:

  • [x] JVM
  • [x] Standardize the naming of worker parameters and tracker methods.
  • [x] Worker sortby.

work in progress

Retry support is still in progress. This is to provide essential support for handling exceptions (e.g., a network error or an OOM). Segfault handling requires additional cooperation from the distributed framework and is out of scope for this work.
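A generic sketch of the kind of retry logic described here, using exponential backoff with jitter for transient network errors; this is not XGBoost's implementation, and the helper name is made up:

```python
import random
import time


def with_retry(op, retries=3, base_delay=0.1,
               retryable=(ConnectionError, OSError)):
    """Run `op`, retrying transient failures with exponential backoff.

    Non-retryable exceptions (e.g. a segfault surfacing as a crash)
    cannot be handled at this level, matching the note above.
    """
    for attempt in range(retries + 1):
        try:
            return op()
        except retryable:
            if attempt == retries:
                raise  # retries exhausted; propagate to the caller
            # Back off exponentially, with jitter to avoid thundering herd.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```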

note for review

  • Breaking changes are made to the tracker in Python and JVM interfaces.
  • Communicator parameters are reworked; for the full list of parameters, see the documentation in the C header.
  • Federated tracker requires an additional n_workers parameter.
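To illustrate the `n_workers` requirement, a hypothetical helper that assembles federated tracker arguments; apart from `n_workers`, every key name here is a placeholder (the C header documents the real parameters):

```python
def make_federated_tracker_args(n_workers: int, **overrides) -> dict:
    """Build a config dict for a federated tracker (illustrative only).

    `n_workers` is required, per the review notes; `timeout` and any
    other keys are placeholder names, not actual XGBoost parameters.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    args = {"n_workers": n_workers, "timeout": 300}
    args.update(overrides)  # caller-supplied values win
    return args
```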

trivialfis · Mar 11 '24

@rongou Is using `rank{}` as the process name instead of the host name a deliberate choice for federated learning?

trivialfis · Mar 12 '24

> @rongou Is using `rank{}` as the process name instead of the host name a deliberate choice for federated learning?

Yes. In a federated setting, a participant may not want to expose its host name to the rest of the group.

rongou · Mar 12 '24

Tracking https://github.com/apache/arrow/issues/41058. Need to remove the workaround in CI once a new snappy release is published.

trivialfis · Apr 08 '24

@wbo4958 Please help review the changes to the JVM packages when you are available.

trivialfis · Apr 28 '24

@rongou The PR is ready for an initial review. I can't extract any more small PRs, since the global communicator is swapped out here.

Need to investigate the JVM dependency and the flaky error.

trivialfis · Apr 29 '24

@rongou Hi, the PR is ready for review. It's quite large; if you'd prefer an online review, please let me know.

trivialfis · May 09 '24