xgboost
Revamp the rabit implementation.
motivation
With the networking module growing more complex due to support for vertical and horizontal federated learning and for GPU-based training at scale, the existing rabit module is no longer sufficient. We have multiple dimensions of features to support:
- data split.
- federated learning.
- GPU acceleration.
Of these, GPU acceleration and federated learning require loading optional external libraries. Lastly, we are working toward supporting resilience.
features
This PR replaces the original RABIT implementation with a new one, which has already been partially merged into XGBoost. The new implementation features:
- Federated learning for both CPU and GPU.
- NCCL.
- More data types.
- A unified interface for all the underlying implementations.
- Improved timeout handling for both tracker and workers.
- Exhaustive tests with metrics (a couple of bugs were fixed along the way).
- A reusable tracker for Python and JVM packages.
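To illustrate what a unified interface over multiple backends can look like, here is a minimal sketch. The class names and the in-process backend are hypothetical, invented for this example; they are not the actual XGBoost API, where the concrete backends would be rabit, NCCL, and the federated plugin.

```python
from abc import ABC, abstractmethod


class Communicator(ABC):
    """Hypothetical unified interface; each backend (rabit, NCCL,
    federated) would implement the same set of collective operations."""

    @abstractmethod
    def allreduce(self, values, op="sum"):
        """Reduce `values` across all workers and return the result."""


class InProcessCommunicator(Communicator):
    """Trivial single-worker backend, used here only for demonstration."""

    def allreduce(self, values, op="sum"):
        if op == "sum":
            # With a single worker the reduction is the identity.
            return list(values)
        raise ValueError(f"unsupported op: {op}")


comm: Communicator = InProcessCommunicator()
result = comm.allreduce([1.0, 2.0, 3.0])
```

Callers program against `Communicator` only, so swapping the backend (e.g., CPU vs. NCCL) does not change training code.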
todos:
- [x] JVM
- [x] Standardize the naming of worker parameters and tracker methods.
- [x] Worker sortby.
work in progress
Retry is still in progress. This is to provide essential support for handling exceptions (e.g., a network error or an OOM). Segfault handling has to be done with additional cooperation from the distributed framework and is out of scope for this work.
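The retry behavior under development could look roughly like the following sketch. The helper and its parameters are illustrative assumptions, not the actual implementation, which also has to cooperate with the tracker's timeout handling.

```python
import time


def with_retry(fn, retries=3, backoff=0.1, retriable=(ConnectionError, OSError)):
    """Call fn, retrying transient errors with exponential backoff.

    Illustrative only: a real worker would also distinguish fatal errors
    (e.g., a segfault in a peer) from transient network failures.
    """
    for attempt in range(retries):
        try:
            return fn()
        except retriable:
            if attempt == retries - 1:
                raise  # out of attempts; surface the error
            time.sleep(backoff * 2 ** attempt)


calls = {"n": 0}


def flaky():
    # Fails twice with a transient error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "ok"


result = with_retry(flaky)  # succeeds on the third attempt
```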
note for review
- Breaking changes are made to the tracker in Python and JVM interfaces.
- Communicator parameters are reworked. For the full list of parameters, see the documentation in the C header.
- The federated tracker requires an additional `n_workers` parameter.
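To make the new requirement concrete, a minimal sketch follows. The helper name and the remaining argument names are hypothetical, not the actual XGBoost API; the point is only that a federated tracker must be told the group size up front, since participants cannot be discovered automatically.

```python
def make_federated_tracker_args(n_workers, port=9091):
    """Hypothetical helper building federated-tracker arguments.

    n_workers is mandatory in a federated setting: the tracker must
    know how many participants to wait for before starting.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return {"n_workers": n_workers, "port": port}


args = make_federated_tracker_args(n_workers=3)
```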
@rongou Is using `rank{}` as the process name instead of the host name a deliberate choice for federated learning?
Yes, in a federated setting, a participant may not want to expose the host name to the rest of the group.
Tracking https://github.com/apache/arrow/issues/41058 Need to remove the workaround in CI once a new snappy is published.
@wbo4958 Please help look into the changes to JVM packages when you are available.
@rongou The PR is ready for an initial review. I can't extract any more small PRs, since the global communicator is swapped out in this change.
Need to investigate the JVM dependency and the flaky error.
@rongou Hi, the PR is ready for review. It's quite large; if you need an online review, please let me know.