[WIP] Common interface for collective communication
Part of #7778.
Another benefit: NCCL sometimes has issues with a user's network setup, and the resulting error ("unhandled system error") is hard to interpret and is only raised during training. I want a Python interface for GPU communication, unified with rabit, that we can easily test and debug.
Yes, I do plan to hook this up with a Python interface. Do you think we should keep the existing rabit API as is and just swap out the implementation? Or deprecate rabit and introduce a new interface, say, collective?
I think we can just do a swap. But since you have been working on it, you might have a better plan.
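The "swap" option discussed above can be illustrated with a minimal sketch: callers keep using one stable function while the backend behind it is replaced. All names here (`Communicator`, `SingleProcessCommunicator`, `set_backend`, `allreduce_sum`) are hypothetical and chosen for illustration only; they are not the actual rabit or XGBoost API.

```python
from abc import ABC, abstractmethod
from typing import List


class Communicator(ABC):
    """Hypothetical backend interface; not the actual XGBoost class."""

    @abstractmethod
    def allreduce_sum(self, values: List[float]) -> List[float]:
        """Return the element-wise sum of `values` across all workers."""


class SingleProcessCommunicator(Communicator):
    """Trivial backend for a world of size 1: the sum is the input itself."""

    def allreduce_sum(self, values: List[float]) -> List[float]:
        return list(values)


# Module-level backend that the caller-facing function delegates to.
_backend: Communicator = SingleProcessCommunicator()


def set_backend(backend: Communicator) -> None:
    """Swap the implementation without touching any caller."""
    global _backend
    _backend = backend


def allreduce_sum(values: List[float]) -> List[float]:
    """Stable caller-facing function; only the backend behind it changes."""
    return _backend.allreduce_sum(values)
```

Under this assumption, existing call sites stay unchanged, and a test can install a fake backend through `set_backend` to exercise error paths without a real network.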
@trivialfis @RAMitchell I finally got the CI to pass. This is more or less the complete new unified API. I haven't changed the callers yet. Would you mind taking a quick look to see if it makes sense at a high level? Thanks!
Thank you for the great work here! I will dive into the PR tomorrow.
@trivialfis I switched to using JSON to configure the communicators. Please take another look. Thanks!
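As a rough illustration of JSON-based configuration, the sketch below parses a JSON string into a validated settings object. The key names (`kind`, `world_size`, `rank`) and the `CommunicatorConfig` type are assumptions made for this example; the thread does not state the actual schema used in the PR.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CommunicatorConfig:
    """Hypothetical settings object; field names are illustrative only."""
    kind: str
    world_size: int
    rank: int


def parse_config(text: str) -> CommunicatorConfig:
    """Parse a JSON string and validate the rank against the world size."""
    raw = json.loads(text)
    cfg = CommunicatorConfig(
        kind=str(raw.get("kind", "in_memory")),      # assumed default
        world_size=int(raw.get("world_size", 1)),
        rank=int(raw.get("rank", 0)),
    )
    if not 0 <= cfg.rank < cfg.world_size:
        raise ValueError(
            f"rank {cfg.rank} is outside [0, {cfg.world_size})"
        )
    return cfg
```

Validating at parse time is one way to surface a misconfiguration before training starts rather than in the middle of it.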
Next I'll try to switch to the dmlc registry.
I don't have a strong preference for this. Feel free to create your own factory methods if needed. I just wanted to avoid code duplication due to GPU support.
@trivialfis got rid of the factories. Please take another look. Should we try to merge this? So far this is a non-breaking change. Once I start modifying the callers, it'd probably become a breaking change.
@trivialfis can this be merged? Is there anything else you want me to change? Thanks!
The main goal is to include federated learning in a release so users can just `pip install xgboost` to get the functionality. A few steps remain:
- Add the JNI wrapper for the communicator API
- Change all the callers to use communicator instead of rabit
- Change CI to build with federated learning enabled by default
- Release with federated learning included
Once this is done, we can probably do more with Rabit (rewrite it, swap it out) and unify the host/device APIs.
For federated learning, we may look at things like homomorphic encryption, differential privacy, and vertical federated learning (features are split between participants).
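To make "features are split between participants" concrete, here is a minimal sketch of a vertical split: every participant sees all rows but only a subset of the columns. The `split_features` helper and the round-robin column assignment are assumptions for illustration, not something the plugin is stated to do.

```python
from typing import List

Matrix = List[List[float]]


def split_features(rows: Matrix, n_parties: int) -> List[Matrix]:
    """Give each party all rows but only every n_parties-th column."""
    if n_parties < 1:
        raise ValueError("n_parties must be at least 1")
    n_features = len(rows[0]) if rows else 0
    parts: List[Matrix] = []
    for party in range(n_parties):
        # Round-robin assignment: party p owns columns p, p + n_parties, ...
        cols = range(party, n_features, n_parties)
        parts.append([[row[c] for c in cols] for row in rows])
    return parts
```

With two parties and three features, the first party would hold columns 0 and 2 and the second would hold column 1. This contrasts with the horizontal setting, where each participant holds complete rows for a subset of the samples.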
Restarted jenkins linux.
Do we have homomorphic encryption or differential privacy in the plugin now?