xgboost [WIP] Common interface for collective communication

Part of #7778.

Jul 08 '22 16:07 rongou

Another benefit: NCCL sometimes has issues with a user's network setup, and the resulting error is difficult to interpret ("unhandled system error") and is raised only during training. I'd like a Python interface for GPU communication, unified with rabit, that we can easily test and debug.

Jul 17 '22 02:07 trivialfis

Yes, I do plan to hook this up with a Python interface. Do you think we should keep the existing rabit API as is and just swap out the implementation? Or deprecate rabit and introduce a new interface, say, collective?

Jul 18 '22 05:07 rongou

I think we can just do a swap. But since you have been working on it, you might have a better plan.

Jul 18 '22 14:07 trivialfis
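
To make the "swap" option discussed above concrete, here is a minimal sketch of what keeping rabit-style entry points while replacing the backend could look like. Every name in it (`Communicator`, `SingleProcessCommunicator`, `set_backend`, `allreduce_sum`) is hypothetical and chosen for illustration; it is not the API from this PR or from rabit.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class Communicator(ABC):
    """Hypothetical backend interface (an assumption, not xgboost's class)."""

    @abstractmethod
    def get_rank(self) -> int:
        ...

    @abstractmethod
    def get_world_size(self) -> int:
        ...

    @abstractmethod
    def allreduce_sum(self, values: Sequence[float]) -> List[float]:
        ...


class SingleProcessCommunicator(Communicator):
    """Trivial backend for one worker: a sum over one worker is the input."""

    def get_rank(self) -> int:
        return 0

    def get_world_size(self) -> int:
        return 1

    def allreduce_sum(self, values: Sequence[float]) -> List[float]:
        return list(values)


# Module-level backend; callers never touch it directly.
_backend: Communicator = SingleProcessCommunicator()


def set_backend(backend: Communicator) -> None:
    """Swap the implementation without changing the caller-facing functions."""
    global _backend
    _backend = backend


# Caller-facing functions keep their signatures; only the backend changes.
def get_rank() -> int:
    return _backend.get_rank()


def get_world_size() -> int:
    return _backend.get_world_size()


def allreduce_sum(values: Sequence[float]) -> List[float]:
    return _backend.allreduce_sum(values)
```

With this shape, existing callers would keep calling `get_rank()` or `allreduce_sum(...)` unchanged, and a test could install a fake backend through `set_backend` to exercise communication paths without a real cluster.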

@trivialfis @RAMitchell finally got the CI to pass. This is more or less the complete new unified api. I haven't changed the callers yet. Would you mind taking a quick look to see if it makes sense at a high level? Thanks!

Aug 17 '22 21:08 rongou

Thank you for the great work here! I will dive into the PR tomorrow.

Aug 18 '22 17:08 trivialfis

@trivialfis switched to use JSON to configure the communicators. Please take another look. Thanks!

Next I'll try to switch to the dmlc registry.

Aug 30 '22 23:08 rongou
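
As a rough illustration of configuring a communicator from JSON, the sketch below parses a JSON string into a small config object and rejects unknown communicator types up front. The key name `communicator`, the set of accepted values, and the `CommunicatorConfig` type are assumptions made for this example; the PR's actual schema is not shown in this thread.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical set of communicator names, for illustration only.
KNOWN_KINDS = {"rabit", "federated", "in_memory"}


@dataclass
class CommunicatorConfig:
    kind: str
    options: Dict[str, Any] = field(default_factory=dict)


def parse_communicator_config(json_config: str) -> CommunicatorConfig:
    """Parse a JSON object into a config, failing early on bad input."""
    raw = json.loads(json_config)
    if not isinstance(raw, dict):
        raise ValueError("communicator config must be a JSON object")
    # Assumed key name; falls back to "rabit" when the key is absent.
    kind = str(raw.pop("communicator", "rabit")).lower()
    if kind not in KNOWN_KINDS:
        raise ValueError(f"unknown communicator: {kind!r}")
    # Everything else is passed through as backend-specific options.
    return CommunicatorConfig(kind=kind, options=raw)
```

One appeal of a single JSON string is that the same entry point can serve several language bindings, and configuration mistakes surface at initialization rather than in the middle of training.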

Next I'll try to switch to the dmlc registry.

I don't have a strong preference for this. Feel free to create your own factory methods if needed. I just wanted to avoid code duplication due to GPU support.

Sep 01 '22 08:09 trivialfis
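
For readers unfamiliar with the trade-off being discussed, a name-based registry can be sketched as follows. This is a generic Python illustration of the pattern, not the dmlc registry itself, and `register`, `create`, and `InMemoryCommunicator` are hypothetical names.

```python
from typing import Callable, Dict

# Maps a communicator name to the callable that constructs it.
_REGISTRY: Dict[str, Callable[..., object]] = {}


def register(name: str):
    """Class decorator that records a constructor under the given name."""
    def decorator(cls):
        if name in _REGISTRY:
            raise KeyError(f"communicator {name!r} is already registered")
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("in_memory")
class InMemoryCommunicator:
    """Hypothetical backend used only to demonstrate registration."""

    def __init__(self, world_size: int = 1):
        self.world_size = world_size


def create(name: str, **kwargs):
    """Look up a constructor by name and build the communicator."""
    try:
        ctor = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"no communicator registered as {name!r}") from None
    return ctor(**kwargs)
```

A registry keeps the lookup in one place, so adding a backend does not mean editing a central `if`/`else` chain. Hand-written factory methods are simpler when only a handful of backends exist.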

@trivialfis got rid of the factories. Please take another look. Should we try to merge this? So far this is a non-breaking change. Once I start modifying the callers, it'd probably become a breaking change.

Sep 02 '22 00:09 rongou

@trivialfis can this be merged? Is there anything else you want me to change? Thanks!

Sep 08 '22 16:09 rongou

The main goal is to include federated learning in a release so users can just pip install xgboost to get the functionality. A few steps left to do:

- Add the JNI wrapper for the communicator API
- Change all the callers to use communicator instead of rabit
- Change CI to build with federated learning enabled by default
- Release with federated learning included

Once this is done, we can probably do more with Rabit (rewrite it or swap it out) and unify the host/device APIs.

For federated learning, we may look at things like homomorphic encryption, differential privacy, and vertical federated learning (features are split between participants).

Sep 09 '22 21:09 rongou

Restarted jenkins linux.

Sep 12 '22 20:09 trivialfis

For federated learning, we may look at things like homomorphic encryption, differential privacy, and vertical federated learning (features are split between participants).

Do we have homomorphic encryption or differential privacy in the plugin now?

Jan 15 '24 07:01 lidh15
