[WIP] Common interface for collective communication

Open rongou opened this issue 3 years ago • 3 comments

Part of #7778.

rongou avatar Jul 08 '22 16:07 rongou

Another benefit: NCCL sometimes has issues with a user's network setup, and the resulting error is difficult to interpret (unhandled system error) and is only raised during training. I want a Python interface for GPU communication, unified with rabit, that we can easily test and debug.
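To make the "easy to test and debug" idea concrete, here is a minimal sketch of what such an interface could look like. Everything in it is an assumption for illustration: `Communicator`, `SingleProcessCommunicator`, and `check_communicator` are hypothetical names, not part of XGBoost, rabit, or NCCL. A real backend would wrap rabit or NCCL behind the same methods.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class Communicator(ABC):
    """Hypothetical unified interface (not the actual XGBoost or rabit API)."""

    @abstractmethod
    def world_size(self) -> int:
        """Number of workers taking part in the collective."""

    @abstractmethod
    def allreduce_sum(self, values: Sequence[float]) -> List[float]:
        """Return the element-wise sum of `values` across all workers."""


class SingleProcessCommunicator(Communicator):
    """Trivial backend for unit tests: one worker, so the sum is the input."""

    def world_size(self) -> int:
        return 1

    def allreduce_sum(self, values: Sequence[float]) -> List[float]:
        return list(values)


def check_communicator(comm: Communicator) -> None:
    """Run a tiny allreduce before training so setup problems surface early."""
    try:
        result = comm.allreduce_sum([1.0])
    except Exception as err:  # a real backend might raise a low-level error here
        raise RuntimeError(
            "collective communication check failed; verify the network setup"
        ) from err
    # Every worker contributes 1.0, so the sum should equal the world size.
    expected = float(comm.world_size())
    if result != [expected]:
        raise RuntimeError(f"allreduce returned {result}, expected {[expected]}")


# Passes silently with the single-process test backend.
check_communicator(SingleProcessCommunicator())
```

The point of the check is that a misconfigured network would fail at startup with a readable message instead of partway through training.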

trivialfis avatar Jul 17 '22 02:07 trivialfis

Yes, I do plan to hook this up with a Python interface. Do you think we should keep the existing rabit API as is and just swap out the implementation? Or deprecate rabit and have a new interface, say, collective?

rongou avatar Jul 18 '22 05:07 rongou

I think we can just do a swap. But since you have been working on it, you might have a better plan.

trivialfis avatar Jul 18 '22 14:07 trivialfis

@trivialfis @RAMitchell finally got the CI to pass. This is more or less the complete new unified API. I haven't changed the callers yet. Would you mind taking a quick look to see if it makes sense at a high level? Thanks!

rongou avatar Aug 17 '22 21:08 rongou

Thank you for the great work here! I will dive into the PR tomorrow.

trivialfis avatar Aug 18 '22 17:08 trivialfis

@trivialfis switched to use JSON to configure the communicators. Please take another look. Thanks!

Next I'll try to switch to the dmlc registry.
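For readers unfamiliar with the idea, configuring a communicator from JSON could look roughly like the sketch below. The key names (`communicator`, `world_size`, `rank`), the accepted values, and the `CommunicatorConfig` class are assumptions chosen for illustration; they are not taken from this PR and may not match the actual implementation.

```python
import json
from dataclasses import dataclass


@dataclass
class CommunicatorConfig:
    kind: str        # e.g. "rabit" or "federated" (illustrative values)
    world_size: int  # total number of workers
    rank: int        # index of this worker, 0-based


def parse_config(text: str) -> CommunicatorConfig:
    """Parse a JSON string into a validated config object."""
    raw = json.loads(text)
    kind = str(raw.get("communicator", "rabit")).lower()
    if kind not in {"rabit", "federated"}:
        raise ValueError(f"unknown communicator: {kind!r}")
    world_size = int(raw.get("world_size", 1))
    rank = int(raw.get("rank", 0))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world size {world_size}")
    return CommunicatorConfig(kind, world_size, rank)


config = parse_config('{"communicator": "federated", "world_size": 2, "rank": 1}')
print(config)
```

A single JSON document like this keeps the choice of backend and its settings in one place, so callers would not need a separate entry point per communicator type.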

rongou avatar Aug 30 '22 23:08 rongou

Next I'll try to switch to the dmlc registry.

I don't have a strong preference for this. Feel free to create your own factory methods if needed. I just wanted to avoid code duplication due to GPU support.

trivialfis avatar Sep 01 '22 08:09 trivialfis

@trivialfis got rid of the factories. Please take another look. Should we try to merge this? So far this is a non-breaking change. Once I start modifying the callers, it'd probably become a breaking change.

rongou avatar Sep 02 '22 00:09 rongou

@trivialfis can this be merged? Is there anything else you want me to change? Thanks!

rongou avatar Sep 08 '22 16:09 rongou

The main goal is to include federated learning in a release so users can just pip install xgboost to get the functionality. A few steps left to do:

  • Add the JNI wrapper for the communicator API
  • Change all the callers to use communicator instead of rabit
  • Change CI to build with federated learning enabled by default
  • Release with federated learning included

Once this is done, we can probably do more with Rabit (rewrite it or swap it out) and unify the host and device APIs.

For federated learning, we may look at things like homomorphic encryption, differential privacy, and vertical federated learning (features are split between participants).
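As a rough illustration of "features are split between participants", the sketch below partitions the columns of a small dataset so that each party sees every row but only its own features. The function name, the party names, and the data are made up for this example and do not reflect any XGBoost API.

```python
from typing import Dict, List


def split_features(
    rows: List[List[float]], owners: Dict[str, List[int]]
) -> Dict[str, List[List[float]]]:
    """Give each participant only the feature columns it owns, for every row."""
    return {
        name: [[row[col] for col in cols] for row in rows]
        for name, cols in owners.items()
    }


# Two samples with three features each (made-up values).
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# party_a holds features 0 and 1; party_b holds feature 2.
parts = split_features(rows, {"party_a": [0, 1], "party_b": [2]})
print(parts)
```

In the horizontal setting, by contrast, each participant would hold all features for a subset of the rows.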

rongou avatar Sep 09 '22 21:09 rongou

Restarted Jenkins Linux.

trivialfis avatar Sep 12 '22 20:09 trivialfis

Do we have homomorphic encryption or differential privacy in the plugin now?

lidh15 avatar Jan 15 '24 07:01 lidh15