
Is privacy-preserving computation included in current version of Federated XGBoost?

zhangjinyiyi opened this issue Aug 07 '20 · 5 comments

I am trying to figure out how federated learning is implemented.

As far as I understand so far, the federated version of XGBoost uses gRPC to establish server/client communication between the host and remote workers. SSL/TLS is implemented in Rabit via the mbedtls library to secure data transfer between the host and the workers.

However, I still haven't quite grasped how federated learning is implemented, e.g.:

  • How is privacy preserved between the host and workers, or among workers, during AllReduce in Rabit?
  • Are MPC, differential privacy, or other privacy-preserving techniques implemented yet?
  • Who is the aggregator? Is the host the aggregator?

I will continue to read the code to get the global picture of the implementation, but it would also be very helpful if someone could shed some light on these questions.

zhangjinyiyi · Aug 07 '20

Hi @zhangjinyiyi,

In our system, the "host" is the aggregator and the workers are the various parties. Each party only communicates with the aggregator -- there's no inter-party communication. This aggregator-party communication consists of 1) summaries of local updates to the global model based on the party's training data, sent by the party to the aggregator; and 2) intermediate (per iteration) global models computed by the aggregator after aggregation of all parties' local updates, sent by the aggregator to each party.
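
The round structure described above can be sketched as follows. This is an illustrative toy, not the actual Federated XGBoost API: the class names (`Party`, `Aggregator`), the fixed bin edges, and the squared-error gradients are all assumptions, and the real system exchanges these messages over gRPC/Rabit rather than in-process.

```python
import numpy as np

class Party:
    """Holds local training data and produces summaries of local updates."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def local_summary(self, global_pred):
        # Per-bin gradient/hessian sums over the local data
        # (squared-error loss: grad = pred - y, hess = 1).
        grad = global_pred - self.y
        hess = np.ones_like(self.y)
        bins = np.digitize(self.X, [0.25, 0.5, 0.75])  # 4 fixed bins (toy)
        g_hist = np.bincount(bins, weights=grad, minlength=4)
        h_hist = np.bincount(bins, weights=hess, minlength=4)
        return g_hist, h_hist

class Aggregator:
    """The host: sums per-party summaries and computes the global update."""
    def aggregate(self, summaries):
        # Summing the per-party histograms is exactly what AllReduce does.
        g = sum(s[0] for s in summaries)
        h = sum(s[1] for s in summaries)
        # Regularized Newton step per bin, as in gradient boosting.
        return -g / (h + 1.0)
```

Because each party only ships histogram sums, the aggregator obtains the same statistics it would get from pooling all the raw data centrally, without ever seeing individual rows.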

We currently do not support secure aggregation or differential privacy. If you want to run XGBoost with a stronger threat model, please check out another of our projects, Secure XGBoost.

chester-leung · Aug 07 '20

Hi @chester-leung ,

As far as I understand, Secure XGBoost is based on TEEs -- encrypted data from each party is uploaded into a secure enclave located on the aggregator, and the model is then trained within the enclave without data leakage. Therefore, the various parties don't participate in the training except by contributing encrypted versions of their original data. Is that right? TEEs are an interesting technology for dealing with data silos, but they are also limited by, e.g., the performance of SGX.

Federated XGBoost can also partially preserve privacy thanks to its aggregated histograms -- if we assume that aggregated information does not leak individual privacy. Since the aggregation happens in Rabit via AllReduce, I am thinking of modifying Rabit to support secure aggregation and differential privacy, or of finding another secure AllReduce framework to replace Rabit in XGBoost. Do you think this is feasible, or do you have other suggestions on how to implement it?
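
One classic way a modified AllReduce could hide individual histograms is pairwise masking, sketched below. This is a toy only: real secure aggregation protocols (e.g. the Bonawitz et al. design) use key agreement to derive the masks, finite-field arithmetic, and dropout handling, none of which is shown here.

```python
import numpy as np

def masked_updates(histograms, seed=0):
    """Each pair of parties (i, j) shares a mask; party i adds it and
    party j subtracts it, so all masks cancel in the aggregator's sum."""
    n = len(histograms)
    dim = len(histograms[0])
    rng = np.random.default_rng(seed)
    # Pairwise masks (in a real protocol, derived from shared secrets).
    masks = {(i, j): rng.normal(size=dim)
             for i in range(n) for j in range(i + 1, n)}
    masked = []
    for i in range(n):
        m = histograms[i].astype(float).copy()
        for j in range(n):
            if i < j:
                m += masks[(i, j)]
            elif j < i:
                m -= masks[(j, i)]
        masked.append(m)
    return masked

def secure_sum(masked):
    # The aggregator sees only masked vectors; their sum still equals
    # the sum of the true histograms because the pairwise masks cancel.
    return np.sum(masked, axis=0)
```

The point is that the aggregator learns the AllReduce result (the global histogram) but each individual masked contribution is statistically hidden.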

zhangjinyiyi · Aug 11 '20

Hi @zhangjinyiyi,

You're correct -- Secure XGBoost leverages TEEs to provide privacy guarantees, and parties transfer their encrypted data to a central location for processing.

To add support for secure aggregation and differential privacy, you'd likely have to modify both Rabit and the XGBoost algorithms themselves, e.g. here. While this would require a fair amount of work, it's definitely feasible.
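
On the differential privacy side, the natural insertion point is where each party contributes its gradient histogram. The sketch below is a hypothetical illustration, not Federated XGBoost code: the function name, the clipping bound `clip`, and the fixed bin count are all assumptions. Clipping each gradient to [-C, C] bounds one record's influence on a bin sum by 2C, which sets the Laplace noise scale.

```python
import numpy as np

def dp_histogram(grad, bins, clip=1.0, epsilon=1.0, n_bins=4, rng=None):
    """Gradient histogram with Laplace noise (epsilon-DP per release),
    assuming each record contributes to exactly one bin."""
    rng = rng or np.random.default_rng()
    g = np.clip(grad, -clip, clip)          # bound per-record influence
    hist = np.bincount(bins, weights=g, minlength=n_bins)
    # Laplace mechanism: noise scale = sensitivity / epsilon = 2C / eps.
    noise = rng.laplace(scale=2.0 * clip / epsilon, size=n_bins)
    return hist + noise
```

A real integration would also need to track the privacy budget across boosting rounds, since each tree consumes part of the total epsilon.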

We're also planning to build an ML-algorithm-agnostic secure aggregation library using TEEs as part of MC2 in the near future. Our hope is to develop the library as a plugin, so that users of different ML libraries doing federated learning can simply add it on top for secure aggregation. Since the aggregation computes only on summaries of updates, not on entire datasets, the current performance limits of TEEs should be less of an issue. We also expect existing TEE offerings to improve soon and increase their memory limits.

If you're interested, we'd be happy to work together on developing this TEE-based secure aggregation library.

chester-leung · Aug 11 '20

Hi @podcastinator,

Thank you again for your informative reply.

I see how to add secure aggregation to Federated XGBoost. It would indeed require a certain amount of work, and one would need a clear understanding of the algorithmic details and code implementation of XGBoost to make the change.

The scope you described for the secure aggregation library across different ML libraries is really fascinating. I am very interested, but I am not sure whether I would be able to contribute to the development, since I am not an expert in TEEs or C++ development. Nevertheless, more details would help me see where I could contribute to your interesting project.

zhangjinyiyi · Aug 13 '20

@zhangjinyiyi, we plan to create a roadmap for the secure aggregation project later this month or early next month. We'll keep you in the loop and update you once we have a more concrete plan. Thanks for your interest!

On another note, do you have plans to contribute secure aggregation to Federated XGBoost? It'd be great to support this feature, and we can provide support along the way for any questions you may have.

chester-leung · Aug 13 '20