feat(framework) Add serverless federated learning support
Issue
Description
This PR introduces serverless federated learning to eliminate the need for a central aggregation server. In large-scale experiments, managing multiple servers became tedious and synchronization across heterogeneous client environments was a significant bottleneck. This feature addresses those challenges by leveraging a shared storage mechanism for model weight communication.
Related issues/PRs
Implements the feature request outlined in #4273.
Proposal
Explanation
This PR adds support for serverless federated learning through the following components:
- `SyncFederatedNode` and `AsyncFederatedNode`: These new node classes manage the communication of model weights to and from a shared storage (e.g., an S3 bucket). They support both synchronous and asynchronous federation strategies.
- For PyTorch, my recommendation is to instrument the training loop with `FederatedNode.update_parameters()`, which applies a federated strategy to update the model weights using other nodes' weights (see the sketch after this list). See `examples/serverless-pytorch` for a complete example.
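To make the intended usage concrete, here is a minimal sketch of an instrumented PyTorch training loop. Only the node classes and `FederatedNode.update_parameters()` come from this PR; the import path, the constructor arguments (`shared_folder`, `strategy`), the exact signature and return value of `update_parameters()`, and the `get_weights`/`set_weights` helpers are illustrative assumptions, not the final API.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

from flwr.server.strategy import FedAvg
# Assumed import path for the new node classes; the actual module may differ.
from flwr.serverless import AsyncFederatedNode

# Toy data standing in for one node's CIFAR-10 partition.
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Linear(32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Assumed constructor: a shared storage location plus the Flower strategy
# that drives aggregation.
node = AsyncFederatedNode(
    shared_folder="s3://my-bucket/experiment-1",
    strategy=FedAvg(),
)

def get_weights(model: nn.Module):
    # Serialize the state dict as a list of numpy arrays.
    return [v.detach().cpu().numpy() for v in model.state_dict().values()]

def set_weights(model: nn.Module, weights) -> None:
    # Load a list of numpy arrays back into the state dict, preserving key order.
    keys = model.state_dict().keys()
    model.load_state_dict({k: torch.tensor(w) for k, w in zip(keys, weights)})

for epoch in range(5):
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

    # After each epoch, publish local weights to shared storage and apply
    # the federated strategy against whatever peer weights are available.
    updated = node.update_parameters(get_weights(model))
    if updated is not None:  # an async node may have nothing new yet (assumed)
        set_weights(model, updated)
```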
The core federated learning logic continues to be driven by existing Flower strategies (e.g., FedAvg). Because nodes exchange weights only through shared storage, there is no central aggregation server to provision, synchronize, or lose, which makes large-scale federated experiments more robust and easier to scale.
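A synchronous node would be constructed the same way. How a synchronous node learns the number of peers to wait for is not specified above, so the `num_nodes` argument in this sketch is purely hypothetical.

```python
from flwr.server.strategy import FedAvg
from flwr.serverless import SyncFederatedNode  # assumed import path

# Purely illustrative: a synchronous node that waits for all peers before
# applying the strategy. `num_nodes` is a hypothetical argument; the real
# constructor may discover peers differently.
node = SyncFederatedNode(
    shared_folder="s3://my-bucket/experiment-1",
    strategy=FedAvg(),
    num_nodes=4,  # hypothetical: block in update_parameters() until 4 nodes report
)
```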
Examples
An introductory example is provided for PyTorch (`examples/serverless-pytorch`), demonstrating serverless federated training on a partitioned CIFAR-10 dataset with artificial skew across partitions.