
Support Serverless orchestration and Async strategies

Open zzsi opened this issue 1 year ago • 22 comments

Describe the type of feature and its functionality.

We have used flwr for a large-scale (hundreds of TB) medical imaging use case. Thank you for this great library, which has made our lives much easier.

When operating thousands of long-running experiments, we faced a few pain points, mainly:

  • Managing multiple servers for individual experiments became tedious, fragile, and unsustainable.
  • Different institutions (clients) have very different data and compute characteristics, which made training speed and duration uneven and unstable, to the point that synchronization became a bottleneck.

To scratch our own itch, we implemented flwr_serverless as a wrapper around flwr that supports both sync and async strategies. It allows federated training to run without a central server performing the aggregation; the core federation functionality is passed through to flwr strategies. We summarized our learnings on public data in this tech report. With the added robustness from serverless + async operation, this implementation addressed our pain points and has allowed us to run large-scale experiments with flwr FL for the past year. We think other teams may also find this feature useful. Feedback and critique are welcome.

PS: I should probably have submitted a feature request a year ago, but better late than never to contribute upstream. I also noticed related work on async FL, which seems increasingly needed in practical deployments.

best, ZZ AT kungfu.ai

Describe step by step what files and adjustments are you planning to include.

We implemented SyncFederatedNode and AsyncFederatedNode to handle communication of model weights through a shared "Folder" (e.g. S3). For tensorflow/keras, we implemented a FlwrFederatedCallback that is easy to plug into the user's training code. This callback holds the federated node, which in turn manages model federation. We haven't implemented a torch integration yet, but it could follow the same pattern (see the sketch after the example below).

Example usage with keras:

# Create an FL node that has a strategy and a shared folder.
from tensorflow import keras  # the model and fit() call below use tf.keras
from flwr.server.strategy import FedAvg  # This is a flwr federated strategy.
from flwr_serverless import AsyncFederatedNode, S3Folder
from flwr_serverless.keras import FlwrFederatedCallback

strategy = FedAvg()
shared_folder = S3Folder(directory="mybucket/experiment1")
node = AsyncFederatedNode(strategy=strategy, shared_folder=shared_folder)

# Create a keras Callback with the FL node.
num_examples_per_epoch = steps_per_epoch * batch_size # number of examples used in each epoch
callback = FlwrFederatedCallback(
    node,
    num_examples_per_epoch=num_examples_per_epoch,
    save_model_before_aggregation=False,
    save_model_after_aggregation=False,
)

# Join federated learning by fitting the model with the federated callback.
model = keras.Model(...)
model.compile(...)
model.fit(dataset, callbacks=[callback])
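
A torch integration is not implemented yet, but it could follow the same pattern. The sketch below is illustrative only: it assumes the node exposes an update_parameters(weights, num_examples=...) method that takes the local weights as a list of numpy arrays and returns the latest aggregated weights (or None when no peer updates are available yet); the actual method name and signature may differ.

# Hypothetical torch sketch (not implemented): push weights to the shared folder
# after each local epoch and pull back the aggregated weights, analogous to the
# keras callback. `node.update_parameters(...)` is an assumed interface.
import torch

def get_weights(model: torch.nn.Module) -> list:
    # Export model weights as numpy arrays, in state_dict order.
    return [t.detach().cpu().numpy() for t in model.state_dict().values()]

def set_weights(model: torch.nn.Module, weights: list) -> None:
    # Load numpy arrays back into the model, preserving state_dict order.
    state_dict = {
        key: torch.tensor(w) for key, w in zip(model.state_dict().keys(), weights)
    }
    model.load_state_dict(state_dict)

def train_with_federation(model, train_loader, optimizer, loss_fn, node, epochs):
    for _ in range(epochs):
        num_examples = 0
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            num_examples += len(x)
        # After each local epoch: share local weights, receive aggregated ones.
        aggregated = node.update_parameters(get_weights(model), num_examples=num_examples)
        if aggregated is not None:
            set_weights(model, aggregated)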

Is there something else you want to add?

No response

zzsi avatar Sep 30 '24 18:09 zzsi

Thanks @zzsi - I am not part of the Flower team, but I have been looking for something similar in terms of a serverless implementation. I would be interested in understanding whether it makes sense to create an architecture proposal around this for the Flower team, and a PR.

Additionally, instead of a central storage capability I was hoping for a peer-to-peer system, and I believe you have comments in your code about it. For that, I imagine clients could gossip updates of their model weights to each other.
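
As a rough illustration of the idea (a toy sketch only, not tied to the flwr_serverless API): each client would periodically average its weights with those pulled from a randomly chosen peer, instead of writing to a single shared folder.

# Toy gossip-averaging sketch (illustration only); weights are lists of numpy arrays.
import random

def gossip_step(my_weights, peer_weights_by_id, mixing=0.5):
    # Average local weights with one randomly selected peer's weights.
    peer_id = random.choice(list(peer_weights_by_id))
    peer_weights = peer_weights_by_id[peer_id]
    return [
        (1 - mixing) * mine + mixing * theirs
        for mine, theirs in zip(my_weights, peer_weights)
    ]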

Will comment more on your repo; commenting here since your repo has not had any activity in the past 8 months.

leeloodub avatar Nov 21 '24 18:11 leeloodub

Hi @leeloodub, apologies for my delayed reply. P2P is also a great option we considered, but we opted for central storage for its simplicity. Doing an architecture proposal and a PR sounds good to me. Do you have a use case in mind that motivates the implementation?

zzsi avatar Nov 29 '24 04:11 zzsi

Hi @zzsi,

this would be a great contribution to the framework. Would you be interested in opening a PR? We can support you throughout the process.

WilliamLindskog avatar Jan 28 '25 21:01 WilliamLindskog

@WilliamLindskog yes, will do, appreciate it!

zzsi avatar Jan 31 '25 17:01 zzsi

@zzsi Great! Looking forward to it. Let me know if I can help in any other way.

WilliamLindskog avatar Jan 31 '25 18:01 WilliamLindskog

Hi @zzsi, just checking in here. Did you open a PR?

WilliamLindskog avatar Mar 07 '25 14:03 WilliamLindskog

@WilliamLindskog I have some work in progress, half done, sorry for the delay.

We only had a tensorflow example, and I would like to add a torch quickstart. I will put more time into it next week.

zzsi avatar Mar 07 '25 15:03 zzsi

Okay great! Just thought I would check in. Lmk if we can help.

WilliamLindskog avatar Mar 07 '25 17:03 WilliamLindskog

@WilliamLindskog I am looking into simulating a serverless federated run. The current tooling seems to assume a client/server setup. Do you have recommendations, or a particular way of simulating that you prefer?

zzsi avatar Mar 07 '25 18:03 zzsi

Good question! @zzsi

There is no particular way to implement this, I think. Since Flower usually expects a ClientApp and a ServerApp to be passed, maybe we can pass an empty ServerApp, or dynamically "change" the location of the server component where the aggregation happens.
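
For illustration only (exact APIs and defaults would need checking), an "empty" ServerApp along those lines might look roughly like the sketch below, assuming the Flower 1.x ServerApp / ClientApp / run_simulation APIs: one round, no client-side evaluation, and all real aggregation left to the serverless nodes inside the ClientApp.

# Illustrative sketch only: a ServerApp that does essentially nothing, so that
# aggregation can happen on the client side via the serverless nodes.
import numpy as np
from flwr.client import ClientApp, NumPyClient
from flwr.common import Context
from flwr.server import ServerApp, ServerConfig
from flwr.server.strategy import FedAvg
from flwr.simulation import run_simulation

class ServerlessClient(NumPyClient):
    # Placeholder: real local training plus serverless aggregation
    # (shared folder + Async/SyncFederatedNode) would go inside fit().
    def get_parameters(self, config):
        return [np.zeros(1)]  # dummy weights for the pass-through strategy

    def fit(self, parameters, config):
        return parameters, 1, {}

def client_fn(context: Context):
    return ServerlessClient().to_client()

# One round, no evaluation: the server only orchestrates, it does not aggregate.
strategy = FedAvg(fraction_evaluate=0.0)
server_app = ServerApp(config=ServerConfig(num_rounds=1), strategy=strategy)
client_app = ClientApp(client_fn=client_fn)

if __name__ == "__main__":
    run_simulation(server_app=server_app, client_app=client_app, num_supernodes=2)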

Interested in hearing your thoughts.

WilliamLindskog avatar Mar 12 '25 16:03 WilliamLindskog

Hey all! I subscribed to the thread because I am also doing research in async FL. Thought about giving my 2 cents on this.

@WilliamLindskog In the entire serverless FL setup, the controller acts as the only static node and also keeps a client registry. It sounds similar to what a DNS does in a network (or it can do what a DNS does). It would make more sense to wrap the ServerApp around the controller. Of course, aggregation will take place elsewhere, but we could essentially reduce the existing ServerApp to a simple client registry and (possibly) an orchestration tool. The best part is that this also doesn't require us to modify how the ClientApp treats this new "Server" App.

Spaarsh avatar Mar 12 '25 17:03 Spaarsh

Thanks @WilliamLindskog and @Spaarsh for the great suggestions. I agree that an empty ServerApp, or a simple ServerApp acting as a client registry, would be a good way to simulate this. I created a PR for you to review: https://github.com/adap/flower/pull/5189. I used my previous standalone simulation script (please take a look at examples/serverless-pytorch) and will add a ServerApp implementation. My apologies for the delay (I have been traveling).

zzsi avatar Apr 02 '25 15:04 zzsi

@WilliamLindskog and @Spaarsh I added an empty server and a client that does serverless federation in examples/serverless-pytorch. In that directory, when I do flwr run ., the client app does not seem to have access to CUDA devices, even though python -c 'import torch; print(torch.cuda.is_available())' returns True. I wonder if you see the same behavior.

zzsi avatar Apr 11 '25 21:04 zzsi

Hi @zzsi, really cool PR. I tried running it using run_simulation and flwr run .; it seems to get stuck after round 1 with the latter.

Also, I saw that with run_simulation the client accuracies are quite good but the test accuracy is relatively low; is that expected?

Regarding CPU/CUDA, I did get some warnings and it defaulted to CPU; I need to look into it further.
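
One possibility worth checking (unverified): the simulation backend may allocate zero GPUs to each ClientApp by default, which would hide CUDA inside the client even when the host sees it. If that is the cause, requesting a GPU share per ClientApp could look roughly like the sketch below for run_simulation; for flwr run, the analogous pyproject.toml key should be options.backend.client-resources.num-gpus.

# Unverified sketch: request CPU/GPU resources per ClientApp in the simulation.
from flwr.simulation import run_simulation

backend_config = {
    "client_resources": {
        "num_cpus": 2,    # CPU cores reserved per ClientApp
        "num_gpus": 1.0,  # GPU share per ClientApp (fractions such as 0.5 allowed)
    }
}

run_simulation(
    server_app=server_app,  # the empty ServerApp from the example above
    client_app=client_app,  # the serverless ClientApp from the example above
    num_supernodes=2,
    backend_config=backend_config,
)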

WilliamLindskog avatar Apr 28 '25 20:04 WilliamLindskog

@WilliamLindskog thanks for trying it out. For the serverless-pytorch/ example, python run_simulation.py will simulate the serverless training for 20 epochs. And yes, a test accuracy of roughly 75% is expected, significantly lower than each client's training accuracy. Using more epochs and setting "skew_factor" to a smaller number (in run_simulation.py) would increase the test accuracy; I think you can get to 93%-94%.

I haven't figured out CUDA in flwr run yet.

zzsi avatar Apr 29 '25 03:04 zzsi

I see. I got 10% accuracy, so I might have run something wrong. Will try again tomorrow and get back to you ASAP.

WilliamLindskog avatar Apr 29 '25 05:04 WilliamLindskog

Hi @zzsi, is there a reason why the NNs are built on top of Keras? I've sometimes had issues when deploying it on GPUs; that could be what causes the run to default to CPU, though I haven't looked into it much.

WilliamLindskog avatar Apr 29 '25 12:04 WilliamLindskog

@WilliamLindskog currently both torch and tensorflow are supported. As for the examples, serverless-pytorch/ is the more optimized one; python run_simulation.py uses a torch.nn ResNet.

I realize that my TorchFederatedLearningRunner class is confusing. It has a base class that assumes keras models. I will refactor that.

zzsi avatar Apr 29 '25 20:04 zzsi

The torch example still used tensorflow to load the CIFAR-10 dataset. I pushed a change to clean up imports and dataset loading, to avoid accidentally importing tensorflow.

zzsi avatar Apr 29 '25 21:04 zzsi

Hi @zzsi,

I took another look at the PR. It's great work, and a lot of code to digest. Also, I'm not sure whether it follows the current Flower client <-> server setup (recognizing that this example would be different).

Could you try to reduce the amount of code in the linked PR, focusing only on the essential parts of the new code and making it as lightweight as possible? We are currently reviewing it.

Best regards William

WilliamLindskog avatar May 20 '25 13:05 WilliamLindskog

@WilliamLindskog I am happy to. ETA mid next week.

zzsi avatar May 22 '25 20:05 zzsi

@WilliamLindskog The PR has been trimmed to a much smaller change.

The entry point to run a simulation is run_simulation.py.

The main files to look at:

  • AsyncFederatedNode
  • ExperimentRunner

zzsi avatar May 28 '25 17:05 zzsi

Hi @WilliamLindskog, just checking whether you have had time to take a look at the updated PR (https://github.com/adap/flower/pull/5189)? Thanks!

zzsi avatar Jul 01 '25 14:07 zzsi