Support Serverless orchestration and Async strategies
Describe the type of feature and its functionality.
We have used flwr for a large scale (100s of TB) medical imaging use case. Thank you for this great library which made our life much easier.
When operating thousands of long running experiments, we faced a few pain points, mainly:
- Managing multiple servers for individual experiments became tedious, fragile and unsustainable.
- Different institutions (clients) have very different data and compute characteristics, which made training speed highly uneven and unstable, to the point that synchronization became a bottleneck.
To scratch our own itch, we implemented flwr_serverless as a wrapper around flwr, supporting both sync and async strategies. It allows federated training to run without a central server that aggregates models; the core federation functionality is passed through to flwr strategies. We summarized our learnings on public data in this tech report. With the added robustness from serverless + async operation, this implementation addressed our pain points and has allowed us to do large-scale experimentation with flwr FL for the past year. We think other teams may also find this feature useful. Feedback and critique are welcome.
PS: I should probably have submitted a feature request a year ago, but better late than never to contribute upstream. I also noticed related work on async strategies, which seem to be in increasing demand for practical deployments.
best, ZZ AT kungfu.ai
Describe step by step what files and adjustments you are planning to include.
We implemented SyncFederatedNode and AsyncFederatedNode to handle communication of model weights to a shared "Folder" (e.g. S3). For tensorflow/keras, we implemented a FlwrFederatedCallback that is easy to plug into the user's training code. This callback holds the federated node, which in turn manages model federation. We haven't implemented a torch integration, but it could be similar.
An example usage:
```python
# Create an FL node that has a strategy and a shared folder.
from flwr.server.strategy import FedAvg  # a flwr federated strategy
from flwr_serverless import AsyncFederatedNode, S3Folder
from flwr_serverless.keras import FlwrFederatedCallback

strategy = FedAvg()
shared_folder = S3Folder(directory="mybucket/experiment1")
node = AsyncFederatedNode(strategy=strategy, shared_folder=shared_folder)

# Create a keras callback with the FL node.
num_examples_per_epoch = steps_per_epoch * batch_size  # examples used in each epoch
callback = FlwrFederatedCallback(
    node,
    num_examples_per_epoch=num_examples_per_epoch,
    save_model_before_aggregation=False,
    save_model_after_aggregation=False,
)

# Join the federated learning by fitting the model with the federated callback.
model = keras.Model(...)
model.compile(...)
model.fit(dataset, callbacks=[callback])
```
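To make the shared-folder mechanics concrete, here is a toy, self-contained sketch of the async exchange, with a local directory standing in for S3 and a plain average standing in for the wrapped flwr strategy. `ToyAsyncNode` and its method names are illustrative, not the actual flwr_serverless API:

```python
# Toy sketch: an async "node" that swaps weights through a shared folder.
# A local temp directory stands in for S3; flwr_serverless delegates the
# aggregation step to a wrapped flwr strategy. All names are illustrative.
import json
import os
import tempfile
import uuid

class ToyAsyncNode:
    def __init__(self, folder):
        self.folder = folder
        self.node_id = uuid.uuid4().hex

    def push_and_aggregate(self, weights):
        # Publish this node's latest weights to the shared folder.
        path = os.path.join(self.folder, f"{self.node_id}.json")
        with open(path, "w") as f:
            json.dump(weights, f)
        # Aggregate whatever peers have published so far (simple average,
        # standing in for a strategy such as FedAvg).
        all_weights = []
        for name in os.listdir(self.folder):
            with open(os.path.join(self.folder, name)) as f:
                all_weights.append(json.load(f))
        n = len(all_weights)
        return [sum(ws) / n for ws in zip(*all_weights)]

folder = tempfile.mkdtemp()
node_a, node_b = ToyAsyncNode(folder), ToyAsyncNode(folder)
node_a.push_and_aggregate([1.0, 2.0])        # only A has published -> [1.0, 2.0]
avg = node_b.push_and_aggregate([3.0, 4.0])  # A and B published -> [2.0, 3.0]
print(avg)
```

Each node publishes its latest weights and then averages whatever peers have published so far, so no node ever blocks waiting for a central server.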
Is there something else you want to add?
No response
Thanks @zzsi - I am not part of Flower team, but I have been looking for something similar in terms of having a serverless implementation. I would be interested in understanding whether it makes sense to create an architecture proposal around this for Flower team, and a PR.
Additionally, instead of having a central storage capability, I was hoping to have a peer-to-peer system, and I believe you have comments in your code about it. For that, clients could gossip updates to their model weights to each other, I would imagine.
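The gossip idea could be sketched roughly like this (purely illustrative, not part of flwr or flwr_serverless): each client periodically picks a random peer and the pair averages their weights, so updates diffuse through the network without any central storage.

```python
# Toy sketch of gossip-based federation: in each round, two random
# clients average their weights pairwise. Pairwise averaging preserves
# the global mean, and repeated rounds drive all clients toward it.
import random

def gossip_round(weights_by_client, rng):
    clients = list(weights_by_client)
    a, b = rng.sample(clients, 2)
    # Both peers adopt the mean of their current weights.
    mixed = [(x + y) / 2 for x, y in zip(weights_by_client[a], weights_by_client[b])]
    weights_by_client[a] = mixed
    weights_by_client[b] = list(mixed)

rng = random.Random(0)
weights = {"c0": [0.0], "c1": [4.0], "c2": [8.0]}
for _ in range(50):
    gossip_round(weights, rng)
print(weights)  # all clients converge toward the global mean (4.0)
```

Compared with a shared folder, this removes the single storage dependency, but it adds the need for peer discovery and for reasoning about convergence under partial mixing.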
Will comment more on your repo, commenting here since your repo has not had any activity in the past 8 months.
Hi @leeloodub, apologies for my delayed reply. P2P is also a great option we considered, but we opted for central storage for its simplicity. Doing an architecture proposal and a PR sounds good to me. Do you have a use case in mind that motivates the implementation?
Hi @zzsi,
this would be a great contribution to the framework. Would you be interested in opening a PR? We can support you throughout the process.
@WilliamLindskog yes, will do, appreciate it!
@zzsi Great! Looking forward to it. Let me know if I can help in any other way.
Hi @zzsi, just checking in here. Did you open a PR?
@WilliamLindskog I have some work in progress, half done, sorry for the delay.
We only had a tensorflow example and I would like to add a torch quickstart. I will put more time on it next week.
Okay great! Just thought I would check in. Lmk if we can help.
@WilliamLindskog I am looking into simulating a serverless federated run. The current tooling seems to assume a client/server setup. Do you have recommendations, or a particular simulation approach you prefer?
Good question! @zzsi
There is no particular way to implement this, I think. Since Flower usually expects a ClientApp and a ServerApp to be passed, maybe we can pass an empty ServerApp, or dynamically "change" the location of the server component where the aggregation happens.
Interested in hearing your thoughts.
Hey all! I subscribed to the thread because I am also doing research in async FL. Thought about giving my 2 cents on this.
@WilliamLindskog In the entire Serverless FL setup, the controller acts as the only static node and keeps a client registry as well. It sounds similar to what a DNS does in a network (or it can do what a DNS does). It would make more sense to wrap the ServerApp around the controller. Of course, aggregation will take place elsewhere. But we could essentially reduce the existing ServerApp to simply a client registry and (maybe or maybe not) orchestration tool. The best thing is that it also doesn't require us to modify how the ClientApp treats this new "Server" App.
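A toy sketch of that reduced, DNS-like role (all names here are illustrative, not real flwr APIs): the one static component only tracks which clients exist, where their published updates live, and which are still fresh; it never touches model weights.

```python
# Toy sketch: a ServerApp reduced to a client registry. Clients register
# a location (e.g. their prefix in shared storage) and send heartbeats;
# peers query the registry to learn whose updates are fresh enough to
# aggregate. Names are illustrative, not part of flwr.
import time

class ClientRegistry:
    def __init__(self):
        self._entries = {}  # client_id -> (location, last_heartbeat)

    def register(self, client_id, location):
        self._entries[client_id] = (location, time.monotonic())

    def heartbeat(self, client_id):
        location, _ = self._entries[client_id]
        self._entries[client_id] = (location, time.monotonic())

    def lookup(self, max_age_s=60.0):
        # Return locations of clients seen recently, so peers know
        # which published weights are fresh enough to use.
        now = time.monotonic()
        return {cid: loc for cid, (loc, ts) in self._entries.items()
                if now - ts <= max_age_s}

registry = ClientRegistry()
registry.register("hospital_a", "s3://mybucket/experiment1/hospital_a")
registry.register("hospital_b", "s3://mybucket/experiment1/hospital_b")
print(sorted(registry.lookup()))  # ['hospital_a', 'hospital_b']
```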
Thanks @WilliamLindskog and @Spaarsh for the great suggestions. I agree an empty ServerApp, or a simple ServerApp acting as a client registry, would be a good way to simulate. I created a PR for you to review: https://github.com/adap/flower/pull/5189. I used my previous standalone simulation script (please take a look at examples/serverless-pytorch) and will add a ServerApp implementation. My apologies for the delay (I've been traveling).
@WilliamLindskog and @Spaarsh I added an empty server and a client that does serverless federation in examples/serverless-pytorch. In that directory, when I do `flwr run .`, the client app does not seem to have access to CUDA devices, even though `python -c 'import torch; print(torch.cuda.is_available())'` returns `True`. I wonder if you would see the same behavior.
Hi @zzsi, really cool PR. I tried running it using `run_simulation` and `flwr run .`; it seems to get stuck after round 1 in the latter.
Also, I saw that using `run_simulation`, the results are quite good for client accuracies but test accuracy is relatively low; is that expected?
Regarding cpu/cuda, I did get some warnings and it defaulted to cpu; I need to look into it further.
@WilliamLindskog thanks for trying it out. For the serverless-pytorch/ example, `python run_simulation.py` will simulate the serverless training for 20 epochs. And yes, test accuracy is expected to be roughly 75%, significantly lower than each client's training accuracy. Using more epochs and changing `skew_factor` to a smaller number (in run_simulation.py) would increase the test accuracy; I think you can get to 93%~94%.
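For context on why a smaller skew_factor helps: with label skew, each client mostly sees its own subset of classes, so per-client training accuracy can be high while global test accuracy lags. Here is a toy sketch of how such a skewed partition might work (the actual logic in run_simulation.py may differ):

```python
# Toy sketch of label skew: with probability skew_factor an example is
# routed to the client "assigned" to its label; otherwise to a random
# client. skew_factor=1.0 gives fully disjoint labels per client;
# skew_factor=0.0 gives an IID split. Illustrative only.
import random

def partition(labels, num_clients, skew_factor, rng):
    shards = [[] for _ in range(num_clients)]
    for idx, label in enumerate(labels):
        if rng.random() < skew_factor:
            client = label % num_clients  # label-aligned client
        else:
            client = rng.randrange(num_clients)
        shards[client].append(idx)
    return shards

rng = random.Random(0)
labels = [i % 10 for i in range(1000)]  # 10 balanced classes
shards = partition(labels, num_clients=2, skew_factor=0.9, rng=rng)
# With high skew, client 0 holds mostly even labels and client 1 mostly
# odd ones, so each client's training accuracy can beat global test accuracy.
even_on_0 = sum(labels[i] % 2 == 0 for i in shards[0]) / len(shards[0])
print(round(even_on_0, 2))
```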
I haven't figured out CUDA in `flwr run` yet.
I see, I got 10% accuracy; I might have run something wrong. Will try again tomorrow and get back to you asap.
Hi @zzsi, is there a reason why the NNs are built on top of Keras? I've sometimes had issues deploying Keras models on GPUs; that could be why the run defaults to cpu, though I haven't looked into it too much.
@WilliamLindskog currently both torch and tensorflow are supported. As for the examples, serverless-pytorch/ is more optimized; `python run_simulation.py` will use a torch.nn ResNet.
I realize that my TorchFederatedLearningRunner class is confusing. It has a base class that assumes keras models; I will refactor that.
The torch example still used tensorflow to load the cifar10 dataset. I pushed a change to clean up imports and dataset loading, to avoid accidentally importing tensorflow.
Hi @zzsi,
I took another look at the PR. It's great work, and a lot of code to digest. Also, I'm not sure it follows the current Flower client <-> server setup (recognizing that this example would be different).
Could you try to reduce the amount of code in the linked PR, focusing only on the essential new code and making it as lightweight as possible? We are reviewing it currently.
Best regards William
@WilliamLindskog I am happy to. ETA mid next week.
@WilliamLindskog The PR is trimmed to include a much smaller change.
The entrypoint to run a simulation is `run_simulation.py`.
The main files to look at:
Hi @WilliamLindskog, just checking if you have had time to take a look at the updated PR (https://github.com/adap/flower/pull/5189)? Thanks!