venv placement leads to weird server behavior

Open oabuhamdan opened this issue 8 months ago • 14 comments

Describe the bug

Hello, I ran into this issue and found the cause after about 10 days of debugging, but now I need an explanation, or it has to be fixed.

The figure below shows the network data exchange for a FL setup with 1 client and 1 server, training for 1 round. I use Mininet in my setup, so I was able to collect some network data.

In a proper network exchange, only the model weights are sent and received, which are 17MB in my case (mobilenet_v3_large). In the figure, the server sends ~400MB, and the client receives 17MB.

I collected the network packets with tcpdump, searching for any possible flaw. Looking into the packets' payloads, I noticed that a lot of python code is there! It turned out that the whole venv directory is being transferred in every single round.

Here is my directory tree before I fixed the issue (main.py only runs the Mininet code; it doesn't affect the FL setup).

FederatedLearning/
├── data
│   └── cifar10
├── experimentscode
│   ├── data -> /home/osama/FederatedLearning/data/
│   ├── fedlearning
│   │   ├── FlowerClient.py
│   │   ├── FlowerServer.py
│   │   ├── __init__.py
│   │   └── task.py
│   ├── logs -> /home/osama/FederatedLearning/logs/
│   ├── main.py
│   ├── prepare_dataset.py
│   ├── pyproject.toml
│   ├── requirements.txt
│   └── venv
└── logs

And here is the only change I made to fix the issue. No changes at all to the code or any other script. I just moved venv outside the code directory.

FederatedLearning/
├── data
│   └── cifar10
├── experimentscode
│   ├── data -> /home/osama/FederatedLearning/data/
│   ├── fedlearning
│   │   ├── FlowerClient.py
│   │   ├── FlowerServer.py
│   │   ├── __init__.py
│   │   └── task.py
│   ├── logs -> /home/osama/FederatedLearning/logs/
│   ├── main.py
│   ├── prepare_dataset.py
│   ├── pyproject.toml
│   └── requirements.txt
├── logs
└── venv

Do you have any explanation why the server behaves like this?

Best

Image

Steps/Code to Reproduce

Use any code. I tried the code in your PyTorch quickstart, and it behaved the same.

Expected Results

Send and receive model params only, not the whole venv dir!

Actual Results

Venv is being transferred.

oabuhamdan avatar Apr 22 '25 13:04 oabuhamdan

Hi @oabuhamdan , thanks for opening the issue. I have a suspicion, but could you please let me know which version of Flower you're using?

danieljanes avatar Apr 22 '25 13:04 danieljanes

Hi @danieljanes, I use 1.16.0. I updated to 1.17, but I didn't like the "Add node availability check to reduce wait time (https://github.com/adap/flower/pull/4968)" change, so I rolled back to 1.16.0.

oabuhamdan avatar Apr 22 '25 13:04 oabuhamdan

Oh interesting, could you elaborate on your perspective on the node availability check? It was a feature/change we discussed quite a bit internally, I'd love to hear your thoughts.

danieljanes avatar Apr 22 '25 13:04 danieljanes

Daniel, I used Flower 1.17 for a few hours only, so I might not describe the issue accurately, since I didn't debug the code enough. Anyway, I am doing my PhD in CS. I work on the networking side of FL, which is why I use Mininet to emulate a real network. Before 1.17, if the network was congested, the client would still be available, even if it caused delay. My research is to find a solution to this delay. In your 1.17 update, you just throw an error for such clients, which doesn't fit my research needs. I need all clients to stay connected, even if they cause delay.

Does the 1.17 update solve this weird issue I am describing in the ticket here? If yes, can you tell me how it is fixed, and why the issue is happening? If no, can you tell me why the issue is happening?

Best

oabuhamdan avatar Apr 22 '25 13:04 oabuhamdan

Thanks for elaborating @oabuhamdan !

My guess is that flwr run bundles venv in the Flower App Bundle (FAB) that gets sent to the SuperLink and the SuperNodes. That would explain the big difference. We'll investigate and get back to you.

danieljanes avatar Apr 23 '25 14:04 danieljanes

Hi @oabuhamdan,

Thanks for raising this and providing your insights. It is really helpful. I just checked and can confirm what @danieljanes said; flwr run bundles venv in the Flower App Bundle (FAB) that gets sent to the SuperLink and the SuperNodes.

When testing with PyTorch Quickstart Example, the FAB is 180MB when including venv and 28KB without.
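A rough way to see the same effect locally is to compare the on-disk size of the project directory with and without venv. The sketch below is only an illustration: `dir_size_bytes` is a hypothetical helper, not part of Flower, and it counts raw file bytes, so the numbers will not match the FAB size exactly.

```python
import os


def dir_size_bytes(root, exclude_dirs=()):
    """Sum file sizes under root, skipping directories named in exclude_dirs."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinked files so linked data is not counted.
            if os.path.islink(path):
                continue
            total += os.path.getsize(path)
    return total
```

Calling `dir_size_bytes("experimentscode")` and `dir_size_bytes("experimentscode", exclude_dirs=("venv",))` on the layout from the original post should make the contribution of venv obvious.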

What changes would you suggest to reduce the risk of running into this?

WilliamLindskog avatar Apr 23 '25 15:04 WilliamLindskog

Hi guys, Thanks for looking into this.

First, since we decide where the code is in the pyproject.toml, specifically in

[tool.flwr.app.components]
serverapp = "fedlearning.FlowerServer:app"
clientapp = "fedlearning.FlowerClient:app"

Why not bundle only the code in the fedlearning package?

Second, why does the server send the whole bundle every round? And why is the client not receiving it? According to your v1.11 release notes, the code is not supposed to change between rounds of the same run, only between different runs submitted to an already-running federation (SuperLink and SuperNodes). Thus, the whole FAB should not be shipped in every round.

As another suggestion, you could introduce a mechanism similar to .gitignore, where the user decides what should not be included in the FAB. I believe this is the last thing to consider, since the two points above make this feel more like a bug.

Best!

oabuhamdan avatar Apr 23 '25 15:04 oabuhamdan

@oabuhamdan one question: does your project have a .gitignore and does it ignore venv? flwr run should ignore files/dirs listed there.

EDIT: https://github.com/adap/flower/blob/4fa7a807e860d1c6f4d3ac849c9f6c6d1990a53f/framework/py/flwr/cli/build.py#L111
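To illustrate the idea of excluding files by pattern before bundling, here is a simplified sketch. It is not Flower's implementation (that lives in the build.py linked above) and it does not follow full .gitignore semantics: it only matches individual file and directory names with `fnmatch`. The names `files_to_bundle` and `_ignored` are hypothetical.

```python
import fnmatch
import os


def _ignored(name, patterns):
    """Return True if a file or directory name matches any pattern."""
    # A trailing slash (as in "venv/") is stripped so the bare name is compared.
    return any(fnmatch.fnmatch(name, p.rstrip("/")) for p in patterns)


def files_to_bundle(root, ignore_patterns):
    """List relative file paths under root that survive the ignore patterns."""
    kept = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories so their contents are never visited.
        dirnames[:] = [d for d in dirnames if not _ignored(d, ignore_patterns)]
        for name in filenames:
            if not _ignored(name, ignore_patterns):
                full_path = os.path.join(dirpath, name)
                kept.append(os.path.relpath(full_path, root))
    return sorted(kept)
```

In an actual project, the equivalent step is simply adding a line such as `venv/` to .gitignore.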

danieljanes avatar Apr 23 '25 15:04 danieljanes

does your project have a .gitignore and does it ignore venv

I have .gitignore but it doesn't ignore venv. I use PyCharm and I keep it excluded from the IDE itself. Still, we need to fix things for Flower. Not everyone uses version control in their projects, and, as in my case, we have other ways of ignoring venv. Let's not forget the .git/info/exclude too!

oabuhamdan avatar Apr 23 '25 16:04 oabuhamdan

Using .gitignore is the preferred way of excluding files/directories. But I see your point, we might reconsider this and think about changing the default to something that includes fewer files/dirs.

Re your other point:

Second, why the server sends the whole bundle every round?

This shouldn't be the case, each client should have to download the FAB just once. In the initial post it says that the training runs for one round only, could you test it with two rounds?

danieljanes avatar Apr 24 '25 07:04 danieljanes

This shouldn't be the case, each client should have to download the FAB just once. In the initial post it says that the training runs for one round only, could you test it with two rounds?

Hi, I am not ignoring this, but I am a bit busy with my research. I'll share the results with you once I have them.

oabuhamdan avatar Apr 30 '25 16:04 oabuhamdan

Hi @oabuhamdan,

just checking in to see if there're any updates here?

Best regards William

WilliamLindskog avatar Jun 02 '25 19:06 WilliamLindskog

Hi @oabuhamdan,

This is a friendly follow-up. Could you please let us know if you were able to test this, or plan to?

Best regards William

WilliamLindskog avatar Jun 11 '25 15:06 WilliamLindskog

Hello,

Here are the results. Data is in MB.

This is when the venv is inside the code directory. Notice that it's exchanged in every single round, which causes an explosion of the RX bytes in each client.

Image

Here is the regular case

Image

oabuhamdan avatar Jun 20 '25 17:06 oabuhamdan