
Flower server sending model to Android client

Open Sorna-Meena opened this issue 2 years ago • 9 comments

Hi @danieljanes,

I am using Flower on Android, and I noticed that the Flower server doesn't send the model to the Android clients; instead, the model is pre-built on the client device. How can this be resolved?

Also, currently, after the clients finish training 'n' rounds and return their losses and metrics to the server, the terminal running the server script terminates. The next time we run the server script, I am not sure whether the clients get the updated model weights from the previous FL training. Why is this happening, and how can it be solved?

Sorna-Meena avatar Dec 15 '21 09:12 Sorna-Meena

Hi,

I want to use Flower on Android as well. Are you using the app provided by Flower? If so, could you please tell me how you installed the app, and maybe the type of your mobile as well? I tried installing it on 3 different Android mobiles, and none of them worked... :(

Victoria-Wei avatar Dec 15 '21 13:12 Victoria-Wei

Hi @Victoria-Wei ,

I followed the steps mentioned in the repo: https://github.com/adap/flower/tree/main/examples/android. You can install the app from the link https://www.dropbox.com/s/e14t3e9py3mr73v/flwr_android_client.apk?dl=1. This link is mentioned in the download_apk.sh file too.

Try installing the app on a mobile that has Android version >= 11.

Hope that works!!

Sorna-Meena avatar Dec 16 '21 04:12 Sorna-Meena

Dear @Sorna-Meena,

Thank you for your reply!!! It might be a problem with the Android version, since I did the exact same thing as you mentioned, except that my mobiles have Android version < 10. I'll look for an emulator to install the app then.

Thank you again!!!!!!!

Victoria-Wei avatar Dec 16 '21 05:12 Victoria-Wei

Indeed, the model is loaded on the Android client side, as shown in the following source code portion of examples/android/client/app/src/main/java/flwr/android_client/TransferLearningModelWrapper.java:

    TransferLearningModelWrapper(Context context) {
        model =
                new TransferLearningModel(
                        new AssetModelLoader(context, "model"),
                        Arrays.asList("cat", "dog", "truck", "bird",
                                "airplane", "ship", "frog", "horse", "deer",
                                "automobile"));
        // ...
    }

It would be good to create a new server-client application that could accept new models from the server and load them over the network instead. How could we achieve this? I have an idea of what could be done in an easy way that should not require a huge source code change.

The main idea would be to load the model offered by the server, in the shape of a URL, when the client asks for it. Currently, the model is loaded using the AssetModelLoader, which is defined as follows in examples/android/client/transfer_api/src/main/java/org/tensorflow/lite/examples/transfer/api/AssetModelLoader.java:

  public AssetModelLoader(Context context, String directoryName) {
    this.directoryName = directoryName;
    this.assetManager = context.getAssets();
  }

This shows that "model" is the directory name used by our client call. The model itself is not in the repository; its location is defined in the examples/android/client/app/build.gradle build configuration:

def modelUrl = 'https://www.dropbox.com/s/tubgpepk2q6xiny/models.zip?dl=1'
def modelArchivePath = "${buildDir}/model.zip"
def modelTargetLocation = 'src/main/assets/model'

This model is accompanied by its training data as well.

If we wanted to use another model, no changes would be required to the strategy file src/py/flwr/server/strategy/fedavg_android.py, but the following steps would be required for the example Android project:

  1. The TransferLearningModelWrapper initialization function (the TransferLearningModel array declaration) and IMAGE_SIZE should be modified; if the input image size is not square, some other modifications might be required.
  2. modelUrl and dataUrl should be modified to point to the other model and training dataset.

sisco0 avatar Dec 20 '21 00:12 sisco0

@sisco0 Sorry for the late reply and thank you very much for your detailed response! I now understand how the model is loaded on the client side.

However, I do have a few questions. First, it is still unclear to me why the server script terminates after the federated learning process is over.

Secondly, after the FL process is over, where is the global model stored? Ideally, in an FL setup, once a pre-defined criterion is met (here, the number of rounds), the server aggregates the updates and finalizes the global model. The next time a federation round starts, the clients use this updated global model for training. How is this implemented in Flower?

Lastly, I want to use the updated global model, i.e., the model obtained from the previous round of FL, when I run the server script again. How can this be done?

Thank you again!!

Sorna-Meena avatar Jan 03 '22 07:01 Sorna-Meena

On the question of why the federated learning stops after a certain number of rounds: the answer lies in the num_rounds configuration parameter of the server. When the set number of rounds is reached, disconnect_all_clients() is called, which performs a graceful shutdown by calling shutdown() and disconnecting all the clients. disconnect_all_clients() is the last function executed in _fl(), near the end of the start_server() function, and that is why your server shuts down after a fixed number of training rounds.

https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr/server/app.py#L132 https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr/server/server.py#L287-L290

How could we get an infinite number of rounds so that our server is always learning? We should look into the server.fit() function, which contains the loop that runs until the configured number of rounds is reached. Currently, there is no way to loop forever, since server.fit() uses a range-based for loop over current_round. We could change this loop to a `while True`, or add a new configuration option for cases where num_rounds == -1, but this is not currently implemented (it would be an easy source code modification). If you want to implement it quickly, you could use an endless loop, but I would consider some kind of flag-based approach for gently shutting down the server, so that we are not just pressing Ctrl+C in the middle of a weight-saving process. You could register a signal handler for signal.SIGINT, or catch the KeyboardInterrupt exception, which is raised when SIGINT arrives under the default handler.

https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr/server/app.py#L108 https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr/server/server.py#L136
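A minimal sketch of that flag-based shutdown idea (the names `ShutdownFlag` and `fit_until_stopped` are hypothetical, not part of the Flower API):

```python
import signal

class ShutdownFlag:
    """Flips `stop` when SIGINT arrives, so the training loop can finish
    the current round (including any weight saving) before exiting."""

    def __init__(self, install_handler=True):
        self.stop = False
        if install_handler:  # signal.signal only works on the main thread
            signal.signal(signal.SIGINT, lambda signum, frame: self.request_stop())

    def request_stop(self):
        self.stop = True

def fit_until_stopped(fit_round, flag):
    """Hypothetical replacement for the range-based loop in Server.fit():
    run rounds until the flag is set, then return the last completed round."""
    current_round = 0
    while not flag.stop:
        current_round += 1
        fit_round(current_round)
    return current_round
```

Because the flag is only checked between rounds, pressing Ctrl+C never interrupts a savez call mid-write.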

On the question of where the trained model lives during and after the training process: at the end of each training round, the parameters are stored into self.parameters on the Server class instance; they are taken from the res_fit variable produced by the self.fit_round() call. We can also store models during training; this is normally referred to as storing model checkpoints, and it can be implemented as a function in the server strategy. For example, the SaveModelStrategy example calls savez for each round that has been run. You could reuse this savez call in any modified strategy that you want. It also runs at the end of the last training round, so the final model is stored as well. You could, of course, overwrite the same data file on each new round instead.

https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr/server/server.py#L137-L142 https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr_example/pytorch_save_weights/server.py#L30-L42

Then, how do I start with my last shiny trained model that I stored at checkpoints? Just set the initial_parameters argument using weights_to_parameters after loading your weights file (which you stored previously using savez). An example for TensorFlow is attached below.

https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/examples/advanced_tensorflow/server.py#L8-L27
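To make that concrete, here is a numpy-only sketch of the save/load round trip (the file naming follows the SaveModelStrategy example above; feeding the result into initial_parameters via weights_to_parameters is then a one-liner):

```python
import glob
import numpy as np

def save_checkpoint(rnd, weights):
    # Unpack the list so each layer is stored as its own plain array
    # (arr_0, arr_1, ...) and no pickling is needed on reload.
    np.savez(f"round-{rnd}-weights.npz", *weights)

def load_latest_checkpoint(pattern="round-*-weights.npz"):
    """Return the weights of the most recent round, or None if no file exists."""
    files = sorted(glob.glob(pattern), key=lambda f: int(f.split("-")[1]))
    if not files:
        return None
    with np.load(files[-1]) as data:
        return [data[f"arr_{i}"] for i in range(len(data.files))]
```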

sisco0 avatar Jan 03 '22 23:01 sisco0

@sisco0 Thank you very much for your detailed explanation!! It was very helpful for me to understand.

But I did notice something in the saving model using SaveModelStrategy method you mentioned. https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr_example/pytorch_save_weights/server.py#L30-L42

In line 37, does the super().aggregate_fit() function actually return the aggregated parameters rather than weights? I ask because I keep encountering the following error:

Error:

Traceback (most recent call last):
  File "flower/examples/android/server.py", line 118, in <module>
    main()
  File "flower/examples/android/server.py", line 94, in main
    initial_parameters=fl.common.weights_to_parameters(weights),
  File "C:\Users\xxxxxx\miniconda3\envs\flower\lib\site-packages\flwr\common\parameter.py", line 28, in weights_to_parameters
    tensors = [ndarray_to_bytes(ndarray) for ndarray in weights]
TypeError: iteration over a 0-d array

Process finished with exit code 1

After I changed lines 37-42 as follows, I no longer encountered the above error, but the fit_round fails for all clients, and the server output either freezes after the fit_round (shown in the Output below) or returns the weights as None, which in turn makes it unable to save the model weights.

Code:

aggregated_parameters_tuple = super().aggregate_fit(rnd, results, failures)
aggregated_parameters, _ = aggregated_parameters_tuple
if aggregated_parameters is not None:
    print(f"Saving round {rnd} aggregated_weights..")
    # Convert `Parameters` to `List[np.ndarray]`
    aggregated_weights: List[np.ndarray] = fl.common.parameters_to_weights(aggregated_parameters)
    np.savez(f"round-{rnd}-weights.npz", aggregated_weights)
return aggregated_weights

Output:

INFO flower 2022-01-07 18:19:05,236 | server.py:106 | Initializing global parameters
INFO flower 2022-01-07 18:19:05,236 | server.py:290 | Using initial parameters provided by strategy
INFO flower 2022-01-07 18:19:05,236 | server.py:109 | Evaluating initial parameters
INFO flower 2022-01-07 18:19:05,237 | server.py:122 | FL starting
DEBUG flower 2022-01-07 18:21:21,477 | server.py:241 | fit_round: strategy sampled 2 clients (out of 2)
DEBUG flower 2022-01-07 18:21:22,048 | server.py:250 | fit_round received 0 results and 2 failures
Saving round 1 aggregated_weights...
DEBUG flower 2022-01-07 18:22:18,744 | server.py:190 | evaluate_round: strategy sampled 2 clients (out of 2)

Even when I try to reconnect the clients to the server, the server output freezes on the evaluate_round line, i.e., the last line in the output above, and there is also no response from the app (the client side). Please note that I have not set any eval_fn in my Strategy initialization. Could the failure of rounds be due to that? If so, what should I initialize the model as in eval_fn for the Android client example of Flower?

Also, I would like to know how the global model parameters are sent to the clients (the app), and how the model can be saved on the client side (Android device/emulator).

Thank you once again!

Sorna-Meena avatar Jan 07 '22 13:01 Sorna-Meena

As you wisely pointed out, we are indeed returning parameters from the aggregate_fit() function of FedAvg (the superclass of this strategy), as can be seen in src/py/flwr/server/strategy/fedavg.py:257. We should fix this behavior by adding a parameters_to_weights() call in src/py/flwr_example/pytorch_save_weights/server.py:37. That would solve the savez issue you are having, as we were passing parameters instead of weights. This error was introduced by commit 79bcf952 (2021-05-09), which came after the example was created by commit 3f06d544 (2020-12-03), when deprecation messages about using weights started to show up. The real root cause is that the example has no poetry project under it, so there is no pyproject.toml file pinning the Flower version the example should run on (which should be any version before 2021-05-09); this is normally handled by creating a poetry project, as can be seen in other examples.

You have two alternatives here to fix this error:

  1. Install a Flower version in which aggregate_fit() still returns weights (not parameters), possibly creating a pyproject.toml file to maintain this example and contributing it to the Flower repository.
  2. Update the example strategy to convert the received parameters into weights before calling savez(), so it works with the latest Flower release, and create the pyproject.toml file along with a README.md containing instructions on how to run this example (this is my recommended option).
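For option 2, the pyproject.toml could look roughly like this (a sketch only; the package names, version pins, and metadata are assumptions, following the layout of the other examples):

```toml
[tool.poetry]
name = "pytorch_save_weights"
version = "0.1.0"
description = "Flower example: save aggregated weights every round"
authors = ["..."]

[tool.poetry.dependencies]
python = "^3.7"
flwr = "^0.17.0"
torch = "^1.9.0"
torchvision = "^0.10.0"
```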

By the way, are the new examples expected to live under examples or under src/py/flwr_example folder @tanertopal @danieljanes ?

I attach the related source code portions for better understanding. https://github.com/adap/flower/blob/1c933184e2fd6ad0476064678b8966a2f3728624/src/py/flwr/server/strategy/fedavg.py#L240-L257 https://github.com/adap/flower/blob/1c933184e2fd6ad0476064678b8966a2f3728624/src/py/flwr_example/pytorch_save_weights/server.py#L30-L42
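For context on why the raw bytes in Parameters.tensors cannot be fed straight into numpy: each entry is a whole serialized .npy blob, not raw float data. A sketch of the conversion (this mirrors what flwr.common.parameter does, to my understanding; treat the details as an assumption):

```python
from io import BytesIO
from typing import List
import numpy as np

def ndarray_to_bytes(a: np.ndarray) -> bytes:
    """Serialize one layer as a full .npy blob (header + dtype + shape + data)."""
    buf = BytesIO()
    np.save(buf, a, allow_pickle=False)
    return buf.getvalue()

def bytes_to_ndarray(b: bytes) -> np.ndarray:
    """Inverse of ndarray_to_bytes. Note that np.frombuffer would NOT work
    here, since it would reinterpret the .npy header bytes as float data."""
    return np.load(BytesIO(b), allow_pickle=False)

def parameters_to_weights(tensors: List[bytes]) -> List[np.ndarray]:
    return [bytes_to_ndarray(t) for t in tensors]
```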

The following questions still need to be answered:

  1. How are the global model parameters being sent to the clients?
  2. How can the model be saved in the client side?

sisco0 avatar Jan 07 '22 15:01 sisco0

@sisco0 Thank you for your immediate response!

As you suggested, I have tried your recommended option to fix the error. But I get the following error:

Error

DEBUG flower 2022-01-10 16:06:46,573 | server.py:241 | fit_round: strategy sampled 2 clients (out of 2)
DEBUG flower 2022-01-10 16:07:08,440 | server.py:250 | fit_round received 2 results and 0 failures
Traceback (most recent call last):
  File "C:\Users\xxxxxx\miniconda3\envs\flower\lib\site-packages\numpy\lib\npyio.py", line 444, in load
    raise ValueError("Cannot load file containing pickled data "
ValueError: Cannot load file containing pickled data when allow_pickle=False

Process finished with exit code -1
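This ValueError typically appears when np.savez receives a whole list of weights as a single argument: numpy then packs it into one object array, which can only be reloaded with allow_pickle=True. A sketch of the pattern that avoids it (the file name follows the example above):

```python
import numpy as np

weights = [np.ones((2, 3), dtype=np.float32), np.zeros(5)]

# Unpack the list so each layer is saved as its own plain array
# (arr_0, arr_1, ...); no pickling is then needed on reload.
np.savez("round-1-weights.npz", *weights)

with np.load("round-1-weights.npz") as data:  # allow_pickle defaults to False
    restored = [data[f"arr_{i}"] for i in range(len(data.files))]
```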

Instead of converting parameters to weights using the fl.common.parameters_to_weights function, I used the following code to avoid the ValueError.

Code

 class SaveModelStrategy(fl.server.strategy.FedAvg):
     def aggregate_fit(
         self,
         rnd: int,
         results: List[Tuple[fl.server.client_proxy.ClientProxy, fl.common.FitRes]],
         failures: List[BaseException],
     ) -> Optional[fl.common.Weights]:
         aggregated_parameters_tuple = super().aggregate_fit(rnd, results, failures)
         aggregated_parameters, _ = aggregated_parameters_tuple
         if aggregated_parameters is not None:
             # Save aggregated_weights
             weights_list = [np.frombuffer(tensor) for tensor in aggregated_parameters.tensors]
             print(f"Saving round {rnd} aggregated_weights...")
             np.savez(f"round-{rnd}-weights.npz", weights_list)
         return aggregated_parameters_tuple

Though my error is fixed, training always fails for all clients during the fit_round, and the server output either freezes after the fit_round or returns the weights as None, which in turn makes it unable to save the model weights. Even when I try to reconnect the clients to the server, the server output freezes in the evaluate_round, and there is also no response from the app (the client side). How can this be fixed?

Thanks in advance!

Sorna-Meena avatar Jan 09 '22 03:01 Sorna-Meena