WIP: Testing communication resilience
This PR introduces a bash script that emulates cutting the network connection of one or several FL actors. All actors run in Docker containers attached to a Docker network; a disconnection can be triggered by a specific pattern in a container's stdout logs.
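The log-triggered disconnection described above can be sketched as a small shell function (a hypothetical sketch, not the script's actual code): stream a container's logs and fire a handler the first time a given pattern appears. The log-producing command and the handler are passed in as parameters so the sketch is self-contained; in the real setup they would be along the lines of `docker logs -f <container>` and `docker network disconnect <network> <container>`.

```shell
# Hypothetical sketch: run a log-producing command ("$@"), and the first
# time a line contains $pattern, run $handler and stop watching.
watch_and_trigger() {
  local pattern="$1" handler="$2"
  shift 2
  "$@" | while IFS= read -r line; do
    case "$line" in
      *"$pattern"*)
        $handler   # e.g. "docker network disconnect fed_net collaborator2"
        break
        ;;
    esac
  done
}
```

For example: `watch_and_trigger "Run 0 epoch of 1 round" "docker network disconnect fed_net collaborator2" docker logs -f collaborator2` (the network and container names here are made up for illustration).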
There are several experiments that can be conducted to test the Collaborator's tolerance to network breakage:
- A. cut it off while it is CPU bound, performing some calculations
- B. cut it off while it is waiting for the Aggregator's response
- C. cut it off while it is sending a message
- D. cut it off while it is receiving a message
Luckily, the Collaborator's gRPC client features exactly three RPCs: `get_tasks`, `get_aggregated_tensor`, and `send_local_task_results`.
There is a peculiarity in the Collaborator's gRPC client implementation: the aggregated model is fetched on a tensor-by-tensor basis, while the local model is sent in one piece, as a stream. It is therefore easier to catch a collaborator sending data than receiving it.
How to use:
- You need a clean, supported Python virtual environment with an upgraded pip and openfl installed.
- Source the script with the desired parameters; it will create all the required artifacts.
The `test-federation` directory will contain logs for all the actors, which you can analyze.
Please note that our tests still require a lot of manual intervention in the script, so treat it as just a base layer. The script accepts the following parameters:
- `STAGE` - integer. 1 - start by building the base image; 2 - start by creating a new workspace and dockerizing it; 3 - certify the federation, choosing any number of collaborators; 4 - run the existing image.
- `NUMBER_OF_COLS` - integer. You can start as many collaborators as you want, but the script will disconnect only the last one!
- `RECONNECTION_TIMEOUT` - integer, seconds. Timeout before connecting the last collaborator back to the network.
- `MAXIMUM_DISCONNECTIONS` - integer. In case you want to limit the number of disconnections throughout the experiment.
- `CUT_ON_LOG` - string. A pattern in the logs that will trigger disconnections. The script uses wildcards to detect the pattern.
- `TEMPLATE` - string. The OpenFL template to use.
Returning to the experiments.
A: cut it off when it is CPU bound, performing some calculations.
We can use the `Run 0 epoch of N round` log message in the keras_cnn_mnist experiment to trigger a disconnection while the collaborator performs computations, with the following command:
```shell
bash tests/github/docker_disconnecting_test.sh 1 2 20 1 "Run 0 epoch of 1 round"
```
It will disconnect the second collaborator when training for the first round starts. Twenty seconds of disconnection is enough in this case. The collaborator is affected once it tries to send the task results. It recovers, seemingly thanks to #465, which resends task results when a grpc.StatusCode.UNKNOWN error code is received on the client side. Logs are attached:
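The actual resend logic from #465 lives in the Python gRPC client, but the idea it relies on can be expressed as a generic retry helper in shell (a hedged sketch, not OpenFL code): retry an operation a bounded number of times until it succeeds.

```shell
# Generic retry helper: run the command in "$@" up to $1 times, sleeping
# $2 seconds between attempts; succeed as soon as the command does.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for i in $(seq 1 "$attempts"); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

Per the description above, the client-side retry in #465 is keyed specifically on grpc.StatusCode.UNKNOWN, so only failures consistent with a dropped connection are retried rather than every error.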
A_aggregator.log
A_collaborator1.log
A_collaborator2.log
B: cut it off when it is waiting for the Aggregator's response
C: cut it off when it is sending a message.
We can use the `send_local_task_results` RPC to drop the connection while task results are being sent.
Here is the exact command I used. `Setting stream chunks with size 10` is a debug message inside the RPC; 10 is a number specific to the default keras_cnn_mnist experiment.
```shell
bash tests/github/docker_disconnecting_test.sh 4 2 20 2 "Setting stream chunks with size 10"
```
Logs are attached. TL;DR: the disconnected collaborator never recovers after reconnection.
C_aggregator.log
C_collaborator1.log
C_collaborator2.log
@igor-davidyuk this is functional in its current state, correct? Can we merge this for the 1.5 release and continue to make improvements later?