
RPC Error: Stream removed

Open jsw-zorro opened this issue 3 years ago • 10 comments

Hi, I am using Flower to train my algorithm in a federated way, and it was set to run for 100 epochs. It ran well for the first 10 epochs. However, at the end of the 10th epoch, an error appeared on one node:

  File "/home/fl/miniconda3/envs/pytorch/lib/python3.7/site-packages/flwr/client/app.py", line 115, in start_numpy_client
    grpc_max_message_length=grpc_max_message_length,
  File "/home/jinsw/miniconda3/envs/pytorch/lib/python3.7/site-packages/flwr/client/app.py", line 64, in start_client
    server_message = receive()
  File "/home/fl/miniconda3/envs/pytorch/lib/python3.7/site-packages/flwr/client/grpc_client/connection.py", line 60, in <lambda>
    receive: Callable[[], ServerMessage] = lambda: next(server_message_iterator)
  File "/home/fl/miniconda3/envs/pytorch/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/home/fl/miniconda3/envs/pytorch/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Stream removed"
	debug_error_string = "{"created":"@1618418527.047689601","description":"Error received from peer ipv6:[::]:8080","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Stream removed","grpc_status":2}"
>

The training is set up on the same machine with multiple GPUs. I am using flwr 0.15.0 along with PyTorch 1.8.1+cu102.

Thanks for all the help.

jsw-zorro avatar Apr 14 '21 16:04 jsw-zorro

Hi @jsw-zorro, this looks like the server is not running anymore. Are there any additional details that you could share?

danieljanes avatar Apr 16 '21 08:04 danieljanes

Hi @danieljanes, what other information do you need? I don't think it is the fl_server's problem: when I rerun the client, it can still communicate with the server and continue the federated training procedure.

jsw-zorro avatar Apr 19 '21 06:04 jsw-zorro

The best case would be a repo that we can use to reproduce the issue. Is your code open source, or could you provide a shortened version focused on the issue at hand?

danieljanes avatar Apr 19 '21 18:04 danieljanes

I have the same issue when I run a simulation on one machine with multiple GPUs:

File "/homec/x/miniconda3/envs/pt/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Stream removed"
	debug_error_string = "{"created":"@1622163497.974853980","description":"Error received from peer ipv6:[::1]:27002","file":"src/core/lib/surface/call.cc","file_line":1066,"grpc_message":"Stream removed","grpc_status":2}"
>

and the server reports one failure:

DEBUG flower 2021-05-28 08:58:17,284 | server.py:264 | fit_round: strategy sampled 3 clients (out of 3)
DEBUG flower 2021-05-28 08:59:14,405 | server.py:273 | fit_round received 2 results and 1 failures

All the other clients get stuck when the client running on GPU 6 exits abnormally:

[5] GeForce RTX 2080 Ti | 32'C,   0 % |  7614 / 11019 MB | x(7611M)
[6] GeForce RTX 2080 Ti | 37'C,   0 % |     0 / 11019 MB |
[7] GeForce RTX 2080 Ti | 29'C,   0 % |  7614 / 11019 MB | x(7611M)

I am using commit a7b2d76c35a7c7d23c9edb5d62786aebc1a57009 and Python 3.8 with PyTorch (py3.8_cuda10.2_cudnn7.6.5_0).

ddayzzz avatar May 28 '21 01:05 ddayzzz

I have the exact same problem; it seems that gRPC is losing its connection randomly. I have tried quite a few things, including forking and upgrading gRPC, adding a retry config to the client connection, etc. The error still persists and the training just randomly crashes.

I'm running on an Ubuntu Slurm node with 4 A100 GPUs, and after a while the training just crashes randomly, with the clients connected to either 0.0.0.0:someport or localhost:someport.

ncioj10 avatar Nov 06 '21 23:11 ncioj10

@danieljanes @ncioj10 Have you already found a solution to this problem? I have a very similar problem where, after 5 epochs (approximately 10 minutes), my training crashes with the "stream removed" error:

DEBUG flower 2022-01-05 13:47:23,197 | connection.py:68 | Insecure gRPC channel closed
Traceback (most recent call last):
    fl.client.start_numpy_client("localhost:5001", client, grpc_max_message_length=1073741824)
  File "/opt/conda/lib/python3.7/site-packages/flwr/client/app.py", line 115, in start_numpy_client
    grpc_max_message_length=grpc_max_message_length,
  File "/opt/conda/lib/python3.7/site-packages/flwr/client/app.py", line 64, in start_client
    server_message = receive()
  File "/opt/conda/lib/python3.7/site-packages/flwr/client/grpc_client/connection.py", line 60, in <lambda>
    receive: Callable[[], ServerMessage] = lambda: next(server_message_iterator)
  File "/opt/conda/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/opt/conda/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Stream removed"
        debug_error_string = "{"created":"@1641390443.197185834","description":"Error received from peer ipv4:127.0.0.1:5001","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Stream removed","grpc_status":2}"
>

I am training with two clients (connected to localhost:5000; the server is at 0.0.0.0:5000) on the same GPU (NVIDIA GeForce RTX 3090) in the same Docker container with tmux. I'm running Ubuntu 18.04 with Python 3.7, PyTorch 1.10.0, and flwr 0.17.0.

sabinavanrooij avatar Jan 05 '22 14:01 sabinavanrooij

Yes, I did, and it is actually a weird bug in gRPC's use of the polling strategy. Apparently, the Python gRPC implementation uses epollex by default, which causes this bug (the processes seem to cause race conditions in the polling queue, or something similar). You have to set export GRPC_POLL_STRATEGY=epoll1; that should resolve the bug.
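
For reference, a minimal sketch of applying this workaround from inside the Python client script rather than in the shell. Placing the assignment before the flwr import is a precaution (gRPC reads GRPC_POLL_STRATEGY when its core initializes, so it has to be in the environment before the first channel is created); it is not something the comment above prescribes.

import os

# Force gRPC's epoll1 polling engine to work around the "Stream removed" crash.
# The variable must be in the environment before the first gRPC channel is
# created, so the safest placement is before importing flwr (which imports grpc).
os.environ["GRPC_POLL_STRATEGY"] = "epoll1"

import flwr as fl  # imported only after the environment variable is set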

ncioj10 avatar Jan 05 '22 17:01 ncioj10

Thanks a lot! Your solution seems to work 👍

sabinavanrooij avatar Jan 06 '22 08:01 sabinavanrooij

Thank you so much! I set it up as os.environ['export GRPC_POLL_STRATEGY']='epoll1'

jithishj avatar Jun 01 '22 03:06 jithishj

@jithishj

Thank you so much! I set it up as os.environ['export GRPC_POLL_STRATEGY']='epoll1'

Remove the export keyword, as it is used to define an environment variable in bash.

So, you just need to define:

os.environ["GRPC_POLL_STRATEGY"] = "epoll1"

gquittet avatar Jun 09 '22 14:06 gquittet