Too many pings and one client always disconnects
Describe the bug
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Too many pings"
debug_error_string = "UNKNOWN:Error received from peer ipv4:192.168.229.99:5040 {grpc_message:"Too many pings", grpc_status:14, created_time:"2024-10-07T15:40:46.164225255+02:00"}"
>
I've set my gRPC server options as follows:
("grpc.http2.max_pings_without_data", 0),
# Is it permissible to send keepalive pings from the client without
# any outstanding streams. More explanation here:
# https://github.com/adap/flower/pull/2197
("grpc.keepalive_permit_without_calls", 0),
but that did not help.
Later, I added two more options:
("grpc.http2.max_ping_strikes", 0),
("grpc.http2.min_ping_interval_without_data_ms", 10)
That allowed me to escape the initial error, but then I get:
raise GrpcBridgeClosed()
flwr.server.superlink.fleet.grpc_bidi.grpc_bridge.GrpcBridgeClosed
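For context, these are standard gRPC channel arguments; below is a minimal sketch of how such options are attached to a plain grpcio server (illustrative only, not the Flower-internal fleet server setup; the port is taken from the error message above).

```python
# Minimal sketch: attaching keepalive/ping channel options to a plain grpcio
# server (illustrative only; Flower configures its own gRPC server internally)
from concurrent import futures

import grpc

GRPC_OPTIONS = [
    ("grpc.http2.max_pings_without_data", 0),
    ("grpc.keepalive_permit_without_calls", 0),
    ("grpc.http2.max_ping_strikes", 0),
    ("grpc.http2.min_ping_interval_without_data_ms", 10),
]

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=10),
    options=GRPC_OPTIONS,
)
server.add_insecure_port("0.0.0.0:5040")  # port from the report; illustrative
server.start()
```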
Steps/Code to Reproduce
I use the basic FedAvg strategy, except that I send an additional round of evaluation to each client during aggregate_fit:
evaluate_res = client_proxy.evaluate(ins=evaluate_ins, timeout=None, group_id=rnd)
Sometimes, when I rerun the clients and server, the error happens after one successful round, so it does not always happen at the same moment.
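A minimal sketch of that kind of strategy (the class name and the empty config dict are illustrative, not taken from the original code):

```python
# Sketch of a FedAvg variant that runs an extra per-client evaluation round
# during aggregate_fit, as described in the report (assumes the flwr 1.x
# Strategy API with group_id support)
from flwr.common import EvaluateIns
from flwr.server.strategy import FedAvg


class EvalDuringFitFedAvg(FedAvg):
    def aggregate_fit(self, server_round, results, failures):
        # Aggregate as usual first
        aggregated_parameters, metrics = super().aggregate_fit(
            server_round, results, failures
        )
        # Then ask every client that returned a FitRes to evaluate its own update
        for client_proxy, fit_res in results:
            evaluate_ins = EvaluateIns(parameters=fit_res.parameters, config={})
            client_proxy.evaluate(ins=evaluate_ins, timeout=None, group_id=server_round)
        return aggregated_parameters, metrics
```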
Expected Results
Client stays alive
Actual Results
Client disconnects
Did you come to a solution?
Hello, I am still encountering this problem, and it occurs quite randomly. A few things have helped me reduce the frequency of this issue:
- Run the server and clients on the same machine, so you can use "localhost" as the server address (see the sketch below).
- If you're using loops that send messages to clients, try replacing the loop with straight-line, non-loop code.
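For example, a minimal single-machine setup (the port is illustrative, not from the original setup) looks roughly like this:

```python
# Server process: bind on all interfaces (the port is illustrative)
import flwr as fl

fl.server.start_server(server_address="0.0.0.0:8005")

# Client process on the same machine: connect via "localhost", e.g.
# fl.client.start_client(server_address="localhost:8005", client=trainer.to_client())
```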
Hi @ajulyav,
Thanks for raising this. Are you still experiencing this issue?
Hi @ajulyav,
Could you please paste the code that you used to produce this error?
@WilliamLindskog Hello! I'll try to send you the code, though it's not really code-specific. What I've noticed is that even with the same code/dataset, running it on multiple nodes can cause issues after a few rounds (or sometimes just one round). However, running it on a single node seems to be more stable with some tricks.
I'll try to come back with more details, including the code and sbatch script. Thank you!
So, yesterday I ran a simple experiment: 4 clients, 1 server.
I got this error on only 1 client after 4 global rounds:
File "main_cnn.py", line 77, in <module>
Main(args)
File "main_cnn.py", line 67, in Main
fl.client.start_client(server_address="localhost:8005", client=trainer.to_client())
File "/home/user/.conda/envs/Flower/lib/python3.8/site-packages/flwr/client/app.py", line 157, in start_client
_start_client_internal(
File "/home/user/.conda/envs/Flower/lib/python3.8/site-packages/flwr/client/app.py", line 333, in _start_client_internal
message = receive()
File "/home/user/.conda/envs/Flower/lib/python3.8/site-packages/flwr/client/grpc_client/connection.py", line 144, in receive
proto = next(server_message_iterator)
File "/home/user/.conda/envs/Flower/lib/python3.8/site-packages/grpc/_channel.py", line 543, in __next__
return self._next()
File "/home/user/.conda/envs/Flower/lib/python3.8/site-packages/grpc/_channel.py", line 952, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Too many pings"
debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:8005 {created_time:"2025-03-11T18:13:54.91980787+01:00", grpc_status:14, grpc_message:"Too many pings"}"
>
My training code is quite simple and identical across all clients, yet the other clients did not hit the same issue:
def train_epoch(self):
    train_loss = 0.0
    train_ph_acc = 0.0
    self.model.train()
    for bidx, batch in enumerate(self.train_loader):
        self.optimizer.zero_grad()
        batch_input, batch_ph = batch
        batch_input, batch_ph = batch_input.cuda(), batch_ph.cuda()
        # Mixed-precision forward pass and loss computation
        with torch.cuda.amp.autocast(enabled=self.use_amp):
            pred_ph = self.model(batch_input, None)
            loss = self.loss_func(batch_ph, pred_ph)
        train_loss += loss.item()
        # Scaled backward pass and optimizer step (AMP)
        self.scaler.scale(loss).backward()
        self.scaler.step(self.optimizer)
        self.scaler.update()
        torch.cuda.synchronize()
        # some code for logging and computing metrics on the client side
    return train_loss, train_ph_acc
So, I assume that the problem is not in the user code.
Hi @ajulyav,
From what I can see, this is code based on a no longer supported version of Flower. Have you tested newer examples like: https://github.com/adap/flower/tree/main/examples/quickstart-pytorch?
You can reproduce your setup by changing the number of supernodes in pyproject.toml:
options.num-supernodes = 4
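For reference, the client side of those newer examples is structured roughly as follows (a sketch with placeholder logic; exact signatures depend on the flwr version, so check quickstart-pytorch for the version-accurate code).

```python
# Rough sketch of the newer flwr ClientApp structure (placeholder logic only)
from flwr.client import ClientApp, NumPyClient
from flwr.common import Context


class FlowerClient(NumPyClient):
    def fit(self, parameters, config):
        # Placeholder: train locally, then return updated weights,
        # number of examples, and metrics
        return parameters, 1, {}

    def evaluate(self, parameters, config):
        # Placeholder: evaluate locally, then return loss,
        # number of examples, and metrics
        return 0.0, 1, {}


def client_fn(context: Context):
    return FlowerClient().to_client()


# Run with `flwr run` against the federation defined in pyproject.toml
app = ClientApp(client_fn=client_fn)
```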
Hi @ajulyav,
Just checking in here. Were you able to run the same experiment with the new flwr code?
Best regards William
Hi @ajulyav,
I am not able to reproduce the error, so please let me know if there's something I should take a look at. If there's no response, I will close this issue by the end of this week.
Best regards William
Closing this, as we are not able to reproduce the error; it also seems to have been resolved in newer versions of flwr.