Communication stuck randomly
I modified the communication in distributed FL to make it parallel. The main change is in the following section of FedAvgServerManager.py:
for receiver_id in range(1, self.size):
    # self.send_message_sync_model_to_client(receiver_id, global_model_params, client_indexes[receiver_id - 1])
    multi_processing = mp.Process(
        target=self.send_message_sync_model_to_client,
        args=(receiver_id, global_model_params, client_indexes[receiver_id - 1]),
    )
    multi_processing.start()
I'm using the multiprocessing module so that each send runs in a separate process. However, I have run into a few strange things. (I'm using the gRPC backend, with the number of communication rounds set to 5.)
- When the server and the simulated client run on the same machine, it works as expected.
- When I run the simulated client on another PC, it gets stuck randomly: sometimes it works, sometimes it does not.
- When I use a Raspberry Pi, it gets stuck every time, mostly in communication round 3 or 4.
- When I use a Jetson Nano, it gets stuck in the first round.
- When I use them together, it gets stuck at total communication round 3 or 4 (e.g. the first round goes to the simulated client, the second to the Pi, the third to the Jetson, and then it gets stuck).
I used Wireshark to capture the TCP traffic and found something interesting. As far as I know, gRPC uses persistent connections, which means the server should open and keep one connection to each client. But in the dump files, every time before it gets stuck, the server sends a TCP [RST] or [ACK, RST] segment to the client, tearing down the connection, yet the client reports no error. I cannot figure out why. Interestingly, when the server communicates with the simulated client on the same machine, the server sends an [RST] after every communication round, which I don't think is correct. All of these tests were run on the same local area network.
Here are some of my TCP dump files: tcpdump.zip
- ubuntu-ubuntu: server and simulated client on the same machine (works fine, but the traffic looks wrong)
- ubuntu-helix1: server and another PC running the simulated client
- ubuntu-raspberrypi: server and Raspberry Pi
The server and the clients give no error messages.
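For comparison, a thread-based variant of the parallel send loop can be sketched as follows. This is only a sketch under assumptions, not a confirmed fix: `send_to_all` and `send_fn` are hypothetical names introduced here, with `send_fn` standing in for `self.send_message_sync_model_to_client`. Threads stay inside the server process and reuse its existing gRPC connections, whereas `mp.Process` may fork the server; gRPC's Python documentation describes caveats around forking, but whether that is related to the resets seen in the dumps has not been established. Unlike the fire-and-forget `start()` loop, this version also waits for every send to finish and collects any exception per receiver.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def send_to_all(send_fn, global_model_params, client_indexes, size):
    """Call send_fn once per receiver id (1 .. size - 1) from worker threads.

    send_fn is a hypothetical stand-in for the server manager's
    send_message_sync_model_to_client method.
    Returns a dict mapping receiver_id -> exception for sends that raised.
    """
    failures = {}
    with ThreadPoolExecutor(max_workers=max(1, size - 1)) as pool:
        # Submit one send per receiver; remember which future belongs to whom.
        futures = {
            pool.submit(
                send_fn,
                receiver_id,
                global_model_params,
                client_indexes[receiver_id - 1],
            ): receiver_id
            for receiver_id in range(1, size)
        }
        # Wait for all sends and record any error instead of losing it.
        for future in as_completed(futures):
            receiver_id = futures[future]
            try:
                future.result()
            except Exception as exc:
                failures[receiver_id] = exc
    return failures
```

Inside the server manager this would be called roughly as `send_to_all(self.send_message_sync_model_to_client, global_model_params, client_indexes, self.size)`; a non-empty return value points at the receivers whose send raised.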
Got it. I will try to reproduce this issue and find out the solution.
@nzmkNARUTO Hi, we've iterated a lot in the past few months, so I am not sure whether this issue still exists. Could you please help reproduce it with our latest version? Here are some new examples and applications:
Examples of FedML Python Library: FedML/python/examples
Examples of FedML IoT: FedML/iot
Examples of FedML Android: FedML/android
Advanced Applications developed on FedML: https://github.com/FedML-AI/FedML/tree/master/python/app