
Communication stuck randomly

Open nzmkNARUTO opened this issue 4 years ago • 2 comments

I modified the communication in distributed FL to make it parallel. The main change is in this section of FedAvgServerManager.py:

for receiver_id in range(1, self.size):
    # self.send_message_sync_model_to_client(receiver_id, global_model_params, client_indexes[receiver_id - 1])
    multi_processing = mp.Process(
        target=self.send_message_sync_model_to_client,
        args=(receiver_id, global_model_params, client_indexes[receiver_id - 1]),
    )
    multi_processing.start()

I'm using the multiprocessing module so that each send runs in a separate process, but I have run into a few strange things. (I'm using the GRPC backend, with the number of communication rounds set to 5.)

  1. When the server and the simulated client are on the same machine, it works as expected.
  2. When another PC runs the simulated client, it gets stuck randomly. Sometimes it works, sometimes not.
  3. When I use a Raspberry Pi, it gets stuck every time, mostly in communication round 3 or 4.
  4. When I use a Jetson Nano, it gets stuck in the first round.
  5. When I use them together, it gets stuck at total communication round 3 or 4 (e.g. first round to the simulated client, second round to the Pi, third round to the Jetson, and then it gets stuck).

I used Wireshark to capture the TCP packets and found something interesting. As far as I know, GRPC uses persistent connections, which means the server should open one connection per client and keep it open. But in the dump files, every time before it gets stuck, the server sends a TCP [RST] or [ACK, RST] to the client, tearing down the connection, while the client reports no error. I cannot figure out why. Interestingly, when the server communicates with the simulated client on the same machine, the server sends an [RST] after every communication round, which I don't think is correct. All of these tests were run on the same local area network.

Here are some of my TCP dump files: tcpdump.zip

ubuntu-ubuntu: server and simulated client on the same machine (works fine, but the traffic seems wrong)
ubuntu-helix1: server and another PC running the simulated client
ubuntu-raspberrypi: server and Raspberry Pi

The server and clients give no error messages.
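For comparison, the same fan-out to receivers can be sketched with threads instead of one forked process per receiver. This is not a confirmed fix for the hang reported here. It is only an alternative worth testing, on the assumption that forking a process that already holds gRPC state plays a role (gRPC's documentation describes fork support in Python as limited). The function `broadcast_in_threads` and the stand-in `send_message_sync_model_to_client` below are hypothetical and only mimic the call shape used in the loop above; they do not call FedML or gRPC.

```python
from concurrent.futures import ThreadPoolExecutor


def send_message_sync_model_to_client(receiver_id, global_model_params, client_index):
    # Hypothetical stand-in for the real FedML method: it only echoes its inputs.
    return (receiver_id, client_index, len(global_model_params))


def broadcast_in_threads(size, global_model_params, client_indexes, send_fn):
    """Call send_fn for receivers 1..size-1 in parallel threads and wait for all."""
    with ThreadPoolExecutor(max_workers=max(1, size - 1)) as pool:
        futures = [
            pool.submit(send_fn, receiver_id, global_model_params,
                        client_indexes[receiver_id - 1])
            for receiver_id in range(1, size)
        ]
        # result() re-raises any exception from a worker, so a failed send
        # is reported instead of disappearing silently.
        return [future.result() for future in futures]


# Example with one server (rank 0) and three receivers (ranks 1..3).
results = broadcast_in_threads(
    4, {"w": [0.0]}, [7, 8, 9], send_message_sync_model_to_client
)
```

Unlike `mp.Process(...).start()` without a later `join()`, this version waits for every send to finish and surfaces exceptions, which may help narrow down where a round stalls.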

nzmkNARUTO avatar Dec 09 '21 12:12 nzmkNARUTO

Got it. I will try to reproduce this issue and find out the solution.

fedml-alex avatar Mar 11 '22 17:03 fedml-alex

@nzmkNARUTO Hi, we've iterated a lot over the past few months, so I am not sure whether this issue still exists. Could you please try to reproduce it with our latest version? Here are some new examples and applications:

Examples of FedML Python Library: FedML/python/examples

Examples of FedML IoT: FedML/iot

Examples of FedML Android: FedML/android

Advanced applications developed on FedML: https://github.com/FedML-AI/FedML/tree/master/python/app

chaoyanghe avatar Aug 19 '22 17:08 chaoyanghe