
Bug in "MQTT+s3" module: mismatch topics subscribed by clients & failed to train by partial clients per round

Open royukira opened this issue 2 years ago • 1 comments

Hi, I found some bugs in the MQTT+s3 module when I tried to train with partial clients per round.

Bug 1: Object of type int64 is not JSON serializable

Traceback (most recent call last):

  File "/home/fedml/FedML/python/examples/cross_silo/mqtt_s3_fedavg_mnist_lr_example/one_line/../../../../fedml/core/distributed/communication/mqtt_s3/mqtt_s3_multi_clients_comm_manager.py", line 223, in _on_message_impl
    self._notify(payload_obj)
  File "/home/fedml/FedML/python/examples/cross_silo/mqtt_s3_fedavg_mnist_lr_example/one_line/../../../../fedml/core/distributed/communication/mqtt_s3/mqtt_s3_multi_clients_comm_manager.py", line 185, in _notify
    observer.receive_message(msg_type, msg_params)
  File "/home/fedml/FedML/python/examples/cross_silo/mqtt_s3_fedavg_mnist_lr_example/one_line/../../../../fedml/core/distributed/server/server_manager.py", line 111, in receive_message
    handler_callback_func(msg_params)
  File "/home/fedml/FedML/python/examples/cross_silo/mqtt_s3_fedavg_mnist_lr_example/one_line/../../../../fedml/cross_silo/horizontal/fedml_server_manager.py", line 130, in handle_message_client_status_update
    self.send_init_msg()
  File "/home/fedml/FedML/python/examples/cross_silo/mqtt_s3_fedavg_mnist_lr_example/one_line/../../../../fedml/cross_silo/horizontal/fedml_server_manager.py", line 70, in send_init_msg
    self.send_message_init_config(
  File "/home/fedml/FedML/python/examples/cross_silo/mqtt_s3_fedavg_mnist_lr_example/one_line/../../../../fedml/cross_silo/horizontal/fedml_server_manager.py", line 225, in send_message_init_config
    self.send_message(message)
  File "/home/fedml/FedML/python/examples/cross_silo/mqtt_s3_fedavg_mnist_lr_example/one_line/../../../../fedml/core/distributed/server/server_manager.py", line 114, in send_message
    self.com_manager.send_message(message)
  File "/home/fedml/FedML/python/examples/cross_silo/mqtt_s3_fedavg_mnist_lr_example/one_line/../../../../fedml/core/distributed/communication/mqtt_s3/mqtt_s3_multi_clients_comm_manager.py", line 284, in send_message
    self._client.publish(topic, payload=json.dumps(payload))
  File "/home/fedml/anaconda3/envs/fedml/lib/python3.8/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/home/fedml/anaconda3/envs/fedml/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/home/fedml/anaconda3/envs/fedml/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/home/fedml/anaconda3/envs/fedml/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '

The first bug is caused by aggregator.client_selection(), which returns the selected indices as a numpy.ndarray. json.dumps() cannot serialize numpy.int64 values, so publishing the init message fails.
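One possible workaround, sketched under assumptions rather than taken from the FedML code base, is to pass a `default` hook to json.dumps() so that NumPy values are converted to built-in types at serialization time. The helper names `_to_builtin` and `dumps_payload` are hypothetical.

```python
import json


def _to_builtin(obj):
    """json.dumps `default` hook: convert NumPy-style values to built-in types."""
    # NumPy arrays and NumPy scalars (e.g. numpy.int64) both expose .tolist(),
    # which returns plain Python lists, ints, or floats.
    if hasattr(obj, "tolist"):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def dumps_payload(payload):
    """Serialize a message payload, tolerating NumPy values inside it."""
    return json.dumps(payload, default=_to_builtin)
```

An alternative would be to convert the selected indices once at the source, for example by calling .tolist() on the array returned by client selection, so that every later consumer sees plain Python ints.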

Bug 2: Clients subscribe to mismatched topics

When I try to train cnn+mnist with partial clients per round (I set client_num_per_round: 2 before launching 4 clients/ranks for training), the server gets stuck after publishing init messages to the chosen clients.

So I checked the code in the _on_connect_impl function and found that all clients subscribe to fedml_0_0_1 rather than their own topics fedml_0_0_{client_id}. I have no idea why all clients subscribe to fedml_0_0_1.

Back to my issue: the server chose Client 3 and Client 4 to participate in the first training round and sent messages to the topics "fedml_0_0_3" and "fedml_0_0_4" respectively. But they obviously cannot receive any message, as they only subscribe to the irrelevant topic "fedml_0_0_1". So the chosen clients do nothing and the server gets stuck.

############ Server ############
[FedML-Server(0) @device-id-0] [Thu, 26 May 2022 11:54:51] [INFO] [mqtt_s3_multi_clients_comm_manager.py:249:send_message] mqtt_s3.send_message: msg topic = fedml_0_0_3
[FedML-Server(0) @device-id-0] [Thu, 26 May 2022 11:54:51] [INFO] [mqtt_s3_multi_clients_comm_manager.py:256:send_message] mqtt_s3.send_message: S3+MQTT msg sent, s3 message key = fedml_0_0_3_server
[FedML-Server(0) @device-id-0] [Thu, 26 May 2022 11:54:51] [INFO] [mqtt_s3_multi_clients_comm_manager.py:268:send_message] mqtt_s3.send_message: to python client.
{
    "msg_type": 1,
    "sender": 0,
    "receiver": 3,
    "model_params": "fedml_0_0_3_server",
    "client_idx": "993",
    "client_os": "PythonClient",
    "model_params_url": "https://s3.ai-team.dev/fedml/fedml_0_0_3_server?AWSAccessKeyId="
}

[FedML-Server(0) @device-id-0] [Thu, 26 May 2022 11:54:51] [INFO] [mqtt_s3_multi_clients_comm_manager.py:249:send_message] mqtt_s3.send_message: msg topic = fedml_0_0_4
[FedML-Server(0) @device-id-0] [Thu, 26 May 2022 11:54:51] [INFO] [mqtt_s3_multi_clients_comm_manager.py:256:send_message] mqtt_s3.send_message: S3+MQTT msg sent, s3 message key = fedml_0_0_4_server
[FedML-Server(0) @device-id-0] [Thu, 26 May 2022 11:54:51] [INFO] [mqtt_s3_multi_clients_comm_manager.py:268:send_message] mqtt_s3.send_message: to python client.
{
    "msg_type": 1,
    "sender": 0,
    "receiver": 4,
    "model_params": "fedml_0_0_4_server",
    "client_idx": "859",
    "client_os": "PythonClient",
    "model_params_url": "https://s3.ai-team.dev/fedml/fedml_0_0_4_server?AWSAccessKeyId="
}

############ Client 3 ############
[FedML-Client(3) @device-id-1] [Thu, 26 May 2022 11:53:03] [INFO] [mqtt_s3_multi_clients_comm_manager.py:243:send_message] mqtt_s3.send_message: starting...
[FedML-Client(3) @device-id-1] [Thu, 26 May 2022 11:53:03] [INFO] [mqtt_s3_multi_clients_comm_manager.py:327:send_message] mqtt_s3.send_message: MQTT msg sent
[FedML-Client(3) @device-id-1] [Thu, 26 May 2022 11:53:03] [INFO] [mqtt_s3_multi_clients_comm_manager.py:144:_on_connect_impl] mqtt_s3.on_connect: client subscribes real_topic = fedml_0_0_1, mid = 1, result = 0
[FedML-Client(3) @device-id-1] [Thu, 26 May 2022 11:53:03] [INFO] [mqtt_s3_multi_clients_comm_manager.py:164:_on_subscribe] mqtt_s3.onSubscribe: mid = 1


############ Client 4 ############
[FedML-Client(4) @device-id-1] [Thu, 26 May 2022 11:54:47] [INFO] [mqtt_s3_multi_clients_comm_manager.py:243:send_message] mqtt_s3.send_message: starting...
[FedML-Client(4) @device-id-1] [Thu, 26 May 2022 11:54:47] [INFO] [mqtt_s3_multi_clients_comm_manager.py:327:send_message] mqtt_s3.send_message: MQTT msg sent
[FedML-Client(4) @device-id-1] [Thu, 26 May 2022 11:54:47] [INFO] [mqtt_s3_multi_clients_comm_manager.py:144:_on_connect_impl] mqtt_s3.on_connect: client subscribes real_topic = fedml_0_0_1, mid = 1, result = 0
[FedML-Client(4) @device-id-1] [Thu, 26 May 2022 11:54:47] [INFO] [mqtt_s3_multi_clients_comm_manager.py:164:_on_subscribe] mqtt_s3.onSubscribe: mid = 1

Furthermore, although "all clients/ranks participating" runs without any exceptions, I think the above problem still affects model training, because all clients receive the same message and use the same client_idx sub-dataset. In other words, all clients train on the same subset of the dataset, which comes from one data holder. The remaining chosen sub-datasets from the other data holders are not used.
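To make the mismatch concrete, here is a minimal sketch that compares the topics the server publishes to with the topics the clients subscribed to. It only models the topic strings seen in the logs above; `client_topic`, `undeliverable`, and the `prefix` parameter are hypothetical names, and I am assuming the per-client topic is simply the shared prefix followed by the client id.

```python
def client_topic(client_id, prefix="fedml_0_0"):
    """Build the per-client topic, assuming the layout <prefix>_<client_id>."""
    return f"{prefix}_{client_id}"


def undeliverable(published_topics, subscriptions):
    """Return the published topics that no client is subscribed to."""
    subscribed = set(subscriptions.values())
    return sorted(t for t in published_topics if t not in subscribed)


# Topics the server publishes to for the chosen clients 3 and 4.
published = [client_topic(3), client_topic(4)]

# Reported behaviour: both clients subscribed to fedml_0_0_1.
reported = {3: "fedml_0_0_1", 4: "fedml_0_0_1"}

# Expected behaviour: each client subscribes to its own topic.
expected = {cid: client_topic(cid) for cid in (3, 4)}

print(undeliverable(published, reported))  # ['fedml_0_0_3', 'fedml_0_0_4']
print(undeliverable(published, expected))  # []
```

With the reported subscriptions, both init messages go to topics nobody listens on, which matches the server getting stuck.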

Bug 3: Lack of a communication/sync mechanism in the finishing phase

According to the code, in the final round the server calls self.finish() immediately after self.send_message_sync_model_to_client, without any check or notification. This is fine when all clients/ranks participate, as their self.round_idx values stay in sync, so the server and clients can disconnect and stop correctly.

However, with partial clients participating, it leads to two errors: (1) the server stops while the chosen clients are still processing its message; (2) the clients cannot exit correctly, because their self.round_idx values are out of sync and the server does not notify them that "it's time to stop".

I suggest that the server and clients have a communication/sync mechanism for the finishing phase.
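As a rough illustration of what such a mechanism could look like (a sketch only, not FedML's actual protocol), the server could send an explicit finish message to every client, including those not chosen in the last round, and call self.finish() only after all of them have acknowledged it. The class `FinishCoordinator` and the message type `MSG_TYPE_FINISH` are hypothetical.

```python
MSG_TYPE_FINISH = "finish"  # hypothetical message type, not a FedML constant


class FinishCoordinator:
    """Track which clients still have to acknowledge the finish message."""

    def __init__(self, client_ids):
        self._pending = set(client_ids)

    def finish_messages(self, sender=0):
        """Build one finish message per client that has not acknowledged yet."""
        return [
            {"msg_type": MSG_TYPE_FINISH, "sender": sender, "receiver": cid}
            for cid in sorted(self._pending)
        ]

    def on_ack(self, client_id):
        """Record an acknowledgement; return True once every client has acked."""
        self._pending.discard(client_id)
        return not self._pending


# All 4 launched clients are notified, not only the 2 chosen in the last round.
coordinator = FinishCoordinator([1, 2, 3, 4])
messages = coordinator.finish_messages()

for cid in (3, 4, 1):
    assert coordinator.on_ack(cid) is False  # still waiting for client 2
server_may_finish = coordinator.on_ack(2)    # last ack arrives
```

In a real deployment the server would probably also need a timeout, so that one crashed client cannot keep it waiting forever.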

royukira, May 26 '22 07:05

@royukira Thanks for your feedback. We've iterated on our latest version according to your feedback. Please check whether the latest version still has these issues.

chaoyanghe, Aug 19 '22 16:08

Hello @royukira, can you please try this on the latest dev branch?

fedml-dimitris, Oct 25 '23 01:10