[customer requirement] test the compatibility issues on Windows

Open chaoyanghe opened this issue 2 years ago • 13 comments

chaoyanghe avatar May 02 '22 06:05 chaoyanghe

error with test/fedml_user_code/cross_silo example

OS: Windows 10, version 21H2 (internal version 19044.1645); Python version: 3.7; package version: fedml 0.7.12

When running the run_server.sh and run_client.sh scripts separately according to test/fedml_user_code/cross_silo/README.md, the program fails to run properly.

The output of the server:

ThinkPad@LAPTOP-M816KBBA MINGW64 /g/python projects/FedML/test/fedml_user_code/cross_silo (master)
$ bash run_server.sh
......
[mqtt_s3_multi_clients_comm_manager.py:187:_on_message_impl] --------------------------
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:06:08] [INFO] [mqtt_s3_multi_clients_comm_manager.py:218:_on_message_impl] mqtt_s3.on_message: not use s3 pack
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:06:08] [INFO] [mqtt_s3_multi_clients_comm_manager.py:182:_notify] mqtt_s3.notify: msg type = 5
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:06:08] [INFO] [server_manager.py:108:receive_message] receive_message. rank_id = 0, msg_type = 5.
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:06:08] [INFO] [fedml_server_manager.py:113:handle_message_client_status_update] sender_id = 2, all_client_is_online = True
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:06:08] [INFO] [fedml_aggregator.py:119:data_silo_selection] client_num_in_total = 1000, client_num_per_round = 2
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:06:08] [INFO] [mqtt_s3_multi_clients_comm_manager.py:240:send_message] mqtt_s3.send_message: starting...
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:06:08] [INFO] [mqtt_s3_multi_clients_comm_manager.py:246:send_message] mqtt_s3.send_message: msg topic = fedml_0_0_1
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:06:08] [INFO] [mqtt_s3_multi_clients_comm_manager.py:255:send_message] mqtt_s3.send_message: S3+MQTT msg sent, s3 message key = fedml_0_0_1_32f954bc-a668-4fa7-89dd-1b52d6dc8207
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:06:08] [INFO] [mqtt_s3_multi_clients_comm_manager.py:265:send_message] mqtt_s3.send_message: to python client.
Traceback (most recent call last):
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 224, in _on_message
    self._on_message_impl(client, userdata, msg)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 220, in _on_message_impl
    self._notify(payload_obj)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 184, in _notify
    observer.receive_message(msg_type, msg_params)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\server\server_manager.py", line 111, in receive_message
    handler_callback_func(msg_params)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\cross_silo\horizontal\fedml_server_manager.py", line 118, in handle_message_client_status_update
    self.send_init_msg()
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\cross_silo\horizontal\fedml_server_manager.py", line 68, in send_init_msg
    data_silo_index_list[client_idx_in_this_round],
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\cross_silo\horizontal\fedml_server_manager.py", line 205, in send_message_init_config
    self.send_message(message)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\server\server_manager.py", line 114, in send_message
    self.com_manager.send_message(message)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 267, in send_message
    message_key, model_params_obj
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\remote_storage.py", line 48, in write_model
    ACL="public-read",
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\client.py", line 415, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\client.py", line 732, in _make_api_call
    operation_model, request_dict, request_context)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\client.py", line 751, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\endpoint.py", line 107, in make_request
    return self._send_request(request_dict, operation_model)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\endpoint.py", line 180, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\endpoint.py", line 121, in create_request
    operation_name=operation_model.name)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\hooks.py", line 358, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\hooks.py", line 229, in emit
    return self._emit(event_name, kwargs)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\hooks.py", line 212, in _emit
    response = handler(**kwargs)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\signers.py", line 95, in handler
    return self.sign(operation_name, request)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\signers.py", line 167, in sign
    auth.add_auth(request)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\botocore\auth.py", line 401, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:06:09] [INFO] [mqtt_s3_multi_clients_comm_manager.py:159:_on_disconnect] mqtt_s3.on_disconnect: disconnection returned result 0, user data None
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:06:09] [INFO] [mqtt_s3_status_manager.py:80:_on_disconnect] mqtt_s3.on_disconnect: disconnection returned result 0, user data None
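The NoCredentialsError above is raised by botocore when it cannot find any AWS credentials while signing the S3 upload. As a rough diagnostic only, the sketch below checks the two most common static sources (environment variables and the shared credentials file). The function name `find_aws_credential_hints` is hypothetical, and other credential providers that botocore supports are deliberately not covered.

```python
import os
from pathlib import Path


def find_aws_credential_hints(env=None, home=None):
    """List the common places where static AWS credentials were found.

    Only environment variables and the shared credentials file are
    checked; other providers (config file, SSO, instance roles, ...)
    are out of scope for this sketch.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    hints = []
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        hints.append("environment variables")
    # AWS_SHARED_CREDENTIALS_FILE overrides the default ~/.aws/credentials.
    cred_file = Path(env.get("AWS_SHARED_CREDENTIALS_FILE", home / ".aws" / "credentials"))
    if cred_file.is_file():
        hints.append(str(cred_file))
    return hints
```

An empty list on the machine that runs run_server.sh would be consistent with the error in the log; it does not by itself prove that no other credential source is configured.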

The output of client 1:


ThinkPad@LAPTOP-M816KBBA MINGW64 /g/python projects/FedML/test/fedml_user_code/cross_silo (master)
$ bash run_client.sh 1

......
[FedML-Client(1) @device-id-1] [Mon, 02 May 2022 15:00:00] [INFO] [client_manager.py:115:send_message] Sending message (type 5) to server
[FedML-Client(1) @device-id-1] [Mon, 02 May 2022 15:00:00] [INFO] [mqtt_s3_multi_clients_comm_manager.py:240:send_message] mqtt_s3.send_message: starting...
[FedML-Client(1) @device-id-1] [Mon, 02 May 2022 15:00:00] [INFO] [mqtt_s3_multi_clients_comm_manager.py:322:send_message] mqtt_s3.send_message: MQTT msg sent
Traceback (most recent call last):
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 150, in _on_connect
    self._on_connect_impl(client, userdata, flags, rc)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 141, in _on_connect_impl
    self._notify_connection_ready()
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 176, in _notify_connection_ready
    observer.receive_message(msg_type, msg_params)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\client\client_manager.py", line 103, in receive_message
    handler_callback_func(msg_params)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\cross_silo\horizontal\fedml_client_manager.py", line 62, in handle_message_connection_ready
    self.sys_stats_process.start()
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'botocore.client.S3'>: attribute lookup S3 on botocore.client failed
[FedML-Client(1) @device-id-1] [Mon, 02 May 2022 15:00:01] [INFO] [mqtt_s3_multi_clients_comm_manager.py:159:_on_disconnect] mqtt_s3.on_disconnect: disconnection returned result 0, user data None
[FedML-Client(1) @device-id-1] [Mon, 02 May 2022 15:00:01] [INFO] [mqtt_s3_status_manager.py:80:_on_disconnect] mqtt_s3.on_disconnect: disconnection returned result 0, user data None

ThinkPad@LAPTOP-M816KBBA MINGW64 /g/python projects/FedML/test/fedml_user_code/cross_silo (master)
$ Traceback (most recent call last):

  File "<string>", line 1, in <module>
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
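The client 1 traceback is consistent with how multiprocessing works on Windows: the `spawn` start method pickles the process object in `Process.start()`, so everything reachable from it must be picklable. The EOFError in the child would then be a follow-on effect of the parent failing midway through pickling. Below is a minimal sketch of one common workaround, which keeps the unpicklable handle out of the pickled state and recreates it lazily. `FakeS3Client` and `StatsReporter` are hypothetical stand-ins, not FedML classes, and this is not necessarily how FedML resolved it.

```python
import pickle
import threading


class FakeS3Client:
    """Hypothetical stand-in for an unpicklable client (it holds a lock)."""

    def __init__(self):
        self._lock = threading.Lock()  # thread locks cannot be pickled


class StatsReporter:
    """Hypothetical object that would be handed to a child process."""

    def __init__(self):
        self.client = FakeS3Client()

    def __getstate__(self):
        # Drop the unpicklable client before the object is pickled for spawn.
        state = self.__dict__.copy()
        state["client"] = None
        return state

    def get_client(self):
        # Recreate the client lazily in whichever process needs it.
        if self.client is None:
            self.client = FakeS3Client()
        return self.client


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, as spawn requires."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
```

Calling `can_pickle` on the object passed to `multiprocessing.Process` before `start()` is a cheap way to reproduce this class of failure on Linux or macOS, where the default start method on Python 3.7 may hide it.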


The output of client 2:

ThinkPad@LAPTOP-M816KBBA MINGW64 /g/python projects/FedML/test/fedml_user_code/cross_silo (master)
$ bash run_client.sh 2

......
[FedML-Client(2) @device-id-1] [Mon, 02 May 2022 14:56:59] [INFO] [device.py:35:get_device] device = cpu
[FedML-Client(2) @device-id-1] [Mon, 02 May 2022 14:56:59] [INFO] [data_loader.py:22:download_mnist] ../../../../data/mnist/MNIST.zip
[FedML-Client(2) @device-id-1] [Mon, 02 May 2022 14:57:16] [INFO] [data_loader.py:57:load_synthetic_data] load_data. dataset_name = mnist
Traceback (most recent call last):
  File "client/torch_client.py", line 5, in <module>
    fedml.run_cross_silo_client()
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\__init__.py", line 137, in run_cross_silo_client
    dataset, output_dim = fedml.data.load(args)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\data\data_loader.py", line 30, in load
    return load_synthetic_data(args)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\data\data_loader.py", line 71, in load_synthetic_data
    test_path=args.data_cache_dir + "/MNIST/test",
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\data\MNIST\data_loader.py", line 114, in load_partition_data_mnist
    users, groups, train_data, test_data = read_data(train_path, test_path)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\data\MNIST\data_loader.py", line 56, in read_data
    cdata = json.load(inf)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\json\__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\json\decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
MemoryError

@chaoyanghe

Nicole456 avatar May 02 '22 07:05 Nicole456

@alex-liang-kh @Nicole456 Hi Alex, we need to wrap the S3 service as an HTTPS API for our users.

chaoyanghe avatar May 02 '22 07:05 chaoyanghe

problem with test/fedml_user_code/simulation_mpi example

When running python main.py under test/fedml_user_code/simulation_mpi, the program gets stuck here for a long time without any further output:

FedML/test/fedml_user_code/simulation_mpi (master)
$ python main.py
......
################## You do not indicate gpu_util_file, will use CPU training  #################
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:21:05] [INFO] [gpu_mapping.py:17:mapping_processes_to_gpu_device_from_yaml_file] cpu
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:21:05] [INFO] [data_loader.py:22:download_mnist] ./data/mnist/MNIST.zip
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:21:11] [INFO] [data_loader.py:57:load_synthetic_data] load_data. dataset_name = mnist
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:21:53] [INFO] [data_loader.py:126:load_partition_data_mnist] loading data...
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:21:57] [INFO] [data_loader.py:144:load_partition_data_mnist] finished the loading data
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:21:58] [INFO] [model_hub.py:16:create] create_model. model_name = lr, output_dim = 10
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:21:58] [INFO] [model_hub.py:19:create] LogisticRegression + MNIST
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 15:21:58] [INFO] [FedAVGAggregator.py:112:client_sampling] client_indexes = [993 859 298 553]

@chaoyanghe

Nicole456 avatar May 02 '22 07:05 Nicole456

error with test/fedml_user_code/cross_device

When I try to run the server as described in step 4 ("start the python server at python/examples/cross_device/mqtt_s3_fedavg_mnist_lr_example/custum_data_and_model/") with the following command:

bash run_server.sh

the server reports the following error:

/FedML/test/fedml_user_code/cross_device (master)
$ bash run_server.sh

[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 16:30:45] [INFO] [__init__.py:30:init] args = {'yaml_config_file': './config/fedml_config.yaml', 'run_id': '189', 'rank': 0, 'yaml_paths': ['D:\\ProgramData\\Miniconda3\\envs\\FedML0502\\lib\\site-packages\\fedml\\config/simulation_sp/fedml_config.yaml'], 'training_type': 'simulation', 'using_mlops': False, 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': './data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 10, 'comm_round': 200, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'using_gpu': False, 'gpu_id': 0, 'backend': 'single_process', 'log_file_dir': './log', 'enable_wandb': False}
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 16:30:45] [INFO] [device.py:14:get_device] device = cpu
Traceback (most recent call last):
  File "torch_server.py", line 32, in <module>
    create_mnn_lenet5_model(args.global_model_file_path)
AttributeError: 'Arguments' object has no attribute 'global_model_file_path'

@chaoyanghe

Nicole456 avatar May 02 '22 08:05 Nicole456

@Nicole456 the latest two issues you reported are fixed. Please check out fedml==0.7.13

chaoyanghe avatar May 02 '22 08:05 chaoyanghe

problem with test/fedml_user_code/cross_device example

The device is always in the initialized state:

image

The server is stuck here and cannot continue. The output contains this error message:

[Errno 2] No such file or directory: './model_file_cache/global_model.mnn'

register_message_receive_handlers------
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:21] [INFO] [mqtt_s3_comm_manager.py:105:_on_connect_impl] mqtt_s3.on_connect: connection returned with result code:0
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:21] [INFO] [mqtt_s3_comm_manager.py:117:_on_connect_impl] mqtt_s3.on_connect: server subscribes real_topic = fedml_189_146, mid = 1, result = 0
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:21] [INFO] [mqtt_s3_comm_manager.py:145:_on_subscribe] mqtt_s3.onSubscribe: mid = 1
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [mqtt_s3_comm_manager.py:163:_on_message_impl] --------------------------
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [mqtt_s3_comm_manager.py:166:_on_message_impl] mqtt_s3.on_message: payload_obj {'client_os': 'Android', 'client_status': 'ONLINE', 'msg_type': 5, 'receiver': 0, 'sender': 146}
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [mqtt_s3_comm_manager.py:183:_on_message_impl] mqtt_s3.on_message: not use s3 pack
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [mqtt_s3_comm_manager.py:158:_notify] mqtt_s3.notify: msg type = 5
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [server_manager.py:108:receive_message] receive_message. rank_id = 0, msg_type = 5.
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [mlops_profiler_event.py:54:log_event_started] Event started, {"run_id": "189", "edge_id": 0, "event_name": "aggregator.wait-online", "event_value": "", "started_time": 1651495342}
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [fedml_server_manager.py:180:handle_message_client_status_update] sender_id = 146, all_client_is_online = True
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [fedml_aggregator.py:112:data_silo_selection] data_silo_num_in_total = 1, client_num_in_total = 1
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [fedml_server_manager.py:139:send_init_msg] client_id_list_in_this_round = [146], data_silo_index_list = [0]
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [fedml_server_manager.py:263:send_message_init_config] global_model_params = ./model_file_cache/global_model.mnn
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [mqtt_s3_comm_manager.py:205:send_message] mqtt_s3.send_message: starting...{'msg_type': 1, 'sender': 0, 'receiver': 146, 'model_params': './model_file_cache/global_model.mnn', 'client_idx': '0'}
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [mqtt_s3_comm_manager.py:212:send_message] mqtt_s3.send_message: msg topic = fedml_189_0_146
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [mqtt_s3_comm_manager.py:221:send_message] mqtt_s3.send_message: S3+MQTT msg sent, s3 message key = fedml_189_0_146_161e9da4-2ddf-4d28-a54d-f381226fab59
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [ERROR] [remote_storage.py:105:upload_file] Upload data failed. | src: ./model_file_cache/global_model.mnn | dest: fedml_189_0_146_161e9da4-2ddf-4d28-a54d-f381226fab59 | Exception: [Errno 2] No such file or directory: './model_file_cache/global_model.mnn'
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [mlops_profiler_event.py:54:log_event_started] Event started, {"run_id": "189", "edge_id": 0, "event_name": "server.wait", "event_value": "", "started_time": 1651495342}
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:42:22] [INFO] [mlops_profiler_event.py:76:log_event_ended] Event ended, {"run_id": "189", "edge_id": 0, "event_name": "aggregator.wait-online", "event_value": "", "ended_time": 165149
5342}
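In the log above, the upload of ./model_file_cache/global_model.mnn fails, yet the server keeps waiting, which would leave the device in the initialized state. The path is relative, so it is resolved against the directory the server was started from. A small sketch of a fail-fast check that could run before the server starts is shown below. `require_model_file` is a hypothetical helper, not part of FedML.

```python
from pathlib import Path


def require_model_file(path):
    """Return the resolved model path, or raise early with context."""
    model_path = Path(path)
    if not model_path.is_file():
        # Include the working directory, since relative paths depend on it.
        raise FileNotFoundError(
            f"Global model file not found: {model_path.resolve()} "
            f"(current working directory: {Path.cwd()})"
        )
    return model_path.resolve()
```

Calling `require_model_file("./model_file_cache/global_model.mnn")` before launching would show whether the file was never created or the script was started from a different directory.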

@chaoyanghe

Nicole456 avatar May 02 '22 13:05 Nicole456

problem with test/fedml_user_code/simulation_mpi

sh run_one_line_example.sh 4

Since the Windows counterpart of mpirun is mpiexec, I modified run_one_line_example.sh as follows:

#!/usr/bin/env bash

hostname > mpi_host_file

mpiexec -np 5 \
#-hostfile mpi_host_file \
python main.py --cf fedml_config.yaml

The program is stuck in the following state:

ThinkPad@LAPTOP-M816KBBA MINGW64 /g/python projects/FedML/test/fedml_user_code/simulation_mpi (master)
$ sh run_one_line_example.sh 4
Error: no executable specified.

[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:02:53] [INFO] [__init__.py:30:init] args = {'yaml_config_file': 'fedml_config.yaml', 'run_id': '0', 'rank': 0, 'yaml_paths': ['D:\\ProgramData\\Miniconda3\\envs\\FedML0502\\lib\\site-packages\\fedml\\config/simulaton_mpi/fedml_config.yaml'], 'training_type': 'simulation', 'using_mlops': False, 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': './data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 4, 'using_gpu': False, 'gpu_mapping_file': 'D:\\ProgramData\\Miniconda3\\envs\\FedML0502\\lib\\site-packages\\fedml\\config/simulaton_mpi/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr'}
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:02:53] [INFO] [gpu_mapping.py:13:mapping_processes_to_gpu_device_from_yaml_file]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:02:53] [INFO] [gpu_mapping.py:15:mapping_processes_to_gpu_device_from_yaml_file]  ################## You do not indicate gpu_util_file, will use CPU training  #################
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:02:53] [INFO] [gpu_mapping.py:17:mapping_processes_to_gpu_device_from_yaml_file] cpu
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:02:53] [INFO] [data_loader.py:22:download_mnist] ./data/mnist/MNIST.zip
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:02:59] [INFO] [data_loader.py:57:load_synthetic_data] load_data. dataset_name = mnist
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:04:01] [INFO] [data_loader.py:126:load_partition_data_mnist] loading data...
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:04:06] [INFO] [data_loader.py:144:load_partition_data_mnist] finished the loading data
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:04:07] [INFO] [model_hub.py:16:create] create_model. model_name = lr, output_dim = 10
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:04:07] [INFO] [model_hub.py:19:create] LogisticRegression + MNIST
[FedML-Server(0) @device-id-0] [Mon, 02 May 2022 20:04:07] [INFO] [FedAVGAggregator.py:112:client_sampling] client_indexes = [993 859 298 553]
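The "Error: no executable specified." line is consistent with how the shell parses the modified script. The backslash joins `mpiexec -np 5` to the next line, which starts with `#`, so the rest of that line becomes a comment and mpiexec receives no program. `python main.py --cf fedml_config.yaml` then runs as a separate, single process, which would explain why it hangs after client sampling while waiting for the other ranks. One way to sidestep shell parsing is a small Python launcher that builds the argument list explicitly. This is only a sketch: `build_mpi_command` and `launch` are hypothetical names, and the process count of `worker_num + 1` is an assumption inferred from the `-np 5` used with the argument 4.

```python
import subprocess
import sys


def build_mpi_command(worker_num, launcher="mpiexec", config="fedml_config.yaml"):
    """Build the launch command for worker_num clients plus one server."""
    process_count = worker_num + 1  # assumption: rank 0 acts as the server
    return [
        launcher, "-np", str(process_count),
        sys.executable, "main.py", "--cf", config,
    ]


def launch(worker_num):
    """Run the MPI job and return its exit code."""
    # Passing a list bypasses the shell, so there are no line-continuation
    # or comment pitfalls to worry about.
    return subprocess.call(build_mpi_command(worker_num))
```

Alternatively, deleting the commented-out `-hostfile` line from the shell script (rather than leaving it between continuation lines) should have the same effect.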

Nicole456 avatar May 02 '22 13:05 Nicole456

problem with test/fedml_user_code/simulation_mpi

For this one, please update your source code. It's fixed.

FedML-AI-admin avatar May 03 '22 00:05 FedML-AI-admin

'./model_file_cache/global_model.mnn'

For this one, please report the steps you took, one by one.

FedML-AI-admin avatar May 03 '22 01:05 FedML-AI-admin

problem with test/fedml_user_code/simulation_mpi

for this one, please update your source code. It's fixed.

I successfully ran this example by running the following command:

mpiexec -np 2 python main.py --cf fedml_config.yaml

On my Windows 10 PC, setting -np 5 causes a MemoryError, and I can't run the case via sh run_one_line_example.sh 4. P.S. I modified run_one_line_example.sh as follows:

#!/usr/bin/env bash

hostname > mpi_host_file

mpiexec -np 2 \
#--hostfile mpi_host_file \
python main.py --cf fedml_config.yaml

That is, after I modified run_one_line_example.sh, I ran sh run_one_line_example.sh and still got the result reported before.

Nicole456 avatar May 03 '22 05:05 Nicole456

@Nicole456 please summarize the remaining issues you have on Windows for all the test cases. It's a long message, and I may have missed some of the errors you reported.

chaoyanghe avatar May 03 '22 06:05 chaoyanghe

The remaining issues on Windows

test/fedml_user_code/simulation_mpi

I successfully ran this example by running the following command:

mpiexec -np 2 python main.py --cf fedml_config.yaml

On my PC, setting -np 5 causes a MemoryError, and I can't run the case via sh run_one_line_example.sh 4. P.S. I modified run_one_line_example.sh as follows:

#!/usr/bin/env bash

hostname > mpi_host_file

mpiexec -np 2 \
#--hostfile mpi_host_file \
python main.py --cf fedml_config.yaml

That is, after I modified run_one_line_example.sh, I ran sh run_one_line_example.sh and still got the result reported before: the program is stuck at the point shown below:

/python projects/FedML/test/fedml_user_code/simulation_mpi (master)
$ bash run_one_line_example.sh
Error: no executable specified.

[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:19:46] [INFO] [__init__.py:30:init] args = {'yaml_config_file': 'fedml_config.yaml', 'run_id': '0', 'rank': 0, 'yaml_paths': ['D:\\ProgramData\\Miniconda3\\envs\\FedML0502\\lib\\site-packages\\fedml\\config/simulaton_mpi/fedml_config.yaml'], 'training_type': 'simulation', 'using_mlops': False, 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': './data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 4, 'using_gpu': False, 'gpu_mapping_file': 'D:\\ProgramData\\Miniconda3\\envs\\FedML0502\\lib\\site-packages\\fedml\\config/simulaton_mpi/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr'}
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:19:46] [INFO] [gpu_mapping.py:13:mapping_processes_to_gpu_device_from_yaml_file]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:19:46] [INFO] [gpu_mapping.py:15:mapping_processes_to_gpu_device_from_yaml_file]  ################## You do not indicate gpu_util_file, will use CPU training  #################
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:19:46] [INFO] [gpu_mapping.py:17:mapping_processes_to_gpu_device_from_yaml_file] cpu
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:19:46] [INFO] [data_loader.py:22:download_mnist] ./data/mnist/MNIST.zip
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:19:52] [INFO] [data_loader.py:57:load_synthetic_data] load_data. dataset_name = mnist
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:20:33] [INFO] [data_loader.py:126:load_partition_data_mnist] loading data...
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:20:38] [INFO] [data_loader.py:144:load_partition_data_mnist] finished the loading data
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:20:39] [INFO] [model_hub.py:16:create] create_model. model_name = lr, output_dim = 10
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:20:39] [INFO] [model_hub.py:19:create] LogisticRegression + MNIST
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:20:39] [INFO] [FedAVGAggregator.py:112:client_sampling] client_indexes = [993 859 298 553]
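
A likely explanation for `Error: no executable specified.`: in bash, the backslash at the end of `mpiexec -np 2 \` joins the next line onto it, and that next line starts with `#`, so everything after it is read as a comment. `mpiexec` then receives no program to run, and `python main.py --cf fedml_config.yaml` is executed afterwards as a separate, single process, which would match the log stopping after client sampling. One way to avoid commenting out a line inside a continuation is to build the argument list in Python. This is only a sketch; `build_mpiexec_command` is a hypothetical helper and not part of FedML.

```python
import sys


def build_mpiexec_command(num_procs, config="fedml_config.yaml", hostfile=None):
    """Return the mpiexec argument list; --hostfile is added only when given."""
    cmd = ["mpiexec", "-np", str(num_procs)]
    if hostfile is not None:
        # Optional flag: simply omitted instead of being commented out.
        cmd += ["--hostfile", hostfile]
    # Use the interpreter of the active environment to run main.py.
    cmd += [sys.executable, "main.py", "--cf", config]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. With `build_mpiexec_command(2)` this should correspond to the `mpiexec -np 2 python main.py --cf fedml_config.yaml` command that ran successfully above.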

test/fedml_user_code/cross_device

  1. adb push the data to your Android device

    This part is done by manually downloading the dataset and then transferring the data to the phone via ADB

  2. Launch Android Device, and bind the Android Device to open.fedml.ai.

  3. check the device ID at open.fedml.ai, and change the edge ID at the test scripts

  4. start the python server at python/examples/cross_device/mqtt_s3_fedavg_mnist_lr_example/custum_data_and_model/

bash run_server.sh

The exception occurs in this step. After finishing the previous steps, I ran python torch_server.py --cf ./config/fedml_config.yaml --rank 0 --run_id 189, and the error is as follows:

ThinkPad@LAPTOP-M816KBBA MINGW64 /g/python projects/FedML/test/fedml_user_code/cross_device (master)
$ python torch_server.py --cf ./config/fedml_config.yaml --rank 0 --run_id 189

[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:00:42] [INFO] [__init__.py:30:init] args = {'yaml_config_file': './config/fedml_config.yaml', 'run_id': '189', 'rank': 0, 'yaml_paths': ['./config/fedml_config.yaml'], 'training_type': 'cross_device', 'using_mlops': False, 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': '../../../data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'model_file_cache_folder': './model_file_cache', 'global_model_file_path': './model_file_cache/global_model.mnn', 'federated_optimizer': 'FedAvg', 'client_id_list': '[150]', 'client_num_in_total': 1, 'client_num_per_round': 1, 'comm_round': 3, 'epochs': 1, 'batch_size': 100, 'batch_num': -1, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 1, 'using_gpu': False, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MQTT_S3_MNN', 'mqtt_config_path': 'config/mqtt_config.yaml', 's3_config_path': 'config/s3_config.yaml', 'log_file_dir': './log', 'enable_wandb': False, 'wandb_obj': '', 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr'}
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:00:42] [INFO] [device.py:43:get_device] device = cpu
Traceback (most recent call last):
  File "torch_server.py", line 32, in <module>
    create_mnn_lenet5_model(args.global_model_file_path)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\model\mobile\mnn_lenet.py", line 35, in create_mnn_lenet5_model
    F.save([predicts], mnn_file_path)
RuntimeError: Caught an unknown exception!

By debugging, I found that the error occurs on the last line of the following code:

\lib\site-packages\fedml\model\mobile\mnn_lenet.py

def create_mnn_lenet5_model(mnn_file_path):
    net = Lenet5()
    input_var = MNN.expr.placeholder([1, 1, 28, 28], MNN.expr.NCHW)
    predicts = net.forward(input_var)
    F.save([predicts], mnn_file_path)

Maybe there is a problem with the MNN library on Windows.
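
One thing that may be worth ruling out first (this is an assumption, not a confirmed cause): the config above sets `global_model_file_path` to `./model_file_cache/global_model.mnn`, and the save could fail if the `model_file_cache` folder does not exist yet. A minimal sketch that creates the parent folder before saving; `ensure_parent_dir` is a hypothetical helper, not part of FedML or MNN.

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it is missing; return the path as a string."""
    path = Path(file_path)
    # parents=True also creates intermediate folders; exist_ok=True makes repeat calls harmless.
    path.parent.mkdir(parents=True, exist_ok=True)
    return str(path)
```

It could be called in torch_server.py right before the model is written, for example `create_mnn_lenet5_model(ensure_parent_dir(args.global_model_file_path))`. If the error persists with the folder present, the problem is more likely inside MNN itself.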

test/fedml_user_code/cross_silo

I followed the steps in the cross-silo README to run the server and both clients, and got the following output:

server:

ThinkPad@LAPTOP-M816KBBA MINGW64 /g/python projects/FedML/test/fedml_user_code/cross_silo (master)
$ bash run_server.sh
......
register_message_receive_handlers------
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:13:34] [INFO] [mqtt_s3_multi_clients_comm_manager.py:117:_on_connect_impl] mqtt_s3.on_connect: connection returned with result code:0
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:13:34] [INFO] [mqtt_s3_multi_clients_comm_manager.py:130:_on_connect_impl] mqtt_s3.on_connect: subscribes real_topic = fedml_0_1, mid = 1, result = 0
Traceback (most recent call last):
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 150, in _on_connect
    self._on_connect_impl(client, userdata, flags, rc)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 123, in _on_connect_impl
    real_topic = self._topic + str(self.client_real_ids[client_rank])
IndexError: list index out of range
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:13:35] [INFO] [mqtt_s3_multi_clients_comm_manager.py:159:_on_disconnect] mqtt_s3.on_disconnect: disconnection returned result 0, user data None
[FedML-Server(0) @device-id-0] [Tue, 03 May 2022 16:13:35] [INFO] [mqtt_s3_status_manager.py:80:_on_disconnect] mqtt_s3.on_disconnect: disconnection returned result 0, user data None

client1

.....
[FedML-Client(1) @device-id-1] [Tue, 03 May 2022 16:14:08] [INFO] [mqtt_s3_multi_clients_comm_manager.py:322:send_message] mqtt_s3.send_message: MQTT msg sent
Traceback (most recent call last):
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 150, in _on_connect
    self._on_connect_impl(client, userdata, flags, rc)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 141, in _on_connect_impl
    self._notify_connection_ready()
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\communication\mqtt_s3\mqtt_s3_multi_clients_comm_manager.py", line 176, in _notify_connection_ready
    observer.receive_message(msg_type, msg_params)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\core\distributed\client\client_manager.py", line 103, in receive_message
    handler_callback_func(msg_params)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\cross_silo\horizontal\fedml_client_manager.py", line 62, in handle_message_connection_ready
    self.sys_stats_process.start()
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'botocore.client.S3'>: attribute lookup S3 on botocore.client failed
[FedML-Client(1) @device-id-1] [Tue, 03 May 2022 16:14:08] [INFO] [mqtt_s3_multi_clients_comm_manager.py:159:_on_disconnect] mqtt_s3.on_disconnect: disconnection returned result 0, user data None
[FedML-Client(1) @device-id-1] [Tue, 03 May 2022 16:14:08] [INFO] [mqtt_s3_status_manager.py:80:_on_disconnect] mqtt_s3.on_disconnect: disconnection returned result 0, user data None

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
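
The client1 traceback looks Windows-specific: `multiprocessing` on Windows starts child processes with the spawn method, which pickles the process object in `Process.start()`, and the traceback shows that this object references a botocore S3 client, which cannot be pickled. The `EOFError` in the child is probably just a consequence of the parent failing halfway through. A common workaround is to keep only plain configuration on the object and create the client inside the child. The sketch below illustrates the idea with a `threading.Lock` as a stand-in for the S3 client; the class names and `can_cross_process_boundary` are hypothetical and not FedML code.

```python
import pickle
import threading


class StatsWorkerEager:
    """Creates an unpicklable handle at construction time (stand-in for an S3 client)."""

    def __init__(self):
        self.handle = threading.Lock()  # locks cannot be pickled


class StatsWorkerLazy:
    """Keeps only plain config; the handle is created later, inside the child."""

    def __init__(self, config):
        self.config = config
        self.handle = None

    def run(self):
        # In real code this would run in the child process, after the spawn step.
        self.handle = threading.Lock()


def can_cross_process_boundary(obj):
    """Approximate the spawn step by pickling the object's attributes."""
    try:
        pickle.dumps(vars(obj))
        return True
    except (TypeError, pickle.PicklingError):
        return False
```

Here `can_cross_process_boundary(StatsWorkerEager())` reports a failure, while a fresh `StatsWorkerLazy({"bucket": "example"})` passes, because nothing unpicklable exists until `run()` is called.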

client2

Traceback (most recent call last):
  File "client/torch_client.py", line 5, in <module>
    fedml.run_cross_silo_client()
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\__init__.py", line 143, in run_cross_silo_client
    client = ClientCrossSilo(args, device, dataset, model)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\cross_silo\client.py", line 16, in __init__
    preprocessed_sampling_lists=None,
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\cross_silo\horizontal\fedml_horizontal_api.py", line 60, in FedML_Horizontal
    model_trainer,
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\cross_silo\horizontal\fedml_horizontal_api.py", line 155, in init_client
    client_manager = FedMLClientManager(args, trainer, comm, client_rank, client_num, backend)
  File "D:\ProgramData\Miniconda3\envs\FedML0502\lib\site-packages\fedml\cross_silo\horizontal\fedml_client_manager.py", line 26, in __init__
    self.client_real_id = self.client_real_ids[self.get_sender_id() - 1]
IndexError: list index out of range
[FedML-Client(2) @device-id-1] [Tue, 03 May 2022 16:14:10] [INFO] [mqtt_s3_multi_clients_comm_manager.py:159:_on_disconnect] mqtt_s3.on_disconnect: disconnection returned result 0, user data None
[FedML-Client(2) @device-id-1] [Tue, 03 May 2022 16:14:10] [INFO] [mqtt_s3_status_manager.py:80:_on_disconnect] mqtt_s3.on_disconnect: disconnection returned result 0, user data None
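
The server and client2 both fail with `IndexError: list index out of range` when indexing `client_real_ids`. Assuming `client_real_ids` is parsed from the `client_id_list` string in the config (it appears as `'[150]'` and `'[]'` in the args dumps above), the list may simply contain fewer IDs than there are client ranks. A small sketch of a check that could be run on the config values before starting; `check_client_ids` is a hypothetical helper, not part of FedML.

```python
import json


def check_client_ids(client_id_list, client_num_per_round):
    """Parse a client_id_list string such as "[1, 2]" and verify it covers every client rank."""
    ids = json.loads(client_id_list)
    if len(ids) < client_num_per_round:
        raise ValueError(
            f"client_id_list has {len(ids)} entries but "
            f"client_num_per_round is {client_num_per_round}"
        )
    return ids
```

For example, `check_client_ids("[1, 2]", 2)` returns the parsed list, while `check_client_ids("[]", 2)` raises a `ValueError` instead of failing later with an `IndexError` inside the communication manager.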

@chaoyanghe

Nicole456 avatar May 03 '22 08:05 Nicole456

@Nicole456 let's set a meeting today to do a live debugging.

chaoyanghe avatar May 03 '22 17:05 chaoyanghe

@chaoyanghe @Nicole456 Revisiting this issue. Has it been addressed?

fedml-dimitris avatar Oct 24 '23 20:10 fedml-dimitris