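Hello. Your code appears to run only on a single machine with 3 GPUs; otherwise it fails.
For example, I ran the following on Google Colab:
!python fed_seed_run.py /content/drive/MyDrive/workspace/ fedavg mrpc adapter 1100 0,1,3
After training for quite a long time, it exited with a timeout error because no messages were received from the clients. I then tried !python fed_seed_run.py /content/drive/MyDrive/workspace/ fedavg mrpc adapter 1100 0 instead, but that also failed.
Colab provides only one GPU. Is there a way to make the code run on a single-GPU machine? (Or do you know where I can find a multi-GPU cloud computing platform?)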
11-13/23:28:18|INFO |base_client.py:292|MRPC Train, Client:14, Loss:0.318, Accuracy:0.939
11-13/23:28:36|INFO |base_client.py:292|MRPC Train, Client:3, Loss:0.313, Accuracy:0.970
11-13/23:28:54|INFO |base_client.py:292|MRPC Train, Client:35, Loss:0.324, Accuracy:0.939
11-13/23:29:13|INFO |base_client.py:292|MRPC Train, Client:31, Loss:0.309, Accuracy:0.970
Traceback (most recent call last):
File "/content/drive/MyDrive/workspace/code/FedETuning/main.py", line 20, in <module>
main()
File "/content/drive/MyDrive/workspace/code/FedETuning/main.py", line 16, in main
trainer.train()
File "/content/drive/MyDrive/workspace/code/FedETuning/trainers/FedBaseTrainer.py", line 91, in train
self.server_manger.run()
File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/network_manager.py", line 38, in run
self.main_loop()
File "/content/drive/MyDrive/workspace/code/FedETuning/trainers/BaseServer/base_server.py", line 253, in main_loop
sender_rank, message_code, payload = self._network.recv()
File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/network.py", line 102, in recv
sender_rank, message_code, content = PackageProcessor.recv_package(
File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/communicator/processor.py", line 118, in recv_package
sender_rank, _, slices_size, message_code, data_type = recv_header(
File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/communicator/processor.py", line 96, in recv_header
dist.recv(buffer, src=src)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1632, in recv
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 18000000ms for recv operation to complete
sh: 0: getcwd() failed: Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
sh: 0: getcwd() failed: Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
sh: 0: getcwd() failed: Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
sh: 0: getcwd() failed: Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
Exception ignored in: <function Pool.__del__ at 0x7cf9930e17e0>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/pool.py", line 271, in __del__
File "/usr/lib/python3.10/multiprocessing/queues.py", line 371, in put
AttributeError: 'NoneType' object has no attribute 'dumps'
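
One workaround I am considering is to point every process at the single GPU that Colab exposes, for example by passing 0,0,0 instead of 0,1,3. The sketch below only shows that id remapping. `remap_gpu_ids` is a hypothetical helper and not part of FedETuning. It assumes the last argument of fed_seed_run.py is a comma-separated list of GPU ids, as the commands above suggest. I have not confirmed that the script accepts repeated ids, or that three processes fit in one GPU's memory.

```python
def remap_gpu_ids(requested: str, available: int) -> str:
    """Fold each requested GPU id onto a device index that exists.

    requested: comma-separated ids as passed on the command line, e.g. "0,1,3"
    available: number of visible GPUs (on a real machine this could come
               from torch.cuda.device_count(); passed in here so the
               helper stays free of dependencies)
    """
    if available < 1:
        raise ValueError("no GPU available")
    # Skip empty tokens so a trailing comma does not break parsing.
    ids = [int(tok) for tok in requested.split(",") if tok.strip()]
    # The modulo wraps out-of-range ids back into [0, available).
    return ",".join(str(i % available) for i in ids)


if __name__ == "__main__":
    # On a one-GPU Colab runtime every process lands on device 0.
    print(remap_gpu_ids("0,1,3", 1))
```

If the script does accept repeated ids, the first command above would become the same call with 0,0,0 as its last argument.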