FedPETuning

The code seems to run only on a single machine with 3 GPUs, otherwise it errors out. Can it be changed to run on a single machine with a single GPU?

springfall2018 opened this issue 1 year ago · 0 comments

Hello. I found that your code seems to run only on a single machine with 3 GPUs; otherwise it errors out. For example, on Google Colab I ran: !python fed_seed_run.py /content/drive/MyDrive/workspace/ fedavg mrpc adapter 1100 0,1,3

After training for quite a long time, it exited with a timeout error saying no client message was received. I then tried changing the command to !python fed_seed_run.py /content/drive/MyDrive/workspace/ fedavg mrpc adapter 1100 0, and that also errors, because Colab only provides a single GPU. Is there a way to make it run on a single-GPU machine? (Or where can I find a cloud computing platform with multiple GPUs?)

11-13/23:28:18|INFO |base_client.py:292|MRPC Train, Client:14, Loss:0.318, Accuracy:0.939
11-13/23:28:36|INFO |base_client.py:292|MRPC Train, Client:3, Loss:0.313, Accuracy:0.970
11-13/23:28:54|INFO |base_client.py:292|MRPC Train, Client:35, Loss:0.324, Accuracy:0.939
11-13/23:29:13|INFO |base_client.py:292|MRPC Train, Client:31, Loss:0.309, Accuracy:0.970
Traceback (most recent call last):
  File "/content/drive/MyDrive/workspace/code/FedETuning/main.py", line 20, in <module>
    main()
  File "/content/drive/MyDrive/workspace/code/FedETuning/main.py", line 16, in main
    trainer.train()
  File "/content/drive/MyDrive/workspace/code/FedETuning/trainers/FedBaseTrainer.py", line 91, in train
    self.server_manger.run()
  File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/network_manager.py", line 38, in run
    self.main_loop()
  File "/content/drive/MyDrive/workspace/code/FedETuning/trainers/BaseServer/base_server.py", line 253, in main_loop
    sender_rank, message_code, payload = self._network.recv()
  File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/network.py", line 102, in recv
    sender_rank, message_code, content = PackageProcessor.recv_package(
  File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/communicator/processor.py", line 118, in recv_package
    sender_rank, _, slices_size, message_code, data_type = recv_header(
  File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/communicator/processor.py", line 96, in recv_header
    dist.recv(buffer, src=src)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1632, in recv
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 18000000ms for recv operation to complete
sh: 0: getcwd() failed: Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
(the getcwd()/main.py errors above repeat three more times)
Exception ignored in: <function Pool.__del__ at 0x7cf9930e17e0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 271, in __del__
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 371, in put
AttributeError: 'NoneType' object has no attribute 'dumps'
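For context on the error itself: the RuntimeError comes from the server's dist.recv on the gloo backend waiting for a client message that never arrives, which usually means the client processes crashed or were never launched, rather than an intrinsic need for three physical GPUs. Below is a minimal, self-contained sketch (not the repository's actual launcher; the world size, port, and function names here are assumptions for illustration) of how a FedLab-style server plus two clients can exchange tensors over torch.distributed's gloo backend while all sharing a single cuda:0 device.

# Hypothetical sketch, NOT FedPETuning's fed_seed_run.py: one server process
# (rank 0) and two client processes (ranks 1, 2) talk over torch.distributed's
# gloo backend. gloo passes CPU tensors, so every rank can keep its model on
# the same GPU (cuda:0) on a single-GPU machine.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 3  # 1 server + 2 clients, mirroring the original 3-process run


def run(rank: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # assumed free local port
    dist.init_process_group("gloo", rank=rank, world_size=WORLD_SIZE)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    if rank == 0:
        # Server: receive one tensor from each client and average them.
        # dist.recv is the same call that timed out in the log above.
        buf = torch.zeros(4)
        total = torch.zeros(4)
        for src in range(1, WORLD_SIZE):
            dist.recv(buf, src=src)
            total += buf
        print("server received mean:", total / (WORLD_SIZE - 1))
    else:
        # Client: pretend the local update was computed on cuda:0,
        # then ship it to the server as a CPU tensor (gloo requirement).
        update = torch.ones(4, device=device) * rank
        dist.send(update.cpu(), dst=0)

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, nprocs=WORLD_SIZE)

Whether fed_seed_run.py can be told to place all of its processes on GPU 0 (for example by repeating the device index in its last argument) is something the maintainers would need to confirm; the sketch only shows that the gloo communication itself does not require one GPU per process.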

springfall2018 · Nov 16 '23 05:11