
fedml_experiments/distributed/fedavg: run_fedavg_distributed_pytorch.sh stuck on a single computer with one GPU

jackdoll opened this issue

I only have one computer with one GPU, and I want to run three processes on that GPU to simulate one server and two clients. So I set gpu_mapping.yaml to mapping_default: ChaoyangHe-GPU-RTX1080Ti: [3] and ran "sh run_fedavg_distributed_pytorch.sh 2 2 resnet56 homo 1 1 64 0.001 cifar10 "./../../../data/cifar10" sgd 1". But it always fails with the error "mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated." How can I solve this? Or can you tell me how to run distributed FL when I only have one computer with one GPU?

jackdoll • Jan 03 '22 13:01
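For reference, here is a minimal sketch of how a mapping like the one above could be expanded into per-process GPU assignments. It assumes (this is not confirmed in the thread) that each list under a hostname gives the number of processes to place on GPU 0, 1, and so on, so [3] puts all three processes on GPU 0. It also assumes the launch uses one server process plus one process per client. The helper names expand_gpu_mapping and check_process_count are hypothetical and not part of FedML's API, and the mapping is passed as a plain dict rather than loaded from YAML.

```python
from typing import Dict, List, Tuple


def expand_gpu_mapping(mapping: Dict[str, List[int]]) -> List[Tuple[str, int]]:
    """Return one (hostname, gpu_index) pair per process, in rank order.

    Hypothetical helper: assumes mapping[host][i] is the number of
    processes to run on GPU i of that host.
    """
    assignments: List[Tuple[str, int]] = []
    for host, procs_per_gpu in mapping.items():
        for gpu_index, num_procs in enumerate(procs_per_gpu):
            # Give num_procs consecutive ranks to this GPU on this host.
            assignments.extend([(host, gpu_index)] * num_procs)
    return assignments


def check_process_count(
    mapping: Dict[str, List[int]], num_clients: int
) -> List[Tuple[str, int]]:
    """Expand the mapping and verify it covers 1 server + num_clients clients."""
    expected = num_clients + 1  # assumption: one server rank plus one rank per client
    assignments = expand_gpu_mapping(mapping)
    if len(assignments) != expected:
        raise ValueError(
            f"mapping defines {len(assignments)} processes, expected {expected}"
        )
    return assignments


if __name__ == "__main__":
    # One machine, one GPU, three processes (server + 2 clients) on GPU 0.
    for rank, (host, gpu) in enumerate(check_process_count({"my-hostname": [3]}, 2)):
        print(f"rank {rank} -> {host}, cuda:{gpu}")
```

With {"my-hostname": [3]} and two clients, all three ranks land on GPU 0 of my-hostname. In this sketch, a mapping that yields a different process count is rejected before anything is launched.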

Did you change 'ChaoyangHe-GPU-RTX1080Ti' to your own hostname?

KOUDA-AMINE • Feb 02 '22 12:02
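The reply above suggests that the key under mapping_default must be the machine's own hostname (the `hostname` command prints it). Below is a minimal sketch of that sanity check. It assumes the key is compared against the value returned by Python's socket.gethostname(), which the thread does not confirm. The helper name check_hostname_in_mapping is hypothetical and not part of FedML.

```python
import socket
from typing import Dict, List, Optional


def check_hostname_in_mapping(
    mapping: Dict[str, List[int]], hostname: Optional[str] = None
) -> str:
    """Return the hostname if it is a key of the mapping, else raise KeyError.

    Hypothetical helper: defaults to this machine's hostname so a copied
    example key (e.g. 'ChaoyangHe-GPU-RTX1080Ti') is caught early.
    """
    host = hostname if hostname is not None else socket.gethostname()
    if host not in mapping:
        raise KeyError(
            f"hostname {host!r} not found in GPU mapping; keys are {sorted(mapping)}"
        )
    return host


if __name__ == "__main__":
    example = {"ChaoyangHe-GPU-RTX1080Ti": [3]}
    try:
        check_hostname_in_mapping(example)
    except KeyError as err:
        print(err)  # fails unless this machine happens to use that hostname
```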

@jackdoll, is this issue solved in the latest examples? https://github.com/FedML-AI/FedML/tree/master/python/examples

chaoyanghe • Aug 19 '22 17:08

Closing due to inactivity.

fedml-dimitris • Oct 24 '23 19:10