
fedml_experiments/distributed/fedavg run_fedavg_distributed_pytorch.sh stuck

Open wangzhenzhou2020 opened this issue 4 years ago • 8 comments

I found the problem. For every worker (process): [screenshot]

Eventually, it gets stuck at: [screenshot]

wangzhenzhou2020 · Jun 07 '21

Has anybody encountered this problem?

wangzhenzhou2020 · Jun 07 '21

Please comment out or delete lines 62-64 in FedAvgClientManager.py and lines 59-63 in FedAvgServerManager.py. This will fix your problem.
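For readers hitting the same hang: in older FedML releases those lines are commonly reported to be the notification sent to a hyper-parameter sweep process, which writes to a named pipe and blocks forever when nothing is reading from it. That is an assumption about what the cited line numbers contain, not something confirmed in this thread; the function name and pipe path below are part of that assumption. A minimal sketch of why such a call hangs:

```python
# Hypothetical reconstruction, not verified against the exact FedML version in
# this issue. Opening the write end of a FIFO blocks until another process
# opens it for reading, so a worker that reaches this call with no sweep
# process listening will hang forever.
import os

def post_complete_message_to_sweep_process(args):
    pipe_path = "./tmp/fedml"
    os.makedirs("./tmp", exist_ok=True)
    if not os.path.exists(pipe_path):
        os.mkfifo(pipe_path)
    pipe_fd = os.open(pipe_path, os.O_WRONLY)  # blocks here with no reader
    with os.fdopen(pipe_fd, "w") as pipe:
        pipe.write("training is finished!\n%s\n" % str(args))
```

If that is indeed what those lines do, commenting out the call (as suggested above) or reading from the pipe in another terminal, e.g. `cat ./tmp/fedml`, unblocks the workers.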

zhang-tuo-pdf · Jun 30 '21

Thanks. I solved that problem two weeks ago, just as you said. But I've encountered another problem: at the final round, FedAvgServerManager.py calls FedAVGAggregator.py (the red arrow). [screenshot]

But in FedAvgAggregator.py, in test_on_server_for_all_clients(), the last line, logging.info(stats), sometimes fails to log the test_acc and test_loss. [screenshot]

At the last line, it should log the test_acc and test_loss. The markers here 1 ... here 6 are there to help me debug.

First, here is how the code normally ends:

[screenshot] You can see the test_acc and test_loss (in FedAvgAggregator.py) and __finish (in FedAvgServerManager.py).

But sometimes the code ends strangely: [screenshot]

Or: [screenshot]

It may end anywhere from here 1 to here 6 without logging the test_acc and test_loss.

Do you have the same problem?
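A plausible explanation, though only an assumption and not confirmed in this thread, is that once __finish is reached, mpirun tears the processes down before Python has flushed its buffered output, so the final logging.info(stats) line is lost on the way to the console or log file. A minimal workaround sketch; log_final_stats is a hypothetical helper name, not part of the FedML API:

```python
# Hypothetical sketch: flush all buffered output before finish() so the final
# metrics line survives an abrupt MPI shutdown.
import logging
import sys

def log_final_stats(stats):
    logging.info(stats)  # the dict holding test_acc and test_loss
    for handler in logging.getLogger().handlers:
        handler.flush()  # flush every attached logging handler
    sys.stdout.flush()   # flush interpreter-level stdout/stderr buffers
    sys.stderr.flush()
```

Calling something like this right before self.finish() on the server side would at least rule out buffering as the cause.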

wangzhenzhou2020 · Jul 01 '21

How do I fix this? The line numbers don't match, and which directory is this file under?

897123856413 · Nov 28 '21

The code line numbers don't match, and it's still stuck.

897123856413 · Nov 28 '21

I tried the suggestion to comment out or delete lines 62-64 in FedAvgClientManager.py and lines 59-63 in FedAvgServerManager.py, but it still doesn't work; it gets stuck in the middle.

897123856413 · Nov 29 '21

I want to ask how you managed to run run_fedavg_distributed_pytorch.sh successfully. I tried to run it on a single machine with a single GPU, but it always fails with the error "mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated."

jackdoll · Jan 03 '22

@wangzhenzhou2020 I have a question about the distributed setup: why is GPU usage so uneven? CUDA0 is the fastest and CUDA3 is the slowest, with a 10-fold difference in speed, and the slow GPUs later run out of memory.

rG223 · Feb 03 '22

@rG223 @jackdoll @897123856413

Please check our new examples at: https://github.com/FedML-AI/FedML/tree/master/python/examples

We've upgraded our library a lot in recent versions. Here is a brief introduction: https://medium.com/@FedML/fedml-ai-platform-releases-the-worlds-federated-learning-open-platform-on-public-cloud-with-an-8024e68a70b6

chaoyanghe · Aug 17 '22