FedML
How can I use the distributed training function of the FedML library on an HPC cluster managed with Slurm?
Thanks to @chaoyanghe for providing such a good open-source library for federated learning researchers and learners.
I have a problem using FedML on our high-performance computing (HPC) cluster, which is managed by Slurm. Users must request compute nodes before running a task, and the Slurm manager then allocates nodes to that task. As a result, the user cannot know the hostname of each node in advance, so we cannot modify the "mpi_host_file" accordingly before submitting the task to the cluster.
This prevents me from using the library's distributed training functions, so for now I can only use the standalone version of FedML. I hope the authors (@chaoyanghe) can suggest a way to use FedML on a Slurm cluster.
@ZhangXiaoXuan2019 Thank you for the request. I will ask @alex-liang-kh to look at this issue.
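One possible workaround, offered as a sketch rather than an official FedML solution: because the hostnames are only known once the job starts, the "mpi_host_file" can be generated inside the Slurm job itself instead of being edited beforehand. Slurm exports the allocated nodes in the `SLURM_JOB_NODELIST` environment variable, and `scontrol show hostnames` expands that compact list into one hostname per line. The helper names `slurm_hostnames` and `write_hostfile` are hypothetical, and the line format that FedML's launch script expects is an assumption here, so it is left as a `template` parameter.

```python
import os
import subprocess


def slurm_hostnames():
    """Return the hostnames Slurm allocated to the current job."""
    # Compact form set by Slurm inside a job, e.g. "node[01-04]".
    nodelist = os.environ["SLURM_JOB_NODELIST"]
    # `scontrol show hostnames` expands the compact list, one name per line.
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()


def write_hostfile(hostnames, path="mpi_host_file", template="{host}"):
    """Write one line per host to `path` and return the path.

    `template` is an assumption: adjust it to whatever format your MPI
    launcher expects (for example "{host}:4" or "{host} slots=4").
    """
    with open(path, "w") as f:
        for host in hostnames:
            f.write(template.format(host=host) + "\n")
    return path


# Only runs inside a Slurm allocation, where the variable is defined.
if __name__ == "__main__" and "SLURM_JOB_NODELIST" in os.environ:
    write_hostfile(slurm_hostnames())
```

Running this script as the first step of the batch script submitted with `sbatch`, before the MPI launch command, would produce a host file that matches the nodes actually allocated to that job.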