
Multi-GPU training

Open wojiaoyanmin opened this issue 11 months ago • 3 comments

Hi, multi-GPU training has been consistently failing. Could you share a screenshot of 'pip list' showing the version of each installed package, or an environment image file if one is available?

wojiaoyanmin Mar 29 '24 08:03

Hi, could you show me the error you are getting? You can also compare against the packages that I use. s1 s2

csxmli2016 Mar 29 '24 08:03

Thanks. The problem I encountered is in multi-node, multi-GPU training. Single-GPU training is fine. image

wojiaoyanmin Mar 29 '24 08:03

I am not sure about this problem. Maybe you can check whether the number of GPU IDs listed in CUDA_VISIBLE_DEVICES equals the value passed as nproc_per_node.
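
A rough sketch of that check, in case it helps. This is not part of the MARCONet code: the function names `visible_gpu_count` and `matches_nproc` are hypothetical, and it assumes CUDA_VISIBLE_DEVICES is a plain comma-separated list of GPU IDs. It only compares the two numbers; it does not diagnose other multi-node issues.

```python
import os


def visible_gpu_count(env=None):
    """Count the GPU IDs listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, since in that case
    every GPU on the node is visible and there is nothing to count.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # Ignore empty entries such as a trailing comma.
    return len([gpu_id for gpu_id in value.split(",") if gpu_id.strip()])


def matches_nproc(nproc_per_node, env=None):
    """True if the visible GPU count agrees with nproc_per_node.

    An unset CUDA_VISIBLE_DEVICES is treated as "no restriction",
    so it never counts as a mismatch here.
    """
    count = visible_gpu_count(env)
    return count is None or count == nproc_per_node


if __name__ == "__main__":
    # Hypothetical value: replace 4 with the nproc_per_node you launch with.
    nproc_per_node = 4
    print("visible GPUs:", visible_gpu_count())
    print("matches nproc_per_node:", matches_nproc(nproc_per_node))
```

For multi-node training you would run this on every node before launching, since each node has its own CUDA_VISIBLE_DEVICES.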

csxmli2016 Mar 29 '24 08:03