
About Multi-process Distributed Training

Open · Betty-J opened this issue 1 year ago · 0 comments

Following your previous suggestions, I set up distributed training and launched it with the command shown in the screenshot below.

[Screenshot 2024-06-19 14:29:51: the launch command]

The corresponding settings in config.yaml were left unchanged. However, the following situation occurred: GPU 0 is always occupied by several extra processes. I tried debugging but found no clues. Can anyone give some suggestions? The extra memory occupied on GPU 0 eventually triggers a CUDA out-of-memory error.

[Screenshot 2024-06-19 14:35:35: GPU usage showing several extra processes on GPU 0]
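My current guess is that this happens when each worker touches CUDA before being pinned to its own device, so every rank opens a context on cuda:0. Below is a minimal per-rank setup that avoids this; it is only a sketch assuming PyTorch DistributedDataParallel with an NCCL backend and a torchrun-style launcher, not this repo's actual code:

```python
import os
import torch
import torch.distributed as dist

def setup_worker() -> torch.device:
    # LOCAL_RANK is set by torchrun / torch.distributed launchers.
    local_rank = int(os.environ["LOCAL_RANK"])
    # Pin this process to its own GPU *before* any other CUDA call;
    # otherwise the default device is cuda:0 and every rank opens an
    # extra context (several hundred MB) on GPU 0.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return torch.device("cuda", local_rank)
```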

Evidently, this phenomenon is also related to the number of GPUs.

[Screenshot 2024-06-19 14:38:23: GPU usage with a different number of GPUs]
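Another place where extra cuda:0 allocations can scale with the process count, assuming a checkpoint is restored in every worker, is `torch.load` without a `map_location`: tensors saved from rank 0 are restored onto cuda:0 by all ranks. A sketch of the usual guard follows; the checkpoint path is a placeholder:

```python
import os
import torch

local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher

# Without map_location, tensors saved on cuda:0 are loaded back onto
# cuda:0 by every rank, adding one allocation per process on GPU 0.
state = torch.load("checkpoint.pth", map_location=f"cuda:{local_rank}")
```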

Betty-J · Jun 19 '24