stylegan2-pytorch
Memory usage during multi GPU training
I noticed that when training with multiple GPUs, the spawned processes sometimes also occupy memory on the first GPU, i.e. memory usage is not distributed equally.
Below is an example of training with 7 GPUs. All seven processes occupy some memory on GPU 3, so GPU 3 uses about 10 GB while each of the other GPUs uses only about 3 GB.
(Only the processes whose path contains "stylegan2-pytorch" are related to this repository. I'm not sure whether the other programs running on the GPUs are related to this issue.)
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 3 8852 C ...onda3/envs/stylegan2-pytorch/bin/python 3219MiB |
| 3 8853 C ...onda3/envs/stylegan2-pytorch/bin/python 1205MiB |
| 3 8854 C ...onda3/envs/stylegan2-pytorch/bin/python 1205MiB |
| 3 8855 C ...onda3/envs/stylegan2-pytorch/bin/python 1205MiB |
| 3 8856 C ...onda3/envs/stylegan2-pytorch/bin/python 1205MiB |
| 3 8857 C ...onda3/envs/stylegan2-pytorch/bin/python 1205MiB |
| 3 8858 C ...onda3/envs/stylegan2-pytorch/bin/python 1205MiB |
| 4 8853 C ...onda3/envs/stylegan2-pytorch/bin/python 3255MiB |
| 5 8854 C ...onda3/envs/stylegan2-pytorch/bin/python 3255MiB |
| 6 8855 C ...onda3/envs/stylegan2-pytorch/bin/python 3255MiB |
| 6 35126 C python 1191MiB |
| 6 41314 C python 1191MiB |
| 7 8856 C ...onda3/envs/stylegan2-pytorch/bin/python 3255MiB |
| 8 8857 C ...onda3/envs/stylegan2-pytorch/bin/python 3255MiB |
| 8 36933 C python 1191MiB |
| 8 41629 C python 1191MiB |
| 9 8858 C ...onda3/envs/stylegan2-pytorch/bin/python 3255MiB |
+-----------------------------------------------------------------------------+
Is this normal? The program itself runs fine, but I have hit the memory limit on that first GPU while the other GPUs remain underused.
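A likely cause (not confirmed for this repository) is that each spawned worker issues a CUDA call before selecting its own device, so every process creates an extra CUDA context (the ~1205 MiB entries above) on the default GPU. The usual mitigation is to pin each worker to its GPU before any CUDA operation. Below is a minimal sketch under that assumption; `device_for_rank` and `worker` are hypothetical names, not functions from this repository:

```python
def device_for_rank(rank, visible_gpus):
    # Map a worker's rank to the GPU index it should own exclusively.
    return visible_gpus[rank]

def worker(rank, visible_gpus):
    gpu = device_for_rank(rank, visible_gpus)
    # In a real run, call torch.cuda.set_device(gpu) HERE, before any
    # tensor is moved to CUDA and before init_process_group(); otherwise
    # the first CUDA op also creates a context on the default GPU.
    print(f"rank {rank} -> cuda:{gpu}")

if __name__ == "__main__":
    # GPU IDs matching the nvidia-smi output above (ranks 0-6 -> GPUs 3-9).
    for rank in range(7):
        worker(rank, [3, 4, 5, 6, 7, 8, 9])
```

Alternatively, restricting each process's view with the `CUDA_VISIBLE_DEVICES` environment variable before launch has the same effect, since each worker then only sees (and can only allocate on) its own device.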
Have you solved this problem? My situation is similar to yours.
Sorry, I haven't solved it. I switched to the following repository, which can train on a single GPU: https://github.com/lucidrains/lightweight-gan
Same issue here.