
Memory usage during multi GPU training

ferrophile opened this issue 3 years ago · 3 comments

I noticed that when training with multiple GPUs, the spawned processes sometimes also occupy memory on the first GPU, so memory usage is not distributed evenly across devices.

Below is an example run with 7 GPUs. All seven processes are holding some memory on GPU 3 (the first GPU of the run): GPU 3 uses about 10 GB while each of the other GPUs uses only about 3 GB.

(Only the processes whose path contains "stylegan2-pytorch" are related to this repository. I'm not sure whether the other programs running on these GPUs are connected to this issue.)

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    3      8852      C   ...onda3/envs/stylegan2-pytorch/bin/python  3219MiB |
|    3      8853      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8854      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8855      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8856      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8857      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8858      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    4      8853      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    5      8854      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    6      8855      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    6     35126      C   python                                      1191MiB |
|    6     41314      C   python                                      1191MiB |
|    7      8856      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    8      8857      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    8     36933      C   python                                      1191MiB |
|    8     41629      C   python                                      1191MiB |
|    9      8858      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
+-----------------------------------------------------------------------------+


Is this normal? The program appears to run fine, but I have hit out-of-memory errors on the first GPU while the other GPUs remain underused.
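
For reference, one common cause of this pattern in PyTorch distributed training (I can't confirm it is what happens in this repository) is that each rank either touches cuda:0 before being pinned to its own device, or resumes from a checkpoint with a plain `torch.load()`, which restores tensors onto the GPU they were saved from. Below is a minimal sketch of the usual precautions, assuming a torchrun-style launcher that sets `LOCAL_RANK`; the checkpoint path and the layer are placeholders, not code from this repo:

```python
import os
import torch
import torch.distributed as dist

# LOCAL_RANK is set by torchrun / newer torch.distributed.launch invocations.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# 1. Pin this process to its own GPU *before* any CUDA work, otherwise a bare
#    .cuda() or device="cuda" silently allocates on cuda:0 for every rank.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")  # MASTER_ADDR/PORT etc. come from the launcher

device = torch.device("cuda", local_rank)
model = torch.nn.Linear(512, 512).to(device)  # placeholder for the real model

# 2. When resuming, remap the checkpoint to the local device. Without
#    map_location, every rank would restore a copy onto the GPU the
#    checkpoint was saved from (usually the first one).
ckpt_path = "checkpoint.pt"  # placeholder path
if os.path.exists(ckpt_path):
    ckpt = torch.load(ckpt_path,
                      map_location=lambda storage, loc: storage.cuda(local_rank))
    model.load_state_dict(ckpt["model"])
```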

ferrophile avatar Feb 09 '21 13:02 ferrophile

Have you solved this problem? My situation is similar to yours.

mlyarthur avatar Jun 09 '21 05:06 mlyarthur

Sorry, I haven't solved it. I switched to the following repository, which can train on a single GPU: https://github.com/lucidrains/lightweight-gan

ferrophile avatar Jun 14 '21 07:06 ferrophile

Same issue here.

Jiangshuyi0V0 avatar Oct 02 '22 08:10 Jiangshuyi0V0