
Unbalanced GPU memory consumption

Open Michaelsqj opened this issue 2 years ago • 6 comments

Hi there,

I noticed that the GPU memory consumption during training is unbalanced. To be specific, I trained on 8 GPUs: GPU 0 uses 13449 MB while each of the other 7 GPUs uses 5828 MB. This limits the batch size, and a lot of GPU memory is wasted.

Sorry, I don't know much about GPU memory allocation in PyTorch. Does anyone know why this happens?

Michaelsqj avatar Aug 15 '22 21:08 Michaelsqj

https://discuss.pytorch.org/t/extra-10gb-memory-on-gpu-0-in-ddp-tutorial/118113

I put these two lines of code at the beginning of the training_loop function, and the problem is solved.

torch.cuda.set_device(rank)
torch.cuda.empty_cache()

Hope this will help you.
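For context, the two lines above need to run before any CUDA allocation happens in the worker process. A minimal sketch of a DDP-style training entry point (the `training_loop` signature here is simplified for illustration, not EG3D's actual one) would look like:

```python
import torch

def training_loop(rank: int, num_gpus: int):
    if torch.cuda.is_available():
        # Bind this process to GPU `rank` so later device-less CUDA calls
        # land on the right GPU instead of defaulting to cuda:0.
        torch.cuda.set_device(rank)
        # Release any cached blocks already allocated on the wrong device.
        torch.cuda.empty_cache()
    device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")
    # ... build models and dataloaders, then run the usual loop on `device`.
    return device
```

Without the `set_device` call, every rank that touches CUDA implicitly initializes a context on `cuda:0`, which is what inflates GPU 0's memory (see the linked PyTorch forum thread).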

XuefengLi1 avatar Aug 17 '22 07:08 XuefengLi1

Thank you very much for your reply. However, the GPU memory is still unbalanced after adding these two lines here: https://github.com/NVlabs/eg3d/blob/493076071bc400be3289dd6725ef1eef0a552a9d/eg3d/training/training_loop.py#L129


Am I adding them in the correct place?

Michaelsqj avatar Aug 19 '22 12:08 Michaelsqj

It may be a problem with the dataloader when the pin_memory flag is set to true; see pytorch/pytorch#58626.

I solved this by adding the following code here: https://github.com/NVlabs/eg3d/blob/493076071bc400be3289dd6725ef1eef0a552a9d/eg3d/training/training_loop.py#L132

torch.cuda.set_device(device)
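The ordering is what matters here: per the linked PyTorch issue, the memory-pinning thread of a `pin_memory=True` DataLoader allocates against the process's current device, which defaults to `cuda:0` unless `set_device` has already been called. A hedged sketch (the `make_loader` helper and toy dataset are illustrative, not EG3D code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(rank: int, batch_size: int = 4):
    use_cuda = torch.cuda.is_available()
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")
    if use_cuda:
        # Must precede DataLoader creation so pinned buffers are
        # associated with this rank's device, not cuda:0.
        torch.cuda.set_device(device)
    # Toy dataset standing in for the real training set.
    dataset = TensorDataset(torch.randn(32, 3), torch.randn(32, 1))
    return DataLoader(dataset, batch_size=batch_size, pin_memory=use_cuda)

loader = make_loader(rank=0)
first_batch, _ = next(iter(loader))
```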

llrtt avatar Sep 20 '22 02:09 llrtt

torch.cuda.set_device(device)

This additional code let me train EG3D on 8x 2080 Ti, which only have 11019 MB of memory each. Thanks a lot!

szh-bash avatar Apr 15 '23 10:04 szh-bash

torch.cuda.set_device(device)

This additional code let me train EG3D on 8x 2080 Ti, which only have 11019 MB of memory each. Thanks a lot!

Hi there! Have you successfully trained the model? Can you share how long training takes on 8 2080 Tis?

dafei-qin avatar Apr 18 '23 07:04 dafei-qin

Hi there! Have you successfully trained the model? Can you share how long training takes on 8 2080 Tis?

around 38 s/kimg

szh-bash avatar Apr 23 '23 10:04 szh-bash