Unbalanced GPU memory consumption
Hi there,
I noticed that the GPU memory consumption during training is unbalanced. To be more specific, I used 8 GPUs for training. It seems that GPU 0 uses 13449 MB of memory while each of the other 7 GPUs uses only 5828 MB, which limits the batch size and wastes a lot of GPU memory.
Sorry, I don't have much knowledge of GPU memory allocation in PyTorch. Does anyone know why?
https://discuss.pytorch.org/t/extra-10gb-memory-on-gpu-0-in-ddp-tutorial/118113
I put these two lines of code at the beginning of the training_loop function, and the problem was solved.
torch.cuda.set_device(rank)
torch.cuda.empty_cache()
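For context, here is a minimal sketch of where these two lines could sit in a rank-per-process DDP setup (the training_loop signature below is simplified for illustration, not EG3D's exact one):

import torch

def training_loop(rank, num_gpus):  # simplified signature for illustration
    # Bind this process to its own GPU so later allocations (and the lazily
    # created CUDA context) land on GPU `rank` instead of the default GPU 0.
    torch.cuda.set_device(rank)
    # Release any cached blocks that may already have been allocated on GPU 0
    # before the device was set.
    torch.cuda.empty_cache()

    device = torch.device('cuda', rank)
    # ... rest of the loop: dataset, model construction, DDP wrapping, training ...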
Hope this will help you.
Thank you very much for your reply. However, the GPU memory is still unbalanced after adding these two lines here: https://github.com/NVlabs/eg3d/blob/493076071bc400be3289dd6725ef1eef0a552a9d/eg3d/training/training_loop.py#L129
Am I adding them in the correct place?
It may be a problem with the dataloader when the pin_memory flag is set to true. See pytorch/pytorch#58626.
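As I understand the linked issue, the DataLoader's pin-memory thread initializes a CUDA context on the current device, which defaults to GPU 0 in every process unless set_device has been called first. A rough illustration of the workaround (build_loader and its arguments are just placeholders, not EG3D code):

import torch
from torch.utils.data import DataLoader

def build_loader(dataset, rank, batch_size):
    # Make GPU `rank` the current device *before* the loader starts its
    # pin-memory thread; otherwise that thread creates an extra CUDA
    # context on GPU 0 in every training process.
    torch.cuda.set_device(rank)
    return DataLoader(dataset, batch_size=batch_size, num_workers=3,
                      pin_memory=True, shuffle=True)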
I solved this by adding the following code here: https://github.com/NVlabs/eg3d/blob/493076071bc400be3289dd6725ef1eef0a552a9d/eg3d/training/training_loop.py#L132
torch.cuda.set_device(device)
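Roughly, the placement looks like this, assuming device has just been defined from the process rank a few lines above (paraphrased, not the exact lines of training_loop.py):

device = torch.device('cuda', rank)
torch.cuda.set_device(device)  # added: make this GPU the process's default CUDA device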
This additional code helped me train EG3D on 8× 2080 Ti GPUs, which only have 11019 MB of memory each. Thanks a lot!
Hi there! Have you successfully trained the model? Can you share how long training takes on 8× 2080 Ti?
around 38 s/kimg