DDP: why does every process allocate memory of GPU 0 and how to avoid it?
Run this example with 2 GPUs. Process 2 will allocate some memory on GPU 0.
python main.py --multiprocessing-distributed --world-size 1 --rank 0
I have carefully checked the sample code and there seems to be no obvious error that would cause process 2 to transfer data to GPU 0.
So:
- Why does process 2 allocate memory on GPU 0?
- Is this memory involved in the computation? If it is, then as the number of processes grows, won't GPU 0 become seriously overloaded?
- Is there any way to avoid it?
Thanks in advance to everyone in the PyTorch community for their hard work.
https://github.com/pytorch/examples/blob/0cb38ebb1b6e50426464b3485435c0c6affc2b65/imagenet/main.py#L310
loss.backward()
When I remove this line, process 1 no longer allocates memory on GPU 0, so the extra allocation happens during backpropagation.
Does anyone have some insights?
Maybe you are using torch.load() without map_location=lambda storage, loc: storage. If the original checkpoint saved its tensors on particular GPUs, torch.load() will map them back onto those same GPUs in every process that loads the file.
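If that is the cause, here is a minimal sketch of the fix (the toy model, checkpoint path, and "state_dict" key are placeholders, not taken from the example):

import torch

rank = 1  # GPU index of this process; in the real code it comes from the launcher

# Toy model and a pretend checkpoint, just so the snippet runs on its own.
model = torch.nn.Linear(10, 10)
torch.save({"state_dict": model.state_dict()}, "checkpoint.pth.tar")

# Without map_location, tensors that were saved from a GPU are restored onto
# that same GPU in every process that loads the file, allocating memory there.
# Mapping everything to CPU storage keeps the load off GPU 0.
checkpoint = torch.load("checkpoint.pth.tar",
                        map_location=lambda storage, loc: storage)
model.load_state_dict(checkpoint["state_dict"])
model = model.cuda(rank)  # afterwards, move the weights to this process's GPU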
This https://discuss.pytorch.org/t/extra-10gb-memory-on-gpu-0-in-ddp-tutorial/118113 solved the problem for me
torch.cuda.set_device(rank)
torch.cuda.empty_cache()
This still doesn't seem to be helping in my case :-(
Just had the same problem and debugged it. You need to put torch.cuda.set_device(rank) before dist.init_process_group().
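For anyone who finds this thread later, a minimal sketch of that ordering inside the per-process worker (the address, toy model, and spawn call are illustrative, not taken from the ImageNet example):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main_worker(rank, world_size):
    # Pin this process to its own GPU *before* creating the process group,
    # so the NCCL setup and the backward pass don't default to GPU 0.
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl",
                            init_method="tcp://127.0.0.1:23456",  # illustrative address
                            world_size=world_size,
                            rank=rank)
    model = torch.nn.Linear(10, 10).cuda(rank)  # toy model
    model = DDP(model, device_ids=[rank])
    # ... training loop; loss.backward() now only allocates on this rank's GPU

if __name__ == "__main__":
    torch.multiprocessing.spawn(main_worker, args=(2,), nprocs=2)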