
DDP: why does every process allocate memory on GPU 0 and how to avoid it?

Open siaimes opened this issue 2 years ago • 7 comments

Run this example with 2 GPUs; process 2 will allocate some memory on GPU 0.

python main.py --multiprocessing-distributed --world-size 1 --rank 0

[screenshot: GPU memory usage showing process 2 with memory allocated on GPU 0]

I have carefully checked the sample code and there seems to be no obvious error that would cause process 2 to transfer data to GPU 0.

So:

  1. Why does process 2 allocate memory of GPU 0?
  2. Is this memory involved in the computation? If it is, then as the number of processes grows, GPU 0 will become seriously overloaded.
  3. Is there any way to avoid it?

Thanks in advance to everyone in the PyTorch community for their hard work.

siaimes avatar Mar 08 '22 13:03 siaimes

https://github.com/pytorch/examples/blob/0cb38ebb1b6e50426464b3485435c0c6affc2b65/imagenet/main.py#L310

        loss.backward()

When I remove this line, process 1 no longer allocates memory on GPU 0, so the extra allocation happens during backpropagation.

Does anyone have some insights?

siaimes avatar Mar 18 '22 03:03 siaimes

Maybe you are using torch.load() without map_location=lambda storage, loc: storage. If the original checkpoint saved its tensors on a different GPU, torch.load() will map them back onto that GPU and allocate memory there.
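For example (a minimal sketch; the checkpoint file name and the "state_dict" key follow the ImageNet example's convention, and rank stands for this process's GPU index):

    import torch
    import torchvision.models as models

    rank = 0  # this process's GPU index (illustrative)
    model = models.resnet18()

    # Map every saved tensor to CPU first, so tensors saved from another
    # GPU do not allocate memory on that GPU (e.g. GPU 0) in this process.
    checkpoint = torch.load("checkpoint.pth.tar",
                            map_location=lambda storage, loc: storage)
    model.load_state_dict(checkpoint["state_dict"])
    model.cuda(rank)

    # Alternatively, map straight onto this process's own GPU:
    # checkpoint = torch.load("checkpoint.pth.tar",
    #                         map_location=f"cuda:{rank}")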

GongZhengLi avatar Jul 28 '22 02:07 GongZhengLi

This thread solved the problem for me: https://discuss.pytorch.org/t/extra-10gb-memory-on-gpu-0-in-ddp-tutorial/118113

    torch.cuda.set_device(rank)
    torch.cuda.empty_cache()
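In context, those two calls belong at the very top of each spawned worker, before anything is moved to the GPU (a minimal sketch; main_worker and rank are illustrative names, not necessarily the exact ones in main.py):

    import torch
    import torchvision.models as models

    def main_worker(rank, world_size):
        # Bind this process to its own GPU before any other CUDA call,
        # so no context (and no cached memory) is created on GPU 0.
        torch.cuda.set_device(rank)
        torch.cuda.empty_cache()

        # Everything below now defaults to cuda:<rank> for this process.
        model = models.resnet18().cuda(rank)
        # ... wrap in DDP, build the optimizer and data loaders as usual ...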

kensun0 avatar Oct 18 '22 08:10 kensun0

This still doesn't seem to be helping in my case :-(

bhattg avatar Feb 14 '23 20:02 bhattg

Just had the same problem and debugged it. You need to call torch.cuda.set_device(rank) before dist.init_process_group().
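Concretely, the ordering inside each worker would look like this (a minimal sketch assuming the NCCL backend, one process per GPU, and an env:// rendezvous; the address, port, and model are illustrative):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main_worker(rank, world_size):
        # 1. Pin this process to its GPU *before* init_process_group(),
        #    so NCCL/CUDA state is created on cuda:<rank> instead of cuda:0.
        torch.cuda.set_device(rank)

        # 2. Only then initialize the default process group.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)

        # 3. Build the model on this process's GPU and wrap it in DDP.
        model = torch.nn.Linear(10, 10).cuda(rank)
        model = DDP(model, device_ids=[rank])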

hieuhoang avatar Dec 03 '23 01:12 hieuhoang