CUDA out of memory in the middle of an epoch (deepspeed --amp)

Open robvanvolt opened this issue 3 years ago • 4 comments

Hey there!

I encountered a strange bug: CUDA ran out of memory partway through an epoch, well into the run...

1 16370 loss - 3.8609836101531982
[2021-06-23 02:08:42,649] [INFO] [logging.py:60:log_dist] [Rank 0] step=861430, skipped=0, lr=[0.0004], mom=[(0.9, 0.999)]
[2021-06-23 02:08:42,654] [INFO] [timer.py:154:stop] 0/298060, SamplesPerSec=55.12261710734795
Traceback (most recent call last):
  File "train_dalle.py", line 431, in <module>
    distr_dalle.backward(loss)
  File "/home/robert/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1005, in backward
    scaled_loss.backward()
  File "/home/robert/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/robert/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 1.93 GiB (GPU 0; 23.70 GiB total capacity; 15.29 GiB already allocated; 1.54 GiB free; 20.47 GiB reserved in total by PyTorch) 

Has anyone encountered similar problems?
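
In case it helps anyone hitting the same thing, here is a rough sketch of how the backward call could be wrapped so an OOM dumps the allocator state before the process dies (illustrative only, not the actual train_dalle.py code; `distr_dalle` and `loss` are the variables from the training script):

```python
import torch

# Illustrative sketch: wrap the backward call (line 431 of train_dalle.py)
# so a CUDA OOM prints the caching-allocator state before re-raising.
try:
    distr_dalle.backward(loss)  # distr_dalle / loss come from the training script
except RuntimeError as err:
    if "out of memory" in str(err):
        # memory_summary() shows allocated vs. reserved memory per block size,
        # which helps distinguish a leak from fragmentation.
        print(torch.cuda.memory_summary(device=0, abbreviated=True))
    raise
```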

robvanvolt avatar Jun 23 '21 05:06 robvanvolt

@robvanvolt

I suppose I've had my fair share of unexplained OOMs. Is there anything else specific about the run that you can think of? Is it happening on a consistent basis?

afiaka87 avatar Jun 28 '21 05:06 afiaka87

BTW, do you have a nice curated output from your most recent checkpoint to use for the README.md in the new PR I'm working on? I just checked that checkpoint out and yeah - definitely the best one yet. Really cool.

afiaka87 avatar Jun 28 '21 05:06 afiaka87

The CUDA memory allocation values don't make sense to me.

Tried to allocate 1.93 GiB (GPU 0; 23.70 GiB total capacity; 15.29 GiB already allocated; 1.54 GiB free; 20.47 GiB reserved in total by PyTorch)

24 GiB total - 20 GiB reserved = ~4 GiB unreserved.
20 GiB reserved - 15 GiB allocated = ~5 GiB reserved but free.
Yet the message states only 1.54 GiB free, and 5 GiB of reserved-but-free memory minus the ~2 GiB to be allocated is obviously positive.

But yeah, if you're able to reproduce, that would be great! Maybe we're leaking GPU memory somewhere?
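
For what it's worth, the raw numbers are easy to dump from inside the process. A small sketch using the standard torch.cuda counters (nothing DeepSpeed-specific) that prints the same quantities the error message reports:

```python
import torch

# Print the quantities from the OOM message, in GiB, for GPU 0
# (the single GPU from the traceback above).
gib = 1024 ** 3
props = torch.cuda.get_device_properties(0)

total     = props.total_memory / gib
allocated = torch.cuda.memory_allocated(0) / gib  # memory held by live tensors
reserved  = torch.cuda.memory_reserved(0) / gib   # memory held by the caching allocator

print(f"total={total:.2f} allocated={allocated:.2f} "
      f"reserved={reserved:.2f} reserved-but-free={reserved - allocated:.2f}")
```

My guess (and it is only a guess) is that the reserved-but-free memory was too fragmented to serve a contiguous 1.93 GiB block, which would make the message look contradictory even without a leak.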

janEbert avatar Jun 30 '21 13:06 janEbert

Hard to reproduce, because it takes 3 days until the failure occurs - I try to keep non-training GPU usage to a minimum (closed gdm3/lightdm, no other critical software running, apart from me logging in through ssh from time to time). Will see if I can pin down / reproduce the leakage in further trainings. :)
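
What I'll probably do for the next run is log the allocator counters every few hundred steps, something like the sketch below (the `log_gpu_memory` helper is made up, not part of train_dalle.py). If "allocated" keeps creeping up across epochs instead of staying flat, that would point at a leak rather than fragmentation.

```python
import torch

# Hypothetical helper: call from the training loop every `every` steps.
# A slow leak shows up as "allocated" growing steadily over time.
def log_gpu_memory(step, every=500, device=0):
    if step % every != 0:
        return
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib
    reserved  = torch.cuda.memory_reserved(device) / gib
    peak      = torch.cuda.max_memory_allocated(device) / gib
    print(f"step {step}: allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")
```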

robvanvolt avatar Jul 01 '21 13:07 robvanvolt