DALLE-pytorch
CUDA out of memory in the middle of an epoch (deepspeed --amp)
Hey there!
I encountered a strange bug - CUDA ran out of memory in the middle of an epoch, partway through a run...
1 16370 loss - 3.8609836101531982
[2021-06-23 02:08:42,649] [INFO] [logging.py:60:log_dist] [Rank 0] step=861430, skipped=0, lr=[0.0004], mom=[(0.9, 0.999)]
[2021-06-23 02:08:42,654] [INFO] [timer.py:154:stop] 0/298060, SamplesPerSec=55.12261710734795
Traceback (most recent call last):
  File "train_dalle.py", line 431, in <module>
    distr_dalle.backward(loss)
  File "/home/robert/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1005, in backward
    scaled_loss.backward()
  File "/home/robert/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/robert/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 1.93 GiB (GPU 0; 23.70 GiB total capacity; 15.29 GiB already allocated; 1.54 GiB free; 20.47 GiB reserved in total by PyTorch)
Has anyone encountered similar problems?
@robvanvolt
I suppose I've had my fair share of unexplained OOMs. Is there anything else specific about the run that you can think of? Is it happening on a consistent basis?
BTW, do you have a nice curated output from your most recent checkpoint to use for the README.md in the new PR I'm working on? I just checked that checkpoint out and yeah - definitely the best one yet. Really cool.
The CUDA memory allocation values don't make sense to me.
Tried to allocate 1.93 GiB (GPU 0; 23.70 GiB total capacity; 15.29 GiB already allocated; 1.54 GiB free; 20.47 GiB reserved in total by PyTorch)
24 GB total - 20 GB reserved = ~4 GB unreserved; 20 GB reserved - 15 GB allocated = ~5 GB reserved but unused. Yet the error reports only 1.54 GB free, and ~5 GB of reserved-but-unused memory minus the ~2 GB it tried to allocate is obviously positive.
But yeah, if you're able to reproduce, that would be great! Maybe we're leaking GPU memory somewhere?
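For reference, here's a minimal sketch of how the allocator state could be inspected around the failing step (the helper name `log_cuda_mem` is mine, not from train_dalle.py). One caveat about the arithmetic above: the ~5 GB of reserved-but-unused memory only helps if the caching allocator can carve a large enough contiguous block out of it for the 1.93 GiB request, so fragmentation could still explain the OOM even though the totals look fine:

```python
import torch

def log_cuda_mem(tag: str, device: int = 0) -> None:
    # Bytes currently occupied by live tensors.
    allocated = torch.cuda.memory_allocated(device)
    # Bytes held by PyTorch's caching allocator (live tensors + cached free blocks).
    reserved = torch.cuda.memory_reserved(device)
    total = torch.cuda.get_device_properties(device).total_memory
    gib = 2 ** 30
    print(f"[{tag}] allocated={allocated / gib:.2f} GiB "
          f"reserved={reserved / gib:.2f} GiB total={total / gib:.2f} GiB")
    # Detailed per-size-class breakdown; useful for spotting fragmentation.
    print(torch.cuda.memory_summary(device, abbreviated=True))
```

Calling something like this right before `distr_dalle.backward(loss)` on a long run would show whether "allocated" has been creeping up over time (a leak) or whether reserved memory is just badly fragmented.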
Hard to reproduce, because it takes 3 days until the failure occurs - I try to keep out-of-training leakage to a minimum (gdm3/lightdm closed, no other critical software running, apart from me logging in through ssh from time to time). Will see if I can pin down / reproduce the leakage in further training runs. :)
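In case it helps, this is roughly the kind of periodic logging I'd bolt onto the training loop to pin it down (the loop below is a simplified stand-in for the real one in train_dalle.py, and `LOG_EVERY` is an arbitrary interval I made up). The heuristic: steadily rising "allocated" across hours points to a leak, e.g. accumulating loss tensors without `.item()`, while flat "allocated" but growing "reserved" points more toward fragmentation:

```python
import torch

LOG_EVERY = 1000  # steps; arbitrary interval for a multi-day run

running_loss = 0.0
for step, batch in enumerate(loader):  # placeholder for the real dataloader
    loss = distr_dalle(*batch)         # stand-in for the actual forward call
    distr_dalle.backward(loss)
    distr_dalle.step()

    # .item() extracts a detached Python float; keeping the `loss` tensors
    # around instead would retain their autograd graphs and leak GPU memory.
    running_loss += loss.item()

    if step % LOG_EVERY == 0:
        gib = 2 ** 30
        print(f"step={step} "
              f"allocated={torch.cuda.memory_allocated() / gib:.2f} GiB "
              f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB "
              f"peak={torch.cuda.max_memory_allocated() / gib:.2f} GiB")
        torch.cuda.reset_peak_memory_stats()  # track peak per logging window
```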