denoising-diffusion-pytorch

Running out of memory

Open andrewmoise opened this issue 3 years ago • 1 comment

Any advice on how to deal with running out of GPU memory? I'm just getting started with PyTorch and this package, and this is what happens when I try an initial test run of 7000 training steps (57,000 training images at 128x128, on a GPU with 15 GB of memory):

>>> from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer
>>> 
>>> model = Unet(
...     dim = 64,
...     dim_mults = (1, 2, 4, 8)
... ).cuda()
>>> 
>>> diffusion = GaussianDiffusion(
...     model,
...     image_size = 128,
...     timesteps = 1000,   # number of steps
...     loss_type = 'l1'    # L1 or L2
... ).cuda()
>>> trainer = Trainer(
...     diffusion,
...     'training-set-2',
...     train_batch_size = 32,
...     train_lr = 2e-5,
...     train_num_steps = 7000,           # total training steps
...     gradient_accumulate_every = 2,    # gradient accumulation steps
...     ema_decay = 0.995,                # exponential moving average decay
...     amp = True                        # turn on mixed precision
... )
>>> 
>>> trainer.train()
sampling loop time step: 100%|██████████████████| 1000/1000 [08:45<00:00,  1.90it/s]
loss: 0.2902:  14%|███▊                       | 1001/7000 [55:22<5:31:53,  3.32s/it]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 823, in train
    self.accelerator.backward(loss)
  File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 884, in backward
    loss.backward(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 14.56 GiB total capacity; 13.02 GiB already allocated; 84.44 MiB free; 13.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
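
For what it's worth, the fragmentation hint at the end of that error points at PyTorch's caching-allocator options. PYTORCH_CUDA_ALLOC_CONF and max_split_size_mb are real PyTorch settings (they come straight from the error text), but whether they help in this case is untested, and 128 below is just an example split size. The variable has to be set before the first CUDA allocation, so in a fresh script or session rather than mid-run:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # parsed by the allocator on first CUDA use

from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer
# ... then build the Unet / GaussianDiffusion / Trainer exactly as above ...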

andrewmoise · Aug 11 '22 17:08

Update: I modified Trainer.train() to delete intermediate data (the loss history and the sampled images) once it's done with them, and it has now run past the point where it was running out of memory before. I'll play with it a little more and then send a PR, if that sounds okay.
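
Roughly, the idea is to stop holding references to large intermediates (accumulated loss tensors and sampled image batches) once they've been used, so the allocator can reuse that memory. Here is a minimal sketch of the pattern, with illustrative names rather than the actual code in denoising_diffusion_pytorch.py:

import torch

def training_loop(diffusion, dataloader, optimizer, accelerator, num_steps, sample_every):
    for step in range(num_steps):
        batch = next(dataloader)
        loss = diffusion(batch)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

        loss_value = loss.item()  # keep a plain float for logging, not the autograd-carrying tensor
        del batch, loss           # drop references so the cached blocks can be reused

        if step % sample_every == 0:
            with torch.no_grad():
                images = diffusion.sample(batch_size=16)
                # ... write images to disk here ...
            del images                  # free the sampled batch before training resumes
            torch.cuda.empty_cache()    # optional: release cached blocks back to the driver

Accumulating loss.item() instead of the loss tensor itself also avoids keeping autograd references alive across iterations, which is a common source of slow memory growth in training loops.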

andrewmoise · Aug 11 '22 20:08