
CUDA OOM error when "saving batch"

Open RuntimeRacer opened this issue 1 year ago • 20 comments

Just ran into this in the midst of training. I assume the epoch may have ended and the trainer tried to save something to disk, even though the saved file is only a few MB in size.

2023-05-01 13:03:54,309 INFO [trainer.py:757] Epoch 1, batch 164100, train_loss[loss=2.413, ArTop10Accuracy=0.7735, over 4777.00 frames. ], tot_loss[loss=2.65, ArTop10Accuracy=0.7551, over 5235.90 frames. ], batch size: 17, lr: 8.73e-03
2023-05-01 13:04:19,027 INFO [trainer.py:757] Epoch 1, batch 164200, train_loss[loss=2.595, ArTop10Accuracy=0.7517, over 4929.00 frames. ], tot_loss[loss=2.66, ArTop10Accuracy=0.7534, over 5227.10 frames. ], batch size: 16, lr: 8.72e-03
2023-05-01 13:04:43,635 INFO [trainer.py:757] Epoch 1, batch 164300, train_loss[loss=2.718, ArTop10Accuracy=0.752, over 5319.00 frames. ], tot_loss[loss=2.662, ArTop10Accuracy=0.7525, over 5215.61 frames. ], batch size: 13, lr: 8.72e-03
2023-05-01 13:05:08,249 INFO [trainer.py:757] Epoch 1, batch 164400, train_loss[loss=2.874, ArTop10Accuracy=0.7315, over 5625.00 frames. ], tot_loss[loss=2.665, ArTop10Accuracy=0.7518, over 5210.18 frames. ], batch size: 12, lr: 8.72e-03
2023-05-01 13:05:32,968 INFO [trainer.py:757] Epoch 1, batch 164500, train_loss[loss=2.744, ArTop10Accuracy=0.7321, over 5302.00 frames. ], tot_loss[loss=2.675, ArTop10Accuracy=0.7504, over 5204.26 frames. ], batch size: 13, lr: 8.72e-03
2023-05-01 13:05:58,669 INFO [trainer.py:757] Epoch 1, batch 164600, train_loss[loss=2.714, ArTop10Accuracy=0.7356, over 5771.00 frames. ], tot_loss[loss=2.679, ArTop10Accuracy=0.7497, over 5181.83 frames. ], batch size: 14, lr: 8.71e-03
2023-05-01 13:06:11,613 INFO [trainer.py:1081] Saving batch to exp/valle/batch-bdd640fb-0667-1ad1-1c80-317fa3b1799d.pt
Traceback (most recent call last):
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1150, in <module>
    main()
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1143, in main
    run(rank=0, world_size=1, args=args)
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1032, in run
    train_one_epoch(
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 669, in train_one_epoch
    scaler.scale(loss).backward()
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 640.00 MiB (GPU 0; 23.69 GiB total capacity; 20.74 GiB already allocated; 517.81 MiB free; 22.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for
Memory Management and PYTORCH_CUDA_ALLOC_CONF
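
Judging by the traceback, the OOM is raised from scaler.scale(loss).backward(), so the "Saving batch to ..." line right before it is most likely the trainer dumping the offending batch for debugging rather than the save itself running out of memory. As the error message suggests, one thing to try alongside smaller batches is capping allocator block splitting to reduce fragmentation. A minimal sketch of that, assuming the setting is applied before CUDA is initialized (the 512 value is an arbitrary starting point, not something from this repo):

```python
import os

# Hedged example: limit the size of cached blocks the CUDA caching allocator
# may split, which can reduce fragmentation when reserved memory is much
# larger than allocated memory. Must take effect before the first CUDA
# allocation, e.g. at the very top of trainer.py or exported in the shell
# before launching training.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

import torch  # imported after setting the env var so the allocator picks it up
```

Equivalently, export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 in the shell before starting the run.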

I'll lower max-duration from 80 to 60 and continue training now.
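
(For context, I'm assuming --max-duration here is the usual lhotse-style cap on the total audio seconds per batch, so lowering it shrinks the largest batches the dynamic bucketing sampler can emit, and with them the peak memory used in backward(). A rough sketch of that assumption, with placeholder names rather than this repo's exact code:)

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

def make_sampler(cuts: CutSet, max_duration: float = 60.0) -> DynamicBucketingSampler:
    # max_duration bounds the summed duration (in seconds) of all cuts in a
    # batch; 60 instead of 80 means a smaller worst-case batch.
    return DynamicBucketingSampler(cuts, max_duration=max_duration, shuffle=True)
```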

RuntimeRacer · May 01 '23 14:05