vall-e
CUDA OOM error when "saving batch"
Just ran into this in the middle of training. I assume the epoch ended and the trainer tried to save something to disk, even though the file it writes is only a few MB in size.
2023-05-01 13:03:54,309 INFO [trainer.py:757] Epoch 1, batch 164100, train_loss[loss=2.413, ArTop10Accuracy=0.7735, over 4777.00 frames. ], tot_loss[loss=2.65, ArTop10Accuracy=0.7551, over 5235.90 frames. ], batch size: 17, lr: 8.73e-03
2023-05-01 13:04:19,027 INFO [trainer.py:757] Epoch 1, batch 164200, train_loss[loss=2.595, ArTop10Accuracy=0.7517, over 4929.00 frames. ], tot_loss[loss=2.66, ArTop10Accuracy=0.7534, over 5227.10 frames. ], batch size: 16, lr: 8.72e-03
2023-05-01 13:04:43,635 INFO [trainer.py:757] Epoch 1, batch 164300, train_loss[loss=2.718, ArTop10Accuracy=0.752, over 5319.00 frames. ], tot_loss[loss=2.662, ArTop10Accuracy=0.7525, over 5215.61 frames. ], batch size: 13, lr: 8.72e-03
2023-05-01 13:05:08,249 INFO [trainer.py:757] Epoch 1, batch 164400, train_loss[loss=2.874, ArTop10Accuracy=0.7315, over 5625.00 frames. ], tot_loss[loss=2.665, ArTop10Accuracy=0.7518, over 5210.18 frames. ], batch size: 12, lr: 8.72e-03
2023-05-01 13:05:32,968 INFO [trainer.py:757] Epoch 1, batch 164500, train_loss[loss=2.744, ArTop10Accuracy=0.7321, over 5302.00 frames. ], tot_loss[loss=2.675, ArTop10Accuracy=0.7504, over 5204.26 frames. ], batch size: 13, lr: 8.72e-03
2023-05-01 13:05:58,669 INFO [trainer.py:757] Epoch 1, batch 164600, train_loss[loss=2.714, ArTop10Accuracy=0.7356, over 5771.00 frames. ], tot_loss[loss=2.679, ArTop10Accuracy=0.7497, over 5181.83 frames. ], batch size: 14, lr: 8.71e-03
2023-05-01 13:06:11,613 INFO [trainer.py:1081] Saving batch to exp/valle/batch-bdd640fb-0667-1ad1-1c80-317fa3b1799d.pt
Traceback (most recent call last):
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1150, in <module>
    main()
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1143, in main
    run(rank=0, world_size=1, args=args)
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1032, in run
    train_one_epoch(
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 669, in train_one_epoch
    scaler.scale(loss).backward()
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 640.00 MiB (GPU 0; 23.69 GiB total capacity; 20.74 GiB already allocated; 517.81 MiB free; 22.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
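As the allocator message suggests, fragmentation can sometimes be worked around by setting max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of how that could be done, assuming the variable is set before the first CUDA allocation (the 512 MiB value is only an example, not a recommendation from this repo):

import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is first
# initialized, so it must be set before any tensor lands on the GPU
# (or exported in the shell before launching trainer.py).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch

x = torch.zeros(1, device="cuda")  # allocator now uses the configured split size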
I'll continue by lowering max-duration from 80 to 60 now.
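If max-duration maps to the lhotse sampler's max_duration (the cap on total seconds of audio per batch), lowering it directly shrinks the largest batches and with them the peak memory of the backward pass. A rough sketch of that relationship, assuming a stock DynamicBucketingSampler setup (the manifest path and num_buckets here are illustrative, not the repo's actual values):

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("cuts_train.jsonl.gz")  # hypothetical manifest path

# max_duration=60 means each batch holds at most ~60 s of audio in total,
# so peak activation memory during backward drops roughly in proportion.
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=60.0,  # was 80.0 when the OOM occurred
    shuffle=True,
    num_buckets=30,
)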