
CUDA OOM error when "saving batch"

Open RuntimeRacer opened this issue 2 years ago • 20 comments

Just ran into this in the midst of training. I assume the epoch ended and the trainer tried to save something to disk, even though the saved batch is only a few MB in size.

2023-05-01 13:03:54,309 INFO [trainer.py:757] Epoch 1, batch 164100, train_loss[loss=2.413, ArTop10Accuracy=0.7735, over 4777.00 frames. ], tot_loss[loss=2.65, ArTop10Accuracy=0.7551, over 5235.90 frames. ], batch size: 17, lr: 8.73e-03
2023-05-01 13:04:19,027 INFO [trainer.py:757] Epoch 1, batch 164200, train_loss[loss=2.595, ArTop10Accuracy=0.7517, over 4929.00 frames. ], tot_loss[loss=2.66, ArTop10Accuracy=0.7534, over 5227.10 frames. ], batch size: 16, lr: 8.72e-03
2023-05-01 13:04:43,635 INFO [trainer.py:757] Epoch 1, batch 164300, train_loss[loss=2.718, ArTop10Accuracy=0.752, over 5319.00 frames. ], tot_loss[loss=2.662, ArTop10Accuracy=0.7525, over 5215.61 frames. ], batch size: 13, lr: 8.72e-03
2023-05-01 13:05:08,249 INFO [trainer.py:757] Epoch 1, batch 164400, train_loss[loss=2.874, ArTop10Accuracy=0.7315, over 5625.00 frames. ], tot_loss[loss=2.665, ArTop10Accuracy=0.7518, over 5210.18 frames. ], batch size: 12, lr: 8.72e-03
2023-05-01 13:05:32,968 INFO [trainer.py:757] Epoch 1, batch 164500, train_loss[loss=2.744, ArTop10Accuracy=0.7321, over 5302.00 frames. ], tot_loss[loss=2.675, ArTop10Accuracy=0.7504, over 5204.26 frames. ], batch size: 13, lr: 8.72e-03
2023-05-01 13:05:58,669 INFO [trainer.py:757] Epoch 1, batch 164600, train_loss[loss=2.714, ArTop10Accuracy=0.7356, over 5771.00 frames. ], tot_loss[loss=2.679, ArTop10Accuracy=0.7497, over 5181.83 frames. ], batch size: 14, lr: 8.71e-03
2023-05-01 13:06:11,613 INFO [trainer.py:1081] Saving batch to exp/valle/batch-bdd640fb-0667-1ad1-1c80-317fa3b1799d.pt
Traceback (most recent call last):
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1150, in <module>
    main()
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1143, in main
    run(rank=0, world_size=1, args=args)
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1032, in run
    train_one_epoch(
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 669, in train_one_epoch
    scaler.scale(loss).backward()
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 640.00 MiB (GPU 0; 23.69 GiB total capacity; 20.74 GiB already allocated; 517.81 MiB free; 22.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
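The hint at the end of the error message refers to PyTorch's caching-allocator configuration, which is controlled via the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch of applying it, assuming the training script can set the variable before torch initializes CUDA (the 128 MiB split size is an illustrative value, not a tested recommendation):

```python
import os

# Must be set before the first CUDA allocation, i.e. before importing torch
# in the training entry point (or exported in the shell that launches it).
# "max_split_size_mb:128" asks the caching allocator not to split blocks
# larger than 128 MiB, which reduces fragmentation when reserved memory is
# much larger than allocated memory, at some cost in allocation flexibility.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# ...then import torch and run the trainer as usual.
```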

I'll continue with lowering max-duration from 80 to 60 now.

RuntimeRacer avatar May 01 '23 14:05 RuntimeRacer

Can confirm this also happens with max-duration 60, again after exactly 164,500 steps. It is probably the epoch switch, which breaks the process. Not sure what it tries to save there, but it fills up VRAM completely.

2023-05-02 02:17:29,934 INFO [trainer.py:757] Epoch 1, batch 164100, train_loss[loss=2.378, ArTop10Accuracy=0.7807, over 4777.00 frames. ], tot_loss[loss=2.62, ArTop10Accuracy=0.7615, over 5235.90 frames. ], batch size: 17, lr: 6.21e-03
2023-05-02 02:17:54,606 INFO [trainer.py:757] Epoch 1, batch 164200, train_loss[loss=2.57, ArTop10Accuracy=0.7547, over 4929.00 frames. ], tot_loss[loss=2.63, ArTop10Accuracy=0.7598, over 5227.10 frames. ], batch size: 16, lr: 6.21e-03
2023-05-02 02:18:19,167 INFO [trainer.py:757] Epoch 1, batch 164300, train_loss[loss=2.688, ArTop10Accuracy=0.7569, over 5319.00 frames. ], tot_loss[loss=2.632, ArTop10Accuracy=0.7586, over 5215.61 frames. ], batch size: 13, lr: 6.21e-03
2023-05-02 02:18:43,727 INFO [trainer.py:757] Epoch 1, batch 164400, train_loss[loss=2.85, ArTop10Accuracy=0.7406, over 5625.00 frames. ], tot_loss[loss=2.635, ArTop10Accuracy=0.758, over 5210.18 frames. ], batch size: 12, lr: 6.21e-03
2023-05-02 02:19:08,370 INFO [trainer.py:757] Epoch 1, batch 164500, train_loss[loss=2.717, ArTop10Accuracy=0.737, over 5302.00 frames. ], tot_loss[loss=2.644, ArTop10Accuracy=0.7566, over 5204.26 frames. ], batch size: 13, lr: 6.21e-03
2023-05-02 02:19:19,203 INFO [trainer.py:1081] Saving batch to exp/valle/batch-bdd640fb-0667-1ad1-1c80-317fa3b1799d.pt
Traceback (most recent call last):
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1150, in <module>
    main()
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1143, in main
    run(rank=0, world_size=1, args=args)
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1032, in run
    train_one_epoch(
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 669, in train_one_epoch
    scaler.scale(loss).backward()
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 580.00 MiB (GPU 0; 23.69 GiB total capacity; 20.86 GiB already allocated; 515.81 MiB free; 22.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

RuntimeRacer avatar May 02 '23 06:05 RuntimeRacer

Might also be related to the issue I mentioned here in combination with Japanese / Chinese symbols, since my dataset contains these: https://github.com/lifeiteng/vall-e/issues/94#issuecomment-1530323074

I'll try another iteration on the model later without these languages to see if that makes any difference.

RuntimeRacer avatar May 02 '23 07:05 RuntimeRacer

I think Vall-E X is for multi-language support. Not sure if Vall-E can learn multiple languages.

nivibilla avatar May 02 '23 08:05 nivibilla

From my testing yesterday, it was able to transfer the dialect from the input sample to the output even when it's a different language. What Vall-E X actually does is get rid of the dialect, or rather convert the dialect from the source language to the target language.

I believe the OOM issue comes from the model implementation not being able to handle these symbols. I've seen there is a PR for adding Chinese language support with a giant wordlist and G2PBackend, which I believe is needed to convert the symbolic script into words that can be properly phonemized internally.

Because yesterday I tried inference with something like this:

  1. 这是一个中文句子 ("this is a Chinese sentence") -> gets an OOM error
  2. Zhè shì yīgè zhōngwén jùzi (the same sentence as a Latin representation made of simpler, phonemizable words) -> works

I believe this is what G2PBackend actually does internally, but I could be mistaken.

RuntimeRacer avatar May 02 '23 09:05 RuntimeRacer

Ah I see. Makes sense.

nivibilla avatar May 02 '23 10:05 nivibilla

From my testing yesterday it was able to transfer dialect from input sample to output even if it's a different language.

Do you mean we can train it on multiple languages already?

What Vall-E X does is actually getting rid of dialect, or rather, converting the dialect from source language to target language.

Let me know if I get it right. Does this mean that input is lang A and output is lang B? I thought that the language id would control the accent instead as in "Learning to Speak Foreign Language Fluently".

RahulBhalley avatar May 03 '23 22:05 RahulBhalley

Do you mean we can train it on multiple languages already?

I have a PR open for the CommonVoice dataset, and it is currently training on 24 different languages on my AI training machine. Unfortunately I still cannot be sure it trains on the full dataset, because I hit this OOM error after ~164,500 steps each time; however, I implemented some code, and hopefully it is fixed the next time it hits the issue.

Let me know if I get it right. Does this mean that input is lang A and output is lang B? I thought that the language id would control the accent instead as in "Learning to Speak Foreign Language Fluently".

I might have understood that wrong as well. I just revisited their GitHub page; I think they actually can control the accent explicitly. In most examples they just switch between English and Chinese, so it's hard to tell whether it could, for example, do French with a German accent very well. It probably can.

RuntimeRacer avatar May 04 '23 00:05 RuntimeRacer

Further debugging revealed that the crash seems to happen frequently when there are Cyrillic letters in the batch; processing these takes considerably more time and VRAM. It's less broken than with Chinese / Japanese symbols, but it behaves similarly. So I believe the model also needs a phonetic conversion backend for these. I'm not sure, though, why it has these issues especially with Cyrillic letters, since that is just another alphabet, not too different from Latin. I will now strip all languages using a non-Latin alphabet from my training dataset and see if that fixes the issue.

I'm also going to provide a PR with exception handling code later, which adds some verbose output when the error is hit and (tries to) skip broken batches and continue training.
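A minimal sketch of such exception handling, assuming a trainer loop shaped like the traceback above (safe_backward and its arguments are illustrative names, not the PR's actual code):

```python
import torch

def safe_backward(loss, optimizer):
    """Run backward(); if the batch triggers a CUDA OOM, drop its gradients
    and skip it instead of crashing the whole training run."""
    try:
        loss.backward()
        return True
    except torch.cuda.OutOfMemoryError:
        print("CUDA OOM during backward; skipping this batch")
        optimizer.zero_grad(set_to_none=True)
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached blocks back to the driver
        return False
```

The caller would then only run scaler.step / optimizer.step when safe_backward returns True.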

RuntimeRacer avatar May 04 '23 08:05 RuntimeRacer

@RuntimeRacer do you think it would be useful to preprocess the data into phonemes and then give that to VALL-E? I feel like this would solve a lot of the OOM errors.

nivibilla avatar May 04 '23 08:05 nivibilla

@nivibilla I did follow the exact dataset preparation process for my CommonVoice training, so I assume the phoneme conversion has already happened. Also, it apparently IS able to process these letters and symbols, but the generation performance as well as the memory footprint is incredibly worse compared to Latin:

Elle est toujours utilisée par Réseau ferré italien pour le service de l'infrastructure. ("It is still used by the Italian rail network for infrastructure service.") -> This takes ~2 seconds in inference and 6.3 GB of VRAM

Әмма уңай тәэсирләр алып килүче мәгълүмат белән бергә, негатив хәбәрләр дә таратыла. ("However, along with information that has a positive impact, negative news is also spread.") -> This takes ~20 seconds in inference and 15 GB of VRAM
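For anyone reproducing these numbers, a sketch of how per-sentence inference time and peak VRAM can be measured (measure_inference is an illustrative helper; fn stands in for the actual inference call):

```python
import time
import torch

def measure_inference(fn, *args):
    """Run fn(*args) and return (output, seconds, peak GiB of CUDA memory)."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    else:
        peak_gib = 0.0  # CPU-only fallback: no CUDA memory to report
    return out, time.perf_counter() - t0, peak_gib
```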

RuntimeRacer avatar May 04 '23 08:05 RuntimeRacer

Ah right, so in inference we are using a lot of VRAM for phoneme conversion? That's strange.

nivibilla avatar May 04 '23 09:05 nivibilla

I'm finding it difficult to understand why there is a VRAM difference between languages. After converting to phonemes, why is there a difference in VRAM usage? I assume that after conversion, it shouldn't matter what language it is, unless the converted phoneme sequences are much longer when it's not English?

nivibilla avatar May 04 '23 09:05 nivibilla

I can share my symbols file later. But I don't think these are very different; I had a look at them before starting training.

RuntimeRacer avatar May 04 '23 09:05 RuntimeRacer

Almost forgot I wanted to share my symbols file from phonemization:

<eps> 0
! 1
" 2
( 3
) 4
, 5
. 6
1 7
: 8
; 9
? 10
_ 11
a 12
aɪ 13
aɪə 14
aɪɚ 15
aʊ 16
b 17
bn 18
d 19
dʑ 20
dʒ 21
e 22
enus 23
eɪ 24
f 25
h 26
hi 27
hy 28
i 29
iə 30
iː 31
iːː 32
j 33
k 34
kh 35
ko 36
l 37
m 38
n 39
nʲ 40
o 41
oʊ 42
oː 43
oːɹ 44
p 45
pa 46
q 47
r 48
s 49
t 50
tw 51
tɕ 52
tʃ 53
tʰ 54
u 55
uː 56
v 57
w 58
x 59
z 60
¡ 61
« 62
» 63
¿ 64
æ 65
ææ 66
ç 67
ð 68
ŋ 69
ɐ 70
ɐɐ 71
ɑ 72
ɑː 73
ɑːɹ 74
ɒ 75
ɔ 76
ɔɪ 77
ɔː 78
ɔːɹ 79
ə 80
əl 81
əʊ 82
ɚ 83
ɛ 84
ɛɹ 85
ɛː 86
ɜː 87
ɡ 88
ɡʰ 89
ɡʲ 90
ɪ 91
ɪɹ 92
ɪː 93
ɬ 94
ɯ 95
ɹ 96
ɾ 97
ʁ 98
ʃ 99
ʊ 100
ʊɹ 101
ʌ 102
ʌʌ 103
ʒ 104
ʔ 105
̃ 106
̩ 107
θ 108
ᵻ 109
— 110
“ 111
” 112
… 113

RuntimeRacer avatar May 04 '23 22:05 RuntimeRacer

Try lowering --filter-max-duration from 20 (the default value) to 14. You can use python ./bin/display_manifest_statistics.py to get the duration distribution.

lifeiteng avatar May 05 '23 09:05 lifeiteng

@lifeiteng Following my observations, I believe it is most likely a charset issue: https://github.com/lifeiteng/vall-e/issues/110#issuecomment-1534338087 Synthesis in English, French and German seems to work, however, so I don't believe it's an issue with multi-language training in general.

The duration distribution for CV with 24 languages (including languages with Cyrillic, Chinese and Japanese charsets) is in my commit here: https://github.com/lifeiteng/vall-e/pull/111/files#diff-aaf4d0ff4603a6956d6a4834fd5df31c65f62e95cee609f435828504c31a82fa

I will share my intermediate training model to allow further testing once CommonVoice epoch 1 has finished.

RuntimeRacer avatar May 05 '23 11:05 RuntimeRacer

Have you increased the macro NUM_TEXT_TOKENS? Token ids larger than this macro will cause out-of-bounds memory access. @lifeiteng How about making this macro configurable from command line args, leaving its default value at 512?

https://github.com/lifeiteng/vall-e/blob/168ace89e0b61c09bd97b4b6b986e47efb6eef91/valle/models/macros.py#L2
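A sketch of the sanity check this implies: every id in the phoneme symbol table must stay below NUM_TEXT_TOKENS, because an embedding lookup with a larger id is an out-of-bounds access on the GPU. check_symbol_ids and the "symbol id" line format (as in the symbols file shared above) are assumptions for illustration, not the repo's actual loader:

```python
NUM_TEXT_TOKENS = 512  # default in valle/models/macros.py

def check_symbol_ids(lines, limit=NUM_TEXT_TOKENS):
    """Parse 'symbol id' lines and return any ids that would index past the
    text-token embedding table. An empty result means the table fits."""
    ids = [int(line.rsplit(" ", 1)[1]) for line in lines if line.strip()]
    return [i for i in ids if i >= limit]

assert check_symbol_ids(["<eps> 0", "a 12", "θ 108"]) == []
assert check_symbol_ids(["x 600"]) == [600]
```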

chenjiasheng avatar May 14 '23 07:05 chenjiasheng

Further debugging revealed that the crash seems to happen frequently when there are Cyrillic letters in the batch; processing these takes considerably more time and VRAM. It's less broken than with Chinese / Japanese symbols, but it behaves similarly. So I believe the model also needs a phonetic conversion backend for these. I'm not sure, though, why it has these issues especially with Cyrillic letters, since that is just another alphabet, not too different from Latin. I will now strip all languages using a non-Latin alphabet from my training dataset and see if that fixes the issue.

I'm also going to provide a PR with exception handling code later, which adds some verbose output when the error is hit and (tries to) skip broken batches and continue training.

I'm running into this issue as well. I thought I had stripped out the non-Latin alphabet characters from my dataset, but I still run into the issue. It passes the:

Sanity check -- see if any of the batches in epoch 1 would cause OOM.

But then fails on a specific batch.

Are you also stripping punctuation? What are you doing to filter out the non-Latin alphabet characters?

yonomitt avatar Jun 07 '23 13:06 yonomitt

Have you increased the macro NUM_TEXT_TOKENS? Token ids larger than this macro will cause out-of-bounds memory access. @lifeiteng How about making this macro configurable from command line args, leaving its default value at 512?

https://github.com/lifeiteng/vall-e/blob/168ace89e0b61c09bd97b4b6b986e47efb6eef91/valle/models/macros.py#L2

So I tested this again. I tried values of 1024 and also 4096 now, but each time the training breaks as soon as the first Cyrillic sentence appears. I believe this is some encoding-related issue.

EDIT: I will check if I can somehow apply this while reading in the datasets: https://pypi.org/project/anyascii/0.1.6/
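Before reaching for transliteration, a stdlib-only check can at least flag which sentences contain non-Latin letters (has_non_latin is an illustrative helper, not part of the repo or of anyascii):

```python
import unicodedata

def has_non_latin(text):
    """Return True if text contains a letter outside the Latin script.
    Digits and punctuation are ignored, so French/German accents pass."""
    for ch in text:
        if ch.isalpha():
            try:
                name = unicodedata.name(ch)
            except ValueError:
                return True  # unnamed codepoint: treat as non-Latin
            if not name.startswith("LATIN"):
                return True  # e.g. CYRILLIC ..., CJK UNIFIED IDEOGRAPH ...
    return False

assert has_non_latin("Réseau ferré italien") is False
assert has_non_latin("Әмма уңай тәэсирләр") is True
```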

RuntimeRacer avatar Jun 13 '23 14:06 RuntimeRacer

@RuntimeRacer Can you post how the text_tokens_lens and audio_features_lens compare for a Cyrillic sentence?

I think my OOMs were due to text that was way too long compared to the audio, so the text_tokens_lens was way longer than it should have been. To fix this, I've been doing a filter pass which removes any data where:

(audio_features_lens / text_tokens_lens) < 1.0

Most good (English) data that I've spot-checked seems to have a ratio around 6.0-6.5, but I've seen it as low as 4.0, too.
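That filter pass can be sketched as follows (keep_sample is an illustrative name; the two lengths come from the collated batch fields mentioned above):

```python
def keep_sample(audio_features_len, text_tokens_len, min_ratio=1.0):
    """Drop samples whose audio-frames-per-text-token ratio is suspiciously
    low, i.e. where the text is far too long for the amount of audio."""
    if text_tokens_len <= 0:
        return False  # degenerate sample: no text tokens at all
    return (audio_features_len / text_tokens_len) >= min_ratio

assert keep_sample(620, 100) is True    # ratio 6.2: typical good English data
assert keep_sample(80, 100) is False    # ratio 0.8: text too long, likely bad
```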

yonomitt avatar Jun 13 '23 15:06 yonomitt