vits Questions about 48k audio file train

My terminal shows strange output since I started single-speaker training on audio with a sampling rate of 48000 Hz. My previous single-speaker run, on the same audio resampled to the default 22050 Hz sampling rate, finished with fine results.

After I run train.py, the terminal throws this message:

warning: audio amplitude out of range, auto clipped.

(I guess this wasn't the crucial problem?)

Then this message:

max value is tensor(33528.1016)
min value is tensor(-17584.6523)
max value is tensor(25380.4434)
min value is tensor(-38273.9297)
max value is tensor(50959.3125)
min value is tensor(-37103.1211)
max value is tensor(37702.8320)
min value is tensor(-33512.7734)

... for dozens of rows.

Then this:

[INFO] ====> Epoch: 1
/root/.local/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

Then the training shows weird losses like:

[INFO] Train Epoch: 21 [0%]
[INFO] [nan, nan, nan, nan, 2.0102107524871826, 208.6475067138672, 1600, 0.00019950059330492385]

Each of the first four elements is nan. None of the above happened with my previous 22050 Hz training, so I'm wondering why this occurs and what I can do. (I've already changed the sampling rate to 48k in the JSON file in /configs.) My apologies in advance if my questions are too basic.
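For anyone hitting the same log: the `max value is tensor(...)` lines show samples far outside [-1, 1]. One thing that may be worth ruling out is the bit depth of the input files. The sketch below assumes the data loader divides raw samples by `max_wav_value` from the config (assumed here to be 32768.0, which only suits 16-bit PCM). The function name `scaled_peak` is made up for illustration and is not part of the repo.

```python
import wave


def scaled_peak(path, max_wav_value=32768.0):
    """Return the largest absolute PCM sample divided by max_wav_value.

    Assumption: training scales raw integer samples by max_wav_value.
    A result above 1.0 means the file is not 16-bit scaled as expected.
    """
    with wave.open(path, "rb") as wf:
        width = wf.getsampwidth()  # bytes per sample
        raw = wf.readframes(wf.getnframes())
    if width == 1:
        raise ValueError("8-bit WAV is unsigned; not handled in this sketch")
    peak = 0
    # Decode each little-endian signed sample, whatever its width (16/24/32-bit).
    for i in range(0, len(raw) - width + 1, width):
        value = int.from_bytes(raw[i:i + width], "little", signed=True)
        peak = max(peak, abs(value))
    return peak / max_wav_value
```

A 16-bit file can never return more than 1.0 here, so a much larger value would point to 24-bit or 32-bit input. This does not prove that bit depth is the cause of the nan losses; it only rules one candidate in or out.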

Sep 17 '22 18:09 H4ppyB1rd

You may try 44.1 kHz, which worked for me (set sampling_rate = 44100 in config.json). Also make sure your audio is 1-channel, 16-bit WAV.
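A minimal sketch of that check, using only the Python standard library. The function name `check_wav` and the default `expected_rate` are illustrative assumptions; pass whatever sampling_rate your config.json actually uses.

```python
import wave


def check_wav(path, expected_rate=44100):
    """Return a list of header problems for one WAV file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"{wf.getnchannels()} channels, expected 1")
        if wf.getsampwidth() != 2:
            problems.append(f"{8 * wf.getsampwidth()}-bit samples, expected 16-bit")
        if wf.getframerate() != expected_rate:
            problems.append(f"{wf.getframerate()} Hz, expected {expected_rate} Hz")
    return problems
```

Running it over every file listed in the training filelist before starting train.py should catch a stray stereo or 48 kHz file early.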

Sep 20 '22 08:09 nikich340

Works for me. Thx!

Sep 21 '22 02:09 H4ppyB1rd

@nikich340 does your speech synthesis give good results? My result is OK, but the speech quality is not so good: there is still noise in it and some mispronunciation. Do you get the same problem?

Oct 03 '22 04:10 tuannvhust

Rarely. I use a good dataset (16 hours). If you have less than 2 hours of speech lines, don't expect stable good results.

I also edited the processing scripts so that they accept raw IPA phoneme input (I used espeak-ng IPA preprocessing), in case you want the model to generate some specific word. Make sure your input is unified (I used the punctuation signs .,?! and ..) and free of words from other languages and of quotes. Preprocessing should do this, but check manually anyway.
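A rough sketch of that kind of transcript cleanup, not the actual edited scripts. The function name `unify_line` and the exact rules are assumptions: it maps ellipsis variants to "..", drops quotes and any punctuation other than .,?! (keeping apostrophes and hyphens), and collapses whitespace. It does not try to detect words from other languages, so those still need a manual pass.

```python
import re


def unify_line(text):
    """Normalize one transcript line to a small, fixed punctuation set."""
    # Map the ellipsis character and runs of three or more dots to "..".
    text = re.sub(r"…|\.{3,}", "..", text)
    # Drop everything except letters/digits, whitespace, ' - and . , ? !
    # (this also removes straight and curly quotes, brackets, colons, etc.).
    text = re.sub(r"[^\w\s'.,?!-]", "", text)
    # Collapse repeated whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(unify_line('He said: “Wait…” (really?)'))  # He said Wait.. really?
```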

Oct 03 '22 04:10 nikich340

A 22050 Hz model produces low-quality speech (frequency range under 11 kHz), which can be checked in Adobe Audition or on a mel spectrogram.

I wonder if a 44100 Hz model can produce a wider frequency range, up to about 22 kHz? Thanks in advance.
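The 11k figure is the Nyquist limit: audio sampled at a given rate cannot represent frequencies above half that rate. A small sketch of the arithmetic (the helper name `nyquist_hz` is just for illustration):

```python
def nyquist_hz(sampling_rate):
    """Highest frequency representable at the given sampling rate."""
    return sampling_rate / 2


for sr in (8000, 22050, 44100, 48000):
    print(f"{sr} Hz sampling -> up to {nyquist_hz(sr)} Hz")
```

So 44100 Hz audio can carry content up to 22050 Hz. This is only the upper bound; whether a trained model actually fills that band presumably also depends on the training data and the mel settings in the config.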

[Image: low-frequency]

Dec 02 '22 01:12 codexq123

Hello @nikich340

I'm trying to train an 8000 Hz model with 2 hours of data. I changed the sampling rate in the config file before training, but the generated audio sounds like mumbling rather than proper speech.

Here is a sample of the original recording

The generated audio sounds like this

Can you suggest what is wrong with it?

Aug 11 '23 11:08 athenasaurav
