vits

Questions about 48k audio file train

Open H4ppyB1rd opened this issue 2 years ago • 6 comments

My terminal shows strange output when I start single-speaker training on audio sampled at 48000 Hz. An earlier single-speaker run on the same audio, resampled to the default 22050 Hz, trained fine.

After I run train.py, the terminal throws this message:

  • warning: audio amplitude out of range, auto clipped.

(I guess this wasn't the crucial problem?)

Then this message:

  • max value is tensor(33528.1016)
  • min value is tensor(-17584.6523)
  • max value is tensor(25380.4434)
  • min value is tensor(-38273.9297)
  • max value is tensor(50959.3125)
  • min value is tensor(-37103.1211)
  • max value is tensor(37702.8320)
  • min value is tensor(-33512.7734)

... for dozens of rows.
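Values this far outside [-1, 1] would be consistent with samples wider than 16 bits being scaled by a 16-bit constant. The sketch below illustrates the arithmetic only. It assumes the loader divides raw integer samples by 32768.0, which is not shown in this thread, and the function names are hypothetical.

```python
# Assumption: the data loader normalizes raw PCM samples by a 16-bit full-scale constant.
INT16_FULL_SCALE = 32768.0


def normalized_peak(raw_peak: int, full_scale: float = INT16_FULL_SCALE) -> float:
    """Scale a raw integer sample the way a 16-bit-oriented loader would."""
    return raw_peak / full_scale


def in_unit_range(value: float) -> bool:
    """True when the normalized sample lies inside the expected [-1, 1] range."""
    return -1.0 <= value <= 1.0


peak16 = normalized_peak(2**15 - 1)  # largest 16-bit sample
peak32 = normalized_peak(2**31 - 1)  # largest 32-bit sample

print(peak16, in_unit_range(peak16))
print(peak32, in_unit_range(peak32))
```

Under that assumption, a reported maximum of 33528 corresponds to a raw sample of roughly 1.1e9, which fits in 32 bits but not in 16.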

Then this:

  • [INFO] ====> Epoch: 1
  • /root/.local/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  • "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

Then the training shows weird losses like:

  • [INFO] Train Epoch: 21 [0%]
  • [INFO] [nan, nan, nan, nan, 2.0102107524871826, 208.6475067138672, 1600, 0.00019950059330492385]

Each of the first four elements is nan. None of the above happened with my previous 22050 Hz training run, so I'm wondering why this occurs and what I can do. (I've already changed the JSON file in /configs to a 48k sampling rate.) My apologies in advance if my questions are too basic.
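To spot this kind of failure early, a logged row can be scanned for nan entries. This is a minimal sketch: `nan_positions` is a hypothetical helper, and which loss term each position corresponds to depends on train.py, which is not shown here.

```python
import math


def nan_positions(values):
    """Return the indices of entries that are floating-point nan."""
    return [i for i, v in enumerate(values) if isinstance(v, float) and math.isnan(v)]


# The row logged at epoch 21 in the post above.
logged = [float("nan")] * 4 + [
    2.0102107524871826,
    208.6475067138672,
    1600,
    0.00019950059330492385,
]

print(nan_positions(logged))
```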

H4ppyB1rd, Sep 17 '22 18:09

You may try 44.1 kHz, which worked for me (set sampling_rate = 44100 in config.json). Also make sure your audio is 1-channel, 16-bit WAV.
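The format advice above can be checked before training with Python's standard `wave` module. This is a minimal sketch, not part of the repository: `wav_problems` and `write_silence` are hypothetical helpers, and the two files are generated only for the demonstration.

```python
import os
import tempfile
import wave


def wav_problems(path, expected_rate):
    """Return a list of header mismatches; an empty list means the file looks fine."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"{wf.getnchannels()} channels, expected 1")
        if wf.getsampwidth() != 2:
            problems.append(f"{8 * wf.getsampwidth()}-bit samples, expected 16-bit")
        if wf.getframerate() != expected_rate:
            problems.append(f"{wf.getframerate()} Hz, expected {expected_rate} Hz")
    return problems


def write_silence(path, rate, channels=1, sampwidth=2, frames=100):
    """Write a short silent PCM file, used here only to exercise the check."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sampwidth)
        wf.setframerate(rate)
        wf.writeframes(b"\x00" * (frames * channels * sampwidth))


tmp_dir = tempfile.mkdtemp()
good = os.path.join(tmp_dir, "good.wav")
bad = os.path.join(tmp_dir, "bad.wav")
write_silence(good, 44100)                         # mono, 16-bit, 44.1 kHz
write_silence(bad, 48000, channels=2, sampwidth=4)  # stereo, 32-bit, 48 kHz

print(wav_problems(good, 44100))
print(wav_problems(bad, 44100))
```

Running the check over every file in the dataset directory, with `expected_rate` set to the `sampling_rate` from config.json, should flag files that need converting.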

nikich340, Sep 20 '22 08:09

Works for me. Thx!

H4ppyB1rd, Sep 21 '22 02:09

@nikich340 does your speech synthesis give good results? My result is OK, but the speech quality is not so good: there is still noise in it and some mispronunciation. Do you get the same problem?

tuannvhust, Oct 03 '22 04:10

Rarely; I use a good dataset (16 hours). If you have less than 2 hours of speech lines, don't expect stable good results.

Also, I edited the processing scripts so that they accept straight IPA phoneme input (I used espeak-ng IPA preprocessing), in case you want the model to generate some specific word. Make sure your input is unified (I used the punctuation signs .,?! and ..), and get rid of foreign-language words and quotes. Preprocessing should do it, but check manually anyway.

nikich340, Oct 03 '22 04:10

The 22050 Hz model produces low-quality speech (frequency range under 11 kHz), which can be checked using Adobe Audition or a mel spectrogram.

I wonder if the 44100 Hz model can produce a wider frequency range, up to about 22 kHz? Thanks in advance.
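The 11 kHz ceiling matches the Nyquist limit: a signal sampled at a given rate can only represent frequencies up to half that rate. The arithmetic for the rates mentioned in this thread can be sketched as follows (`nyquist_hz` is a hypothetical helper). Whether a trained model actually fills the wider band is a separate question that this calculation does not answer.

```python
def nyquist_hz(sampling_rate_hz: int) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sampling_rate_hz / 2


for rate in (8000, 22050, 44100, 48000):
    print(f"{rate} Hz sampling -> content up to {nyquist_hz(rate):.0f} Hz")
```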

[Image: low-frequency]

codexq123, Dec 02 '22 01:12

Hello @nikich340

I'm trying to train an 8000 Hz model with 2 hours of data. I changed the sampling rate in the config file before training, but the generated audio sounds like mumbling rather than proper speech.

Here is a sample of the original recording

The generated audio sounds like this

Can you suggest what is wrong with it?
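One thing that may be worth checking, offered as a guess rather than a diagnosis: when only the sampling rate is changed, window and hop sizes given in samples cover a different span of time. The sketch below assumes a hop of 256 samples and a window of 1024 samples, which are illustrative values and not taken from this thread; `frame_ms` is a hypothetical helper.

```python
def frame_ms(n_samples: int, sampling_rate_hz: int) -> float:
    """Duration in milliseconds covered by n_samples at the given sampling rate."""
    return 1000.0 * n_samples / sampling_rate_hz


HOP = 256      # assumed hop length in samples
WINDOW = 1024  # assumed window length in samples

for rate in (22050, 8000):
    print(
        f"{rate} Hz: hop = {frame_ms(HOP, rate):.1f} ms, "
        f"window = {frame_ms(WINDOW, rate):.1f} ms"
    )
```

With these assumed values, each frame at 8000 Hz spans almost three times as much audio as at 22050 Hz, so the spectrogram settings in the config may need to be revisited along with the sampling rate.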

athenasaurav, Aug 11 '23 11:08