Train voice with 44 kHz sampling rate

Open donlk opened this issue 1 year ago • 6 comments

Hi! I have approx. 1.5 hours of voice audio at 44 kHz and would like to train a usable model from it. I don't want to retrain from an existing checkpoint, as the pre-trained checkpoints are all 22 kHz and sound muddy. I tried training from scratch, specifying the correct sampling_rate of 44100. Training reached 2000 epochs, but the inferred audio was way too fast, skipping words in the process.

What should I modify or patch in to make this work?

Thanks!

donlk avatar Sep 15 '24 13:09 donlk

I suggest resampling your data to 22050 Hz. You can use ffmpeg to do so.
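A minimal sketch of that resampling step, assuming ffmpeg is on the PATH and the dataset is a flat folder of WAV files. The folder layout and the helper names `build_resample_cmd` and `resample_dir` are hypothetical; only ffmpeg's `-i`, `-y` (overwrite) and `-ar` (output sample rate) options are used.

```python
import subprocess
from pathlib import Path


def build_resample_cmd(src: Path, dst: Path, rate: int = 22050) -> list[str]:
    """Build an ffmpeg command that resamples src to `rate` Hz and writes dst."""
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(rate), str(dst)]


def resample_dir(src_dir: Path, dst_dir: Path, rate: int = 22050) -> None:
    """Resample every .wav in src_dir into dst_dir (hypothetical layout)."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    for wav in sorted(src_dir.glob("*.wav")):
        cmd = build_resample_cmd(wav, dst_dir / wav.name, rate)
        subprocess.run(cmd, check=True)  # raises CalledProcessError if ffmpeg fails
```

Something like `resample_dir(Path("wavs_44k"), Path("wavs_22k"))` would then write the converted copies next to the originals, leaving the 44.1 kHz files untouched.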

agonzalezd avatar Sep 17 '24 13:09 agonzalezd

I would rather avoid that if possible, due to the huge quality loss.

donlk avatar Sep 18 '24 23:09 donlk

Make sure the sample rate is set correctly everywhere, not just in training but also in inference: https://github.com/search?q=repo%3Arhasspy%2Fpiper%2022050&type=code
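If you have a local checkout, the same search can be sketched in a few lines of Python. This is only an illustration: the function name `find_hardcoded_rate` and the list of file suffixes are my assumptions, not anything defined in the repository.

```python
import re
from pathlib import Path

# Match 22050 as a standalone number, not as part of a longer one.
PATTERN = re.compile(r"\b22050\b")


def find_hardcoded_rate(root: Path, suffixes=(".py", ".json", ".cpp", ".hpp")):
    """Yield (path, line_number, line) for each line mentioning 22050."""
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                yield path, number, line.strip()
```

Looping over `find_hardcoded_rate(Path("piper"))` and printing each hit gives a checklist of places where the rate may need to become 44100.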

Other than that, my guess is that you would need to adapt the decoder parameters here: https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L30
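As a rough sketch of what adapting those parameters could involve: in VITS-style models the decoder upsamples each spectrogram frame by the product of its upsample rates, so that product has to equal the STFT hop length. The numbers below are assumptions for illustration (a 22050 Hz baseline with hop length 256, and a hypothetical doubled variant for 44100 Hz), not settings confirmed to work in Piper, so check them against config.py.

```python
from math import prod


def frame_ms(sample_rate: int, hop_length: int) -> float:
    """Duration of one spectrogram frame hop in milliseconds."""
    return 1000.0 * hop_length / sample_rate


def decoder_matches_hop(upsample_rates, hop_length: int) -> bool:
    """True if the decoder emits exactly hop_length samples per frame."""
    return prod(upsample_rates) == hop_length


# Assumed 22050 Hz baseline: hop 256, upsample rates multiplying to 256.
base = {"sample_rate": 22050, "hop_length": 256, "upsample_rates": (8, 8, 2, 2)}
# Hypothetical 44100 Hz variant: double the hop and one upsample stage.
wide = {"sample_rate": 44100, "hop_length": 512, "upsample_rates": (8, 8, 4, 2)}

for cfg in (base, wide):
    assert decoder_matches_hop(cfg["upsample_rates"], cfg["hop_length"])
    ms = frame_ms(cfg["sample_rate"], cfg["hop_length"])
    print(cfg["sample_rate"], "Hz:", round(ms, 2), "ms per frame")
```

Changing only the sample rate while keeping hop length 256 halves the time each frame covers, which is why the two have to be considered together. The filter and window lengths and the upsample kernel sizes would likely need matching changes as well.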

Luke100000 avatar Oct 07 '24 15:10 Luke100000

@donlk I am trying the exact same thing as you did. Only wish I had seen this before wasting the money on the training. Did you figure out any final solution to this?

DK013 avatar Aug 28 '25 05:08 DK013

The audio parameters that @Luke100000 linked are tuned for 22 kHz (not by me, by the original authors). Did you choose "high" quality when using 44 kHz data?

synesthesiam avatar Aug 28 '25 21:08 synesthesiam

> The audio parameters that @Luke100000 linked are tuned for 22 kHz (not by me, by the original authors). Did you choose "high" quality when using 44 kHz data?

In my case I went with "medium" quality. Since the docs list the same sample rate for both medium and high, I figured the result would be the same.

DK013 avatar Sep 04 '25 05:09 DK013