
Problems with the pronunciation of single words.

Open LanglyAdrian opened this issue 1 year ago • 13 comments

I downloaded the dataset and started training with all the default parameters; I changed only the batch size, to 32. I reached 700k steps. The model pronounces long phrases well, but if it is a single word, the result is terrible. I don't think it makes sense to continue training.

LanglyAdrian avatar Mar 28 '23 21:03 LanglyAdrian

@jaywalnut310, hi! Can you help me? I'm ready to pay! You created a great project and got great results, but I can't reproduce them. After reading this, I suspected that the problem is that I changed the batch size. Could you tell me exactly how to change the rest of the parameters if I changed the batch size from 64 to 32?

LanglyAdrian avatar Mar 29 '23 11:03 LanglyAdrian

Hi, what dataset do you use? And what problem exactly do you want to solve? Short-phrase pronunciation?

NikitaKononov avatar Mar 30 '23 22:03 NikitaKononov

We can connect elsewhere to solve your issue faster, if it's still relevant.

NikitaKononov avatar Mar 30 '23 22:03 NikitaKononov

@NikitaKononov , hi!

I trained the model 2 times.

  1. I downloaded this dataset, resampled the wav files to 22050 Hz, then deleted 80 entries (both the wav files and the filelist lines) that were larger than 500 KB, since no .spec.pt files were created for them. Because my goal was to train on my own dataset (about 35 minutes), I replaced one of the voices (ID 78) with mine; I chose 78 because it contained the same number of files as my dataset. I ran "python preprocess.py --text_index 1..." and noticed that some phrases (about 500) differ from those the author got. I decided to use the new ones (the ones I got), because I didn't want any difference between the phonemization of my dataset and everyone else's. Because I couldn't run with a batch size of 64 (I have a GeForce RTX 3090), I reduced it to 32. I got the following results. Phrase 1: "These wards were all fitted with barrack-beds, but no bedding was supplied." (my sample, author's sample). Phrase 2: "capital" (my sample, author's sample).

As you can see, everything is fine with a long phrase, but with a short one…

I thought the problem might be that I used my own dataset instead of voice 78, but when I found this, I realized the problem is far from being only mine, so my dataset has nothing to do with it. Then I thought the problem might be the silence present in many of the files, and I started looking for datasets elsewhere.

  2. I found this dataset and noticed that it seems to have no problems with silence. I dropped IDs 315 and 362 because of mismatches between the wav files and the filelists; instead I added my own dataset and s5 (whose voice was in the dataset from the new source). I converted the files from flac to wav and resampled them to 22050 Hz. Then I again deleted the files larger than 500 KB, created new filelists leaving 500 files for test and 100 for val, as in the original, and started training (again with a batch size of 32). I got results similar to the first training run.
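The file-size filtering and test/val/train split described above can be sketched roughly as follows. This is an illustrative helper, not code from the repo: the 500 KB threshold and the 500/100 split come from my description, while the function name and the pipe-delimited filelist layout ("path|speaker_id|text") are assumptions.

```python
import os
import random

def filter_and_split(entries, wav_root, max_bytes=500_000,
                     n_test=500, n_val=100, seed=1234):
    """Drop filelist entries whose wav exceeds max_bytes, then split.

    `entries` are lines like "wavs/0001.wav|78|some text" (format
    assumed); returns (train, val, test) lists mirroring the
    500-test / 100-val split described above.
    """
    kept = [e for e in entries
            if os.path.getsize(os.path.join(wav_root, e.split("|")[0]))
            <= max_bytes]
    # Deterministic shuffle so the split is reproducible across runs.
    random.Random(seed).shuffle(kept)
    test = kept[:n_test]
    val = kept[n_test:n_test + n_val]
    train = kept[n_test + n_val:]
    return train, val, test
```

Resampling to 22050 Hz would be done beforehand with any audio tool; only the filtering and splitting are shown here.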

If you feel more comfortable elsewhere, we could continue on Facebook, Twitter, or email. However, in the end, I'd like to post the solution here to help others.

LanglyAdrian avatar Mar 30 '23 23:03 LanglyAdrian

@NikitaKononov , By the way, if you are from Russia, we could communicate in Russian. My English is very bad.

LanglyAdrian avatar Mar 31 '23 08:03 LanglyAdrian

Telegram @drakononov e-mail [email protected]

NikitaKononov avatar Mar 31 '23 08:03 NikitaKononov

I use an RTX 3090 right now; it can handle batch size 64 with AMP (fp16_train = true in the config). If you decrease the batch size from 64 to 32, you should also halve the learning rate, from 2e-4 to 1e-4; the same applies when increasing it. For GANs this is important.
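The rule above is the common linear-scaling heuristic: learning rate proportional to batch size. A minimal sketch (the base values 64 and 2e-4 come from this thread; the function name is illustrative):

```python
def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Scale the learning rate proportionally to the batch size
    (linear-scaling heuristic, as suggested in the comment above)."""
    return base_lr * new_batch / base_batch

# Batch size halved from 64 to 32 -> learning rate halved as well.
print(scale_lr(2e-4, 64, 32))  # 0.0001
```

The scaled value would then replace `learning_rate` in the training config before resuming or restarting training.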

VCTK uses a "magic" sentence set that tries to maximize the phonetic coverage of each speaker, so 35 minutes of ordinary data may not be enough. Could you show a couple of examples from your dataset, so I can evaluate your data quality (slicing, transcription)?
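The "phonetic coverage" idea above can be made concrete with a rough character-level proxy. This is only a sketch: a faithful check would run the transcripts through the same phonemizer that preprocess.py uses, and the function name and inventory are illustrative.

```python
def coverage(texts, inventory):
    """Fraction of a symbol inventory observed across one speaker's
    transcripts. Character-level proxy for phonetic coverage; a real
    check would compare phoneme sequences, not raw characters.
    """
    seen = set("".join(texts).lower())
    return len(seen & set(inventory)) / len(inventory)
```

A speaker whose coverage is well below the corpus average is a candidate for the "not enough varied data" problem described above.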

NikitaKononov avatar Mar 31 '23 08:03 NikitaKononov

In your samples I can clearly hear the data hunger typical for VITS; or it could be a symptom of poor data markup quality, or both.

The LR of course matters too.

NikitaKononov avatar Mar 31 '23 09:03 NikitaKononov

If it is caused by data hunger, how much data is needed for each speaker if I build a multi-speaker instance?

JohnHerry avatar Apr 19 '23 06:04 JohnHerry

@JohnHerry If your question concerns the problem I described, then it is not caused by a lack of data.

LanglyAdrian avatar Apr 24 '23 18:04 LanglyAdrian

Hi, @NikitaKononov! Is it possible to use a VCTK-style Russian dataset?

inventor617 avatar May 10 '23 15:05 inventor617

Try adding short phrases to your dataset. If the model is trained to produce some phonemes only in connection with others, it can't handle a single word well.
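One quick way to check for the gap described above is to count how many filelist entries are only one or two words long. A sketch, assuming the pipe-delimited VITS-style filelist format with the transcript as the last field (the function name is illustrative):

```python
from collections import Counter

def phrase_length_histogram(filelist_path: str) -> Counter:
    """Count utterances in a filelist by word count.

    Assumes pipe-delimited lines where the last field is the text,
    e.g. "wavs/0001.wav|78|Some transcript here." (format assumed).
    """
    counts = Counter()
    with open(filelist_path, encoding="utf-8") as f:
        for line in f:
            text = line.rstrip("\n").split("|")[-1]
            counts[len(text.split())] += 1
    return counts

# Usage: few or no 1-2 word entries suggests the model never sees
# isolated words during training.
# hist = phrase_length_histogram("filelists/train.txt")
# print(sum(v for k, v in hist.items() if k <= 2), "short utterances")
```

If the histogram shows almost no short utterances, adding some (or slicing existing recordings into single words) addresses exactly the failure mode reported in this issue.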

nikich340 avatar Jun 12 '23 04:06 nikich340

If it is caused by data hunger, how much data is needed for each speaker if I build a multi-speaker instance?

About 2 hours per speaker is the minimum for a good result.

nikich340 avatar Jun 12 '23 04:06 nikich340