Retrieval-based-Voice-Conversion-WebUI Does this support languages other than English?

Hi,

I tried with Turkish speech, but the converted voice changes the pronunciation so much that it is not even understandable.

I tried increasing the training epochs but that resulted in even more deviation in pronunciation.

Is that because of the speech language or is it because of some problem with my training and input data?

Thank you in advance.

Jun 16 '23 00:06 PudyJapan

How much training data do you have after silence is removed?
Are you verifying there isn't background noise?
How many epochs are you training to?
Are you using v1 or v2 while training?
Are both the training and target audio in Turkish?

Jun 20 '23 03:06 sethtallen

Same here. I speak Catalan and Spanish. Really awesome results, and many thanks for the code! But it speaks/sings in my languages the way an English speaker would. It's very noticeable in the vowels. Is there any way to include non-English phonemes? Thanks in advance!

Jun 24 '23 14:06 cadid1961

@sethtallen Thank you for getting back to me and I am sorry for my late response.

How much training data do you have after silence is removed?
- Roughly 4 minutes
Are you verifying there isn't background noise?
- The background noise is minimal and almost inaudible. I used ultimatevocalremovergui.
How many epochs are you training to?
- 200
Are you using v1 or v2 while training?
- V2
Are both the training and target audio in Turkish?
- Yes

Jun 24 '23 22:06 PudyJapan

I would recommend more training data. Also save checkpoints and compare them against the 200-epoch model, e.g. models trained for 150, 100, and 50 epochs, and see if you notice a difference. I'm not an expert on model training, but I have seen people say you can 'overtrain' a model. V2 doesn't need as many epochs. Maybe 200 epochs on 4 minutes is too much. Still, I'd recommend at least 10 minutes.
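
If it helps to check how much audio is actually left once silence is excluded, here is a rough sketch using only the Python standard library. It is not part of the WebUI: the function name `non_silent_seconds`, the 30 ms frame size, and the -40 dB threshold are all assumptions chosen for illustration, and it only handles 16-bit PCM WAV files.

```python
import math
import struct
import wave


def non_silent_seconds(path, frame_ms=30, threshold_db=-40.0):
    """Estimate seconds of non-silent audio in a 16-bit PCM WAV file.

    A chunk counts as non-silent when its RMS level, relative to full
    scale, is above threshold_db. Both defaults are illustrative guesses,
    not values taken from the WebUI.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames_per_chunk = max(1, int(rate * frame_ms / 1000))
        voiced_frames = 0
        while True:
            raw = wav.readframes(frames_per_chunk)
            if not raw:
                break
            count = len(raw) // 2  # number of 16-bit samples in this chunk
            samples = struct.unpack("<%dh" % count, raw)
            rms = math.sqrt(sum(s * s for s in samples) / count)
            # Level in dB relative to full scale (32768 for 16-bit audio).
            level_db = 20 * math.log10(rms / 32768) if rms > 0 else float("-inf")
            if level_db > threshold_db:
                voiced_frames += count // channels
    return voiced_frames / rate
```

Calling this on each training clip and summing the results gives a rough total to compare against the 10-minute suggestion. The threshold will likely need adjusting for your recordings.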

I just want to add context from my personal perspective. I only speak English fluently, and I'm learning Japanese. When I convert Japanese audio with a model trained on English, I can't tell whether it sounds 'natural'. However, when I convert English audio using a model trained on Japanese, there are very clear pronunciations that a Japanese person speaking English would make. I assume this is because some sounds that exist in English don't exist in Japanese, so the model does not 'learn' them.

There are also Japanese, Korean, and Chinese communities involved in this, so I think it's safe to say it does support languages other than English.

Jun 26 '23 02:06 sethtallen

@sethtallen Sounds good, I will check the other epochs and will try with 10 minutes of training data and post the results. I appreciate your help.

Jun 26 '23 09:06 PudyJapan

For me, the problem was solved with more training. With 400 epochs it's almost OK!!! Thanks again!

Jun 28 '23 20:06 cadid1961

Would it work for Indian languages?

Mar 04 '24 17:03 ChengLong-AIMaster

This issue was closed because it has been inactive for 15 days since being marked as stale.

Apr 28 '24 04:04 github-actions[bot]