
Support for Hindi language

Open abhiprojectz opened this issue 2 years ago • 29 comments

@gitmylo, hello. I am currently trying to train the quantizer on a Hindi dataset.

I need to know how much time it would take to train on a P100 GPU, and also when I should stop the training.

For context, I have a dataset of approximately 7,000 wavs and their semantic files.

I also need to clarify: will the HuBERT base model work well for the Hindi language?

abhiprojectz avatar Jun 03 '23 07:06 abhiprojectz

With a dataset of 3000 files, 20 minutes on my RTX 3060 gave good results; I then trained it for an hour or so more. You can interrupt training at any point and check your latest model to see how well it performs.

gitmylo avatar Jun 03 '23 10:06 gitmylo

@gitmylo Thanks. I have trained for 15 epochs and plan to go to 24.

However, it took around 3 hours on a P100 GPU for just 15 epochs, and I reduced the files to around 6,000.

Any suggestions to improve the cloning results and speed up training?

abhiprojectz avatar Jun 04 '23 06:06 abhiprojectz

Look at it like this: if you have 3000 files and train for 24 epochs, for example, the result will still be worse than 6000 files for 12 epochs. An epoch means the model has gone through every file, so having more training data makes each epoch take longer.
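As a back-of-the-envelope check of the trade-off above (using the file counts mentioned in this thread as illustrative numbers):

```python
# More files per epoch vs. more epochs: both schedules below process the
# same total number of training samples, but the larger dataset contains
# more unique examples, which generally helps generalization.
files_small, epochs_small = 3000, 24
files_large, epochs_large = 6000, 12

samples_small = files_small * epochs_small  # 72000 samples seen
samples_large = files_large * epochs_large  # 72000 samples seen
print(samples_small, samples_large)
```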

Also, if and when you decide to upload your model and/or training data to Hugging Face, please post the URLs here so I can add them to the README.

gitmylo avatar Jun 04 '23 10:06 gitmylo

@gitmylo I think the HuBERT base model doesn't support the Hindi language, because my generated audio doesn't speak what's prompted in the text; instead it produces random words and noises.

For context, I tried two models:

- Model_A: trained for 23 epochs on 3,700 files (after the ready stage)
- Model_B: trained for 16 epochs on 7,783 files

Both yield poor results. Any suggestions, please? I have already spent a lot of time on this.


abhiprojectz avatar Jun 04 '23 15:06 abhiprojectz

Do the wavs used in training sound normal, though?

gitmylo avatar Jun 04 '23 15:06 gitmylo

@gitmylo Yes, I just checked multiple wavs in the prepared folder (though some files are pure noise) and they sound fine.

Can you please suggest, from your experience, what I should do?

If you can help, I may train for other languages too.

abhiprojectz avatar Jun 04 '23 16:06 abhiprojectz

Maybe there's a Hindi HuBERT model somewhere; you could try loading it.

gitmylo avatar Jun 04 '23 16:06 gitmylo

@gitmylo, I could not find any despite searching a lot; it would be nice if you could provide a link to one.

P.S.: Note that resources such as guides and pretrained models for the Hindi language are very rare.

An update: Model_A crossed 32 epochs, with losses as:

[screenshot of training losses]

abhiprojectz avatar Jun 04 '23 16:06 abhiprojectz

@gitmylo I assume the problem is that the HuBERT base model doesn't support Hindi. I checked the generated semantic_prompt by converting it to a waveform (semantic_to_waveform), and it speaks random words mixed with English words, even though the cloned speaker speaks Hindi exclusively.

P.S.: I cloned 5-6 speakers (clear voices); same poor results.

My conclusion is that after training for 35 epochs, the semantic vectors are not formed properly, or not in the desired language.

Thanks anyway. I will upload everything (the training data and both models), but they are of no use.

abhiprojectz avatar Jun 04 '23 16:06 abhiprojectz

Good news: I found a way to extract semantic vectors from wav2vec models without the main hubert_base model.

abhiprojectz avatar Jun 05 '23 05:06 abhiprojectz

Great. As long as they're at the same rate with the same number of features, it should work.

gitmylo avatar Jun 05 '23 05:06 gitmylo

Is it DistilHuBERT? There are different versions around: https://huggingface.co/models?search=distilhubert

I noticed it's on RVC too https://github.com/ddPn08/rvc-webui/pull/11

JonathanFly avatar Jun 05 '23 05:06 JonathanFly

@gitmylo Hey, I have one doubt: why haven't you used the hubert_base_ls960_L9_km500.bin quantizer? And what was the reason for training for the English language?

abhiprojectz avatar Jun 08 '23 11:06 abhiprojectz

I haven't used that quantizer because it is not compatible with Bark. It uses completely different values to represent the semantic features.

I trained on English because English is the most widely spoken language in the world, and it's supported by Bark.

gitmylo avatar Jun 08 '23 11:06 gitmylo

@gitmylo Thanks, just one last question.

Is it necessary to pass an input of size 768 to the tokenizer? I mean, can we pass an input of 1024 or so to a custom tokenizer (a new one that accepts an input size of 1024), and after tokenization, will the resulting semantic tokens be compatible with Bark?

My case is that I am training a new tokenizer model with input size 1024, and I just need to confirm with you whether the output will be Bark-compatible.


Extra info: my thinking behind this is that I found a well-trained wav2vec2 model from which I managed to extract semantic vectors, but the output is of size 1024. So I'm planning to train a new tokenizer. Should I proceed or not?

abhiprojectz avatar Jun 09 '23 13:06 abhiprojectz

HuBERT wav2vec outputs have 768 features; that's why I picked that number. If you want to use a different number, pass `input_size=1024` in the constructor.

The default input shape is (B, 768), where B is the batch size, and the output shape is (B, 1). With `input_size=1024`, the input shape is (B, 1024) and the output shape is still (B, 1).

Example: on line 161 of customtokenizer.py, in auto_train, change `model_training = CustomTokenizer(version=1).to('cuda')` to `model_training = CustomTokenizer(version=1, input_size=1024).to('cuda')`.

Make sure the wav2vec2 model extracts features at the same rate as HuBERT too, or you'll run into problems.
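A minimal sketch of the shape contract described above; `check_feature_batch` is a hypothetical helper for illustration, not part of the repo:

```python
# Hypothetical helper illustrating the (B, features) -> (B, 1) contract:
# each frame's feature count must match the input_size the tokenizer was
# constructed with (768 by default; 1024 for the wav2vec2 model discussed).

def check_feature_batch(batch, input_size=768):
    """Raise if any row's feature count differs from input_size."""
    for i, row in enumerate(batch):
        if len(row) != input_size:
            raise ValueError(
                f"row {i}: expected {input_size} features, got {len(row)}"
            )
    # the quantizer maps (B, input_size) features to (B, 1) semantic tokens
    return (len(batch), input_size), (len(batch), 1)

# A batch of 4 frames from a 1024-dim wav2vec2 model:
in_shape, out_shape = check_feature_batch([[0.0] * 1024] * 4, input_size=1024)
print(in_shape, out_shape)  # (4, 1024) (4, 1)
```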

gitmylo avatar Jun 09 '23 14:06 gitmylo

Thanks. Can you please shed some light on the rate? I mean, what is the required rate?

> Make sure the Wav2Vec extracts features at the same rate as HuBERT too

For example, this indicwav2vec-hindi model is trained with fairseq.

abhiprojectz avatar Jun 09 '23 14:06 abhiprojectz

About 50x768 features per second, or 50x1024 in your case. If it's slightly different, that's fine.

gitmylo avatar Jun 09 '23 14:06 gitmylo

Is hubert_base_ls960.pt pretrained only on English?

xiabo2011 avatar Jul 04 '23 15:07 xiabo2011


> Is hubert_base_ls960.pt pretrained only on English?

It seems to work with more than just English, though not every single language.

gitmylo avatar Jul 04 '23 16:07 gitmylo

@gitmylo, per HuBERT's training specs, it seems it was trained on the librispeech_asr dataset, which is monolingual (English only).

Additionally, it is labelled English only.

Could you confirm: do the quantizer or the semantic features returned from the HuBERT model have anything to do with language?

https://huggingface.co/facebook/hubert-base-ls960

abhiprojectz avatar Jul 04 '23 16:07 abhiprojectz

> @gitmylo, per HuBERT's training specs, it seems it was trained on the librispeech_asr dataset, which is monolingual (English only). Additionally, it is labelled English only. Could you confirm: do the quantizer or the semantic features returned from the HuBERT model have anything to do with language? https://huggingface.co/facebook/hubert-base-ls960

They do have something to do with language, but that won't stop you from creating a good quantizer for a non-English language, since the model can still recognize the patterns; it's mostly human speech sounds anyway. It shouldn't be restricted to just English simply because the quantizer is English.

gitmylo avatar Jul 04 '23 16:07 gitmylo

@abhiprojectz I had success training the Portuguese language yesterday; before that I was getting less-than-ideal results (the model hallucinated a lot more and the voice clones were bad all around).

I used HuBERT.

What I did was:

1. Lowered the learning rate (in my case to 0.0005): https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/blob/master/hubert/customtokenizer.py#L56
2. Redid the dataset. I used the Bible, as religious texts apparently tend to produce cadenced speech more often and use more formal language (better for tokens), and produced over 4000 files (4249 to be exact; in your case, for Hindi, you should probably use more).
3. Trained 25 epochs (I selected the 24th-epoch model as the best one, though).
4. Tested each model to check which ones produce good audio and accurate cloned voices.

Try lowering the learning rate as much as you bearably can and let it train for several epochs until you find a sweet spot or notice a change in audio generation.

Since you mentioned you were getting "random words and noises", I suggest selecting a learning rate of 0.0001 or below in order to not "damage" the model.
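A toy illustration (not the repo's training loop) of why too large a step size produces the kind of instability described above, using plain gradient descent on f(x) = x^2:

```python
# Gradient descent on f(x) = x**2 starting from x = 1.0. With a large
# learning rate the iterate overshoots the minimum and diverges; with a
# small one it slowly shrinks toward the minimum at 0.

def final_distance(lr, steps=50, x=1.0):
    """Distance from the optimum after `steps` gradient updates."""
    for _ in range(steps):
        grad = 2 * x      # f'(x) for f(x) = x**2
        x -= lr * grad
    return abs(x)

print(final_distance(1.1))     # large lr: distance grows every step
print(final_distance(0.0005))  # small lr: converges slowly toward 0
```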

Subarasheese avatar Jul 16 '23 05:07 Subarasheese

@Subarasheese thanks for the insight,

Just to clarify: you used HuBERT base, right? (https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt)

I'm trying to train HuBERT from scratch on a multilingual (English + Indonesian) dataset after reading this thread,

but now I will try using HuBERT base first.

acul3 avatar Jul 16 '23 15:07 acul3

> @Subarasheese thanks for the insight. Just to clarify: you used HuBERT base, right? I'm trying to train HuBERT from scratch on a multilingual (English + Indonesian) dataset, but now I will try using HuBERT base first.

Yes, I used base HuBERT. I would suggest you just train on top of the base HuBERT model before trying to train from scratch.

Subarasheese avatar Jul 18 '23 06:07 Subarasheese

I'm currently fine-tuning HuBERT on common_voice_11_0 Hindi; let's see how it goes.

sachaarbonel avatar Nov 04 '23 15:11 sachaarbonel

> I'm currently fine-tuning HuBERT on common_voice_11_0 Hindi; let's see how it goes.

Any update on this?

Surojit-KB avatar Jan 12 '24 20:01 Surojit-KB

Can someone share results on Hindi cloning?

super-animo avatar Apr 11 '24 07:04 super-animo