
Whisper-large instead of whisper small?

sleepingcat4 opened this issue 9 months ago · 22 comments

I read the code, but it isn't clear to me: if I change Whisper-small to Whisper-large, which output dim should I change? @Plachtaa do you have any hints or directions?

sleepingcat4 avatar Feb 15 '25 10:02 sleepingcat4

hi @sleepingcat4 , simply changing model_params.length_regulator.in_channels to 1280 in the config file to match the whisper-large encoder output dim should work; don't forget to finetune the model after you make the change

Plachtaa avatar Feb 15 '25 10:02 Plachtaa
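
For anyone following along, a minimal sketch of what that edit looks like in the preset YAML (only the in_channels value comes from Plachtaa's comment; the surrounding structure follows the config file being discussed):

model_params:
  length_regulator:
    in_channels: 1280  # was 768 for whisper-small; 1280 matches the whisper-large encoder dim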

thanks @Plachtaa for the quick answer! btw, if I want to increase the parameter count to up to 1B, what changes should be made to the DiT architecture? do you have any advice?

sleepingcat4 avatar Feb 15 '25 11:02 sleepingcat4

I don't really suggest you do so, as the merit of a VC model is being real-time and lightweight; the task is not difficult enough to be worth scaling up to 1B

Plachtaa avatar Feb 15 '25 11:02 Plachtaa

@Plachtaa I wanted to experiment and see how it behaves, since I had some spare compute. I was thinking of increasing the hidden dim of the DiT, but if you could offer some advice, for experimentation only, it would be nice.

sleepingcat4 avatar Feb 15 '25 11:02 sleepingcat4

for your reference

[image attachment]

Plachtaa avatar Feb 15 '25 11:02 Plachtaa
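
As a rough sketch of the kind of scaling being discussed (the field names below are hypothetical, not taken from the repo's configs): transformer parameter count grows roughly as 12 × depth × hidden_dim², so a ~1B-parameter DiT could look like:

hidden_dim: 2048   # hypothetical; width scales parameter count quadratically
depth: 24          # number of DiT blocks
num_heads: 16      # keeps head dim at 128
# rough estimate: 12 * 24 * 2048^2 ≈ 1.2B parameters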

@Plachtaa thank you for being so helpful. another question: if I change the vocoder model from "nvidia/bigvgan_v2_22khz_80band_256x" to "nvidia/bigvgan_v2_44khz_128band_512x" (https://huggingface.co/nvidia/bigvgan_v2_44khz_128band_512x), which params should I change in the config?

sleepingcat4 avatar Feb 15 '25 14:02 sleepingcat4

see https://github.com/Plachtaa/seed-vc/blob/97544fff2a57db718424c13ecd77f415bd2c7d4d/configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml#L14

Plachtaa avatar Feb 15 '25 14:02 Plachtaa
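
For context, when swapping to bigvgan_v2_44khz_128band_512x, the mel settings in the config also have to match the vocoder's expectations (44.1 kHz audio, 128 mel bands, a 512-sample hop). A sketch, with field names assumed from the 44k preset linked above rather than copied from it:

preprocess_params:
  sr: 44100          # bigvgan_v2_44khz_* operates at 44.1 kHz
  spect_params:
    n_fft: 2048
    hop_length: 512  # must match the vocoder's 512x upsampling factor
    num_mels: 128    # must match the 128-band vocoder

vocoder:
  type: "bigvgan"
  name: "nvidia/bigvgan_v2_44khz_128band_512x"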

@Plachtaa and @sleepingcat4 I need help. I followed Plachtaa's instructions and edited the config_dit_mel_seed_uvit_whisper_small_wavenet.yml config file, changing

in_channels: 768

to

in_channels: 1280 

and

name: "openai/whisper-small" 

to

name: "openai/whisper-large". 

I also tried changing it to

name: "openai/whisper-large-v2" 

and

name: "openai/whisper-large-v3" 

and

name: "openai/whisper-large-v3-turbo". 

Then I tried fine-tuning:

python train.py --config ./presets/config_dit_mel_seed_uvit_whisper_small_wavenet.yml --dataset-dir Cartoon-Voice --run-name Cartoon-Voice --batch-size 2 --max-steps 600 --max-epochs 1000 --save-every 600 --num-workers 0

And no matter which whisper-large model I use, the output file always starts speaking in gibberish. Is it because the dataset I gathered totals only 3 minutes and 20 seconds? https://vocaroo.com/1cTRV93JtsPQ

For reference: source audio https://vocaroo.com/1jKAso7ZrC4C, reference audio https://vocaroo.com/1jM7CIP8gROA

GUUser91 avatar Feb 21 '25 02:02 GUUser91

@GUUser91 You must run pretraining on a large-scale dataset once you swap the encoder from whisper-small to whisper-large-v3

Plachtaa avatar Feb 21 '25 07:02 Plachtaa
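
The incompatibility is easy to confirm: the whisper encoder width jumps from 768 to 1280, so a checkpoint pretrained against whisper-small features no longer lines up. A quick sanity check with transformers (not part of the repo):

from transformers import AutoConfig

# encoder hidden size (d_model): whisper-small -> 768, whisper-large-v3 -> 1280
for name in ["openai/whisper-small", "openai/whisper-large-v3"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.d_model)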

@GUUser91 definitely, you should gather more data. I trained on 2 hours and got awesome results!

sleepingcat4 avatar Feb 22 '25 17:02 sleepingcat4

@sleepingcat4 Did you fine tune a model or did you train from scratch?

GUUser91 avatar Feb 22 '25 18:02 GUUser91

@GUUser91 I just fine-tuned

sleepingcat4 avatar Feb 22 '25 18:02 sleepingcat4

@sleepingcat4 How many steps did you fine tune the model for?

GUUser91 avatar Feb 22 '25 18:02 GUUser91

@GUUser91 400 steps

sleepingcat4 avatar Feb 22 '25 18:02 sleepingcat4

I'm doing the same, but it doesn't detect any GPU to train on. Training is too slow on CPU, but I do have a GPU (RTX 4060)

Gonzaluigi avatar Feb 24 '25 12:02 Gonzaluigi
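
A quick way to check whether PyTorch can see the card (if this prints False, the installed torch is a CPU-only build and a CUDA-enabled wheel is needed):

import torch

print(torch.cuda.is_available())          # False -> CPU-only build or driver problem
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the RTX 4060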

hi @sleepingcat4, did you change whisper-small to whisper-large-v3 and finetune with only 4 hours of data? One more question: is the 4h total from a single-speaker dataset or a multi-speaker dataset? Many thanks!

leminhnguyen avatar Mar 06 '25 07:03 leminhnguyen

@leminhnguyen Yes, I trained on a multi-speaker dataset with almost 4 hours of data. My data was very high quality; it's an 11labs dataset that I developed and open-sourced a few days earlier on HF under my lab, Sleeping AI.

sleepingcat4 avatar Mar 06 '25 16:03 sleepingcat4

@sleepingcat4 thanks friend, keep up the good work!

leminhnguyen avatar Mar 07 '25 01:03 leminhnguyen

@Plachtaa I have fine-tuned a couple of models, and while running inference I noticed something weird: even though my config file specified the 512x 128-band nvidia BigVGAN, it was loading weights from the 256x 80-band BigVGAN. Is this intended behavior?

sleepingcat4 avatar Mar 19 '25 17:03 sleepingcat4

For reference this is my notebook: https://colab.research.google.com/drive/1HeJgMIRpEMd87z5oAcfBfS8_YRLvrwr9?usp=sharing

sleepingcat4 avatar Mar 19 '25 17:03 sleepingcat4

@sleepingcat4 I see that you are cloning a forked repo, and there was a bug related to loading the pretrained model in inference.py, which was fixed in https://github.com/Plachtaa/seed-vc/pull/140; please check it.

One more note: you should load the DiT_* checkpoint file instead of ft_model.pth

leminhnguyen avatar Mar 20 '25 02:03 leminhnguyen
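
Following that note, inference would point at the DiT_* checkpoint that training writes out; a sketch (flags per the repo's README, and the checkpoint filename is a hypothetical example of the DiT_* pattern):

python inference.py --source source.wav --target reference.wav --output ./output \
  --config ./presets/config_dit_mel_seed_uvit_whisper_small_wavenet.yml \
  --checkpoint ./runs/Cartoon-Voice/DiT_epoch_00001_step_00600.pth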

@sleepingcat4 how come the model https://huggingface.co/sleeping-ai/11labs-seed worked for inference, but https://huggingface.co/sleeping-ai/11labs-seed-large and https://huggingface.co/sleeping-ai/11labs-512-large produced unintelligible speech? I did the setup correctly after downloading your 11labs-512-large files:

python app_svc.py --checkpoint ft_model.pth --config config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --fp16 True

GUUser91 avatar Sep 17 '25 00:09 GUUser91