seed-vc
Whisper-large instead of Whisper-small?
I read the code, and it's clear that if I change Whisper-small to Whisper-large I'll need to change an output dim somewhere. Which one should I change? @Plachtaa do you have any hints or directions?
hi @sleepingcat4, simply changing model_params.length_regulator.in_channels to 1280 in the config file to match the whisper-large encoder output dim should work. Don't forget to finetune the model after you change it.
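If you want to double-check the encoder width before editing the config, the hidden size is readable from the Hugging Face model config (a quick sanity check, not part of seed-vc):

```python
from transformers import WhisperConfig

# d_model is the encoder hidden size, i.e. the feature dim that
# model_params.length_regulator.in_channels must match.
for name in ["openai/whisper-small", "openai/whisper-large-v3"]:
    print(name, WhisperConfig.from_pretrained(name).d_model)
# openai/whisper-small 768
# openai/whisper-large-v3 1280
```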
thanks @Plachtaa for the quick answer! btw, if I want to increase the parameter count to up to 1B, what changes should be made to the DiT architecture? do you have any advice?
I don't really suggest doing so, as the merit of a VC model is being real-time and lightweight; VC is not such a difficult task that it's worth scaling up to 1B.
@Plachtaa I wanted to experiment and see how it behaves, since I have some spare compute. I was thinking of increasing the hidden dim of the DiT, but any advice for experimentation only would be nice.
for your reference
@Plachtaa thank you for being so helpful. Another question: if I change the vocoder from "nvidia/bigvgan_v2_22khz_80band_256x" to https://huggingface.co/nvidia/bigvgan_v2_44khz_128band_512x, which params should I change in the config?
see https://github.com/Plachtaa/seed-vc/blob/97544fff2a57db718424c13ecd77f415bd2c7d4d/configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml#L14
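If it helps, you can also read the matching mel settings straight from the vocoder's own config.json on HF; the repo name itself encodes 44.1 kHz / 128 mel bands / 512x hop. A minimal sketch (the exact key names follow BigVGAN's HiFi-GAN-style config, so treat them as an assumption):

```python
import json
from huggingface_hub import hf_hub_download

# Read the vocoder's own config to see which preprocessing params
# (sample rate, mel bands, hop size) the seed-vc config must match.
path = hf_hub_download("nvidia/bigvgan_v2_44khz_128band_512x", "config.json")
with open(path) as f:
    cfg = json.load(f)
for key in ("sampling_rate", "num_mels", "n_fft", "hop_size", "win_size"):
    print(key, cfg.get(key))
```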
@Plachtaa and @sleepingcat4 I need help. I followed Plachtaa's instructions and edited the config_dit_mel_seed_uvit_whisper_small_wavenet.yml config file, changing
in_channels: 768
to
in_channels: 1280
and
name: "openai/whisper-small"
to
name: "openai/whisper-large".
I also tried "openai/whisper-large-v2", "openai/whisper-large-v3", and "openai/whisper-large-v3-turbo".
Then I tried fine-tuning:
python train.py --config ./presets/config_dit_mel_seed_uvit_whisper_small_wavenet.yml --dataset-dir Cartoon-Voice --run-name Cartoon-Voice --batch-size 2 --max-steps 600 --max-epochs 1000 --save-every 600 --num-workers 0
No matter which whisper-large model I use, the output file always starts speaking in gibberish. Is it because the dataset I gathered is only 3 minutes and 20 seconds in total? https://vocaroo.com/1cTRV93JtsPQ
For reference: source audio https://vocaroo.com/1jKAso7ZrC4C, reference audio https://vocaroo.com/1jM7CIP8gROA
@GUUser91 You must rerun pretraining on a large-scale dataset once you swap the encoder from whisper-small to whisper-large-v3.
@GUUser91 definitely, you should gather more data. I trained on 2 hours and got awesome results!
@sleepingcat4 Did you fine tune a model or did you train from scratch?
@GUUser91 I just fine-tuned
@sleepingcat4 How many steps did you fine tune the model for?
@GUUser91 400 steps
I'm doing the same, but it doesn't detect any GPU for training. Training is too slow on CPU, but I have a GPU (RTX 4060).
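For anyone hitting this, a quick way to check what PyTorch actually sees (plain PyTorch, nothing seed-vc-specific):

```python
import torch

# torch.version.cuda is None on CPU-only wheels; if so, reinstall a CUDA build.
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available())  # should be True for an RTX 4060
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```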
hi @sleepingcat4, did you change whisper-small to whisper-large-v3 and finetune with only 4 hours of data? One more question: is the 4h total from a single-speaker dataset or a multi-speaker dataset? Many thanks!
@leminhnguyen Yes, I trained on a multi-speaker dataset of almost 4 hours of data. My data was very high quality; it's an 11labs dataset that I developed and open-sourced a few days earlier on HF under my lab, Sleeping AI.
@sleepingcat4 thanks friend, keep up the good work!
@Plachtaa I have fine-tuned a couple of models and was running inference when I noticed something weird. Even though my config file specified a 512-band nvidia BigVGAN, it was loading weights from the 256-band BigVGAN. Is this intended behavior?
For reference this is my notebook: https://colab.research.google.com/drive/1HeJgMIRpEMd87z5oAcfBfS8_YRLvrwr9?usp=sharing
@sleepingcat4 I see that you are cloning a forked repo, and there is a bug related to loading the pretrained model in inference.py, which was fixed in https://github.com/Plachtaa/seed-vc/pull/140. Please check it.
One more note: you should load the DiT_* file instead of ft_model.pth.
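If you're unsure which file that is, a minimal sketch to pick the newest DiT checkpoint (the runs/<run-name> directory is an assumption; adjust to wherever train.py wrote your checkpoints):

```python
import glob
import os

# Pick the most recently saved DiT_* checkpoint instead of ft_model.pth.
run_dir = "runs/Cartoon-Voice"  # hypothetical path; match your --run-name
ckpts = sorted(glob.glob(os.path.join(run_dir, "DiT_*.pth")), key=os.path.getmtime)
print(ckpts[-1] if ckpts else "no DiT_* checkpoint found")
```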
@sleepingcat4 how come this model https://huggingface.co/sleeping-ai/11labs-seed worked for inference, but https://huggingface.co/sleeping-ai/11labs-seed-large and https://huggingface.co/sleeping-ai/11labs-512-large produced unintelligible speech? I did the setup correctly after downloading your 11labs-512-large files:
python app_svc.py --checkpoint ft_model.pth --config config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --fp16 True