seed-vc
Whisper-large instead of Whisper-small?
I read the code, and it's clear that if I change Whisper-small to Whisper-large I'll need to change an output dim somewhere. Which one should I change? @Plachtaa do you have any hints or directions?
hi @sleepingcat4, simply changing model_params.length_regulator.in_channels to 1280 in the config file to match the whisper-large encoder output dim should work. Don't forget to finetune the model after you change it.
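If you want to double-check the encoder width before editing the config, the hidden size is readable from the Hugging Face model config (a quick sanity check, not part of seed-vc):

```python
from transformers import WhisperConfig

# d_model is the encoder hidden size, i.e. the feature dim that
# model_params.length_regulator.in_channels must match.
for name in ["openai/whisper-small", "openai/whisper-large-v3"]:
    print(name, WhisperConfig.from_pretrained(name).d_model)
# openai/whisper-small 768
# openai/whisper-large-v3 1280
```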
thanks @Plachtaa for the quick answer! btw, if I want to increase the parameter count to up to 1B, what changes should be made to the DiT architecture? do you have any advice?
I don't really suggest doing so, as the merit of a VC model is being real-time and lightweight; VC is not such a difficult task that it's worth scaling up to 1B.
@Plachtaa I wanted to experiment and see how it behaves, since I have some spare compute. I was thinking of increasing the hidden dim of the DiT, but any advice for experimentation only would be nice.
for your reference
@Plachtaa thank you for being so helpful. Another question: if I change the vocoder from "nvidia/bigvgan_v2_22khz_80band_256x" to https://huggingface.co/nvidia/bigvgan_v2_44khz_128band_512x, which params should I change in the config?
see https://github.com/Plachtaa/seed-vc/blob/97544fff2a57db718424c13ecd77f415bd2c7d4d/configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml#L14
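If it helps, you can also read the matching mel settings straight from the vocoder's own config.json on HF; the repo name itself encodes 44.1 kHz / 128 mel bands / 512x hop. A minimal sketch (the exact key names follow BigVGAN's HiFi-GAN-style config, so treat them as an assumption):

```python
import json
from huggingface_hub import hf_hub_download

# Read the vocoder's own config to see which preprocessing params
# (sample rate, mel bands, hop size) the seed-vc config must match.
path = hf_hub_download("nvidia/bigvgan_v2_44khz_128band_512x", "config.json")
with open(path) as f:
    cfg = json.load(f)
for key in ("sampling_rate", "num_mels", "n_fft", "hop_size", "win_size"):
    print(key, cfg.get(key))
```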
@Plachtaa and @sleepingcat4 I need help. I followed Plachtaa's instructions and edited the config_dit_mel_seed_uvit_whisper_small_wavenet.yml config file, changing
in_channels: 768
to
in_channels: 1280
and
name: "openai/whisper-small"
to
name: "openai/whisper-large".
I also tried "openai/whisper-large-v2", "openai/whisper-large-v3", and "openai/whisper-large-v3-turbo".
Then I tried fine-tuning:
python train.py --config ./presets/config_dit_mel_seed_uvit_whisper_small_wavenet.yml --dataset-dir Cartoon-Voice --run-name Cartoon-Voice --batch-size 2 --max-steps 600 --max-epochs 1000 --save-every 600 --num-workers 0
No matter which whisper-large model I use, the output file always starts speaking in gibberish. Is it because the dataset I gathered is only 3 minutes and 20 seconds in total? https://vocaroo.com/1cTRV93JtsPQ
For reference: source audio https://vocaroo.com/1jKAso7ZrC4C, reference audio https://vocaroo.com/1jM7CIP8gROA
@GUUser91 You must rerun pretraining on a large-scale dataset once you swap the encoder from whisper-small to whisper-large-v3.
@GUUser91 definitely, you should gather more data. I trained on 2 hours and got awesome results!
@sleepingcat4 Did you fine tune a model or did you train from scratch?
@GUUser91 I just fine-tuned
@sleepingcat4 How many steps did you fine tune the model for?
@GUUser91 400 steps
I'm doing the same, but it doesn't detect any GPU for training. Training is too slow on CPU, but I have a GPU (RTX 4060).
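For anyone hitting this, a quick way to check what PyTorch actually sees (plain PyTorch, nothing seed-vc-specific):

```python
import torch

# torch.version.cuda is None on CPU-only wheels; if so, reinstall a CUDA build.
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available())  # should be True for an RTX 4060
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```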
hi @sleepingcat4, did you change whisper-small to whisper-large-v3 and finetune with only 4 hours of data? One more question: is the 4h total from a single-speaker dataset or a multi-speaker dataset? Many thanks!
@leminhnguyen Yes, I trained on a multi-speaker dataset of almost 4 hours of data. My data was very high quality; it's an 11labs dataset that I developed and open-sourced a few days earlier on HF under my lab, Sleeping AI.
@sleepingcat4 thanks friend, keep up the good work!
@Plachtaa I have fine-tuned a couple of models and was running inference when I noticed something weird. Even though my config file specified a 512-band nvidia BigVGAN, it was loading weights from the 256-band BigVGAN. Is this intended behavior?
For reference this is my notebook: https://colab.research.google.com/drive/1HeJgMIRpEMd87z5oAcfBfS8_YRLvrwr9?usp=sharing
@sleepingcat4 I see that you are cloning a forked repo, and there is a bug related to loading the pretrained model in inference.py, which was fixed in https://github.com/Plachtaa/seed-vc/pull/140. Please check it.
One more note: you should load the DiT_* file instead of ft_model.pth.
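If you're unsure which file that is, a minimal sketch to pick the newest DiT checkpoint (the runs/<run-name> directory is an assumption; adjust to wherever train.py wrote your checkpoints):

```python
import glob
import os

# Pick the most recently saved DiT_* checkpoint instead of ft_model.pth.
run_dir = "runs/Cartoon-Voice"  # hypothetical path; match your --run-name
ckpts = sorted(glob.glob(os.path.join(run_dir, "DiT_*.pth")), key=os.path.getmtime)
print(ckpts[-1] if ckpts else "no DiT_* checkpoint found")
```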
@sleepingcat4 how come this model https://huggingface.co/sleeping-ai/11labs-seed worked for inference, but https://huggingface.co/sleeping-ai/11labs-seed-large and https://huggingface.co/sleeping-ai/11labs-512-large produced unintelligible speech? I did the setup correctly after downloading your 11labs-512-large files:
python app_svc.py --checkpoint ft_model.pth --config config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --fp16 True