[Feature]: Spin for RVC
Description
Hello,
a user commented on an old issue I had opened in the old RVC Project about the "RVCv3 pre-trained model".
He created a simple algorithm for replacing ContentVec with Spin (https://arxiv.org/pdf/2305.11072), which seems to outperform ContentVec.
https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/issues/2013#issuecomment-2814323965 https://github.com/dr87/spin-for-rvc
I hope these can be useful resources for developing a new architecture for Applio.
Problem
No
Proposed Solution
Implementing a new experimental architecture by training a new pre-trained model, similar to ContentVec, but this time with Spin, which offers faster training and better quality. It could be paired with RefineGAN, which now seems to be the new generator network and offers better overall quality than HiFi-GAN.
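Schematically, the change amounts to swapping the content-feature extractor in the preprocessing pipeline while the rest of the training stack stays the same. A minimal sketch of that swap point, with all names hypothetical and a dummy extractor standing in for the real torch checkpoints (768-dim features and a 320-sample hop are typical HuBERT-base values; Spin's actual output shape may differ):

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a frame-level content-feature extractor
# (ContentVec or Spin). Real code would load a torch checkpoint and
# run inference on 16 kHz audio.
@dataclass
class Embedder:
    name: str
    dim: int
    extract: Callable[[List[float]], List[List[float]]]

def make_dummy_extractor(dim: int):
    # Placeholder: emit one zero-vector frame per 320 samples,
    # mimicking the hop size of HuBERT-style models at 16 kHz.
    def extract(wave: List[float]) -> List[List[float]]:
        n_frames = max(1, len(wave) // 320)
        return [[0.0] * dim for _ in range(n_frames)]
    return extract

# Registry of available embedders; swapping ContentVec for Spin is
# just a different entry here, the downstream pipeline is unchanged.
EMBEDDERS = {
    "contentvec": Embedder("contentvec", 768, make_dummy_extractor(768)),
    "spin": Embedder("spin", 768, make_dummy_extractor(768)),
}

def extract_features(wave: List[float], embedder_name: str = "spin"):
    return EMBEDDERS[embedder_name].extract(wave)
```

The point of the sketch is that Spin is a drop-in replacement at the feature-extraction stage; nothing downstream needs to change except that pretrains must match the new feature space, as discussed further down this thread.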
Alternatives Considered
Training a new ContentVec, as mentioned in other discussions, is much more difficult: it requires heavy computational resources and a large vocal dataset. There are certainly several different ContentVec pre-trains, but they do not seem sufficient to push past the current quality ceiling. Other alternatives could include research into a new speaker-information disentanglement algorithm for better overall performance and quality.
It worked now. Test with Spin embedder: https://voca.ro/1irb2JqIM9aY
Did you actually use the official Spin pre-trained model, or did you train your own?
I used the official Spin pre-trained model as a speaker embedding and fine-tuned the v2 pre-trained model (32k).
Alright, that's cool. Feel free to share what you did and some files; it could help with development.
Thanks, big man.
The author of the repository has prepared an updated SPIN checkpoint with 2048 clusters, optimized to enhance RVC model generalization. This version significantly expands upon the previous dataset:
Original: LibriSpeech Clean-100
Updated: LibriSpeech Clean-100 + Clean-360 (Total: ~450 hours)
Speakers: Increased from 251 to 1172
This broader speaker representation appears to accelerate model convergence, likely due to the higher diversity of speaker characteristics. In testing, even with a modest dataset (1h45m of target data), the model adapted quickly when initialized from this checkpoint.
This may represent the upper limit of improvement using clean LibriSpeech data, as the full clean set is now included. Additional data is unlikely to yield further significant gains.
Here is the resource link: https://huggingface.co/dr87/spin-for-rvc/
Oh, okay. I'll test it tomorrow.
I've posted the latest model here https://huggingface.co/Aznamir/spin/tree/main
For now it can be used as a custom embedder, but I would not recommend using it for small datasets: the existing pretrains are trained on the regular HuBERT and won't be able to adjust to the new phoneme encoding with just a fine-tune.
Most likely all pretrains would have to be re-trained in order to be able to handle a variety of content from speaking to singing.
Thank you for the contribution, appreciate it.
Finally, Spin pretrained models.
@AznamirWoW sorry to ping, but would you mind including a short README on Hugging Face about the different pretrain variants? It would be a huge help for people who are testing/experimenting.
https://huggingface.co/Aznamir/spin/tree/main
Currently there are 3 up-to-date sets: f032k/f040k/f48k_spin7-12_single.pth
Download them into the custom pretrain folder and pick them under training any time you use Spin for feature extraction. Everything else should be downloaded/selected automatically.
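As a sketch of how the selection could work, a small resolver can map the training sample rate to the matching Spin pretrain. The folder path is an assumption (it varies by Applio install), and the filenames below expand the `f032k/f040k/f48k_spin7-12_single.pth` shorthand from the comment above, which is itself an assumption about the exact names:

```python
import os

# Assumed custom-pretrain folder; the real location depends on the install.
CUSTOM_PRETRAIN_DIR = "logs/pretraineds/custom"

# Hypothetical expansion of the f032k/f040k/f48k shorthand above.
SPIN_PRETRAINS = {
    32000: "f032k_spin7-12_single.pth",
    40000: "f040k_spin7-12_single.pth",
    48000: "f48k_spin7-12_single.pth",
}

def spin_pretrain_path(sample_rate: int) -> str:
    """Return the Spin pretrain file matching the training sample rate."""
    try:
        fname = SPIN_PRETRAINS[sample_rate]
    except KeyError:
        raise ValueError(f"no Spin pretrain for {sample_rate} Hz")
    return os.path.join(CUSTOM_PRETRAIN_DIR, fname)
```

The key constraint it encodes: the G/D pretrain set and the training sample rate must agree, and unsupported rates should fail loudly rather than silently falling back to a HuBERT-trained pretrain.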
@AznamirWoW I'm confused. I manually installed the G and D pretrained models,
but you mentioned something about feature extraction; are you referring to the custom embedder? Which model should I use for the custom embedder?
If you are preparing your dataset with the "spin" embedder, then you need to use that custom Spin pretrain to train the model.