[Feature]: Spin for RVC
Description
Hello,
a user commented on an old issue I had opened in the old RVC Project about the "RVCv3 pre-trained model".
He created a simple algorithm for replacing ContentVec with Spin (https://arxiv.org/pdf/2305.11072), which seems to outperform ContentVec.
https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/issues/2013#issuecomment-2814323965 https://github.com/dr87/spin-for-rvc
I hope these can be useful resources for developing a new architecture for Applio.
Problem
No
Proposed Solution
Implementing a new experimental architecture by training a new pre-trained model, similar to ContentVec, but this time with Spin, which offers faster training and better quality. It could be paired with RefineGAN, which now seems to be the new generator network and offers better overall quality than HiFi-GAN.
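Schematically, the change amounts to swapping the content-feature extractor in the preprocessing pipeline while the rest of the training stack stays the same. A minimal sketch of that swap point, with all names hypothetical and a dummy extractor standing in for the real torch checkpoints (768-dim features and a 320-sample hop are typical HuBERT-base values; Spin's actual output shape may differ):

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a frame-level content-feature extractor
# (ContentVec or Spin). Real code would load a torch checkpoint and
# run inference on 16 kHz audio.
@dataclass
class Embedder:
    name: str
    dim: int
    extract: Callable[[List[float]], List[List[float]]]

def make_dummy_extractor(dim: int):
    # Placeholder: emit one zero-vector frame per 320 samples,
    # mimicking the hop size of HuBERT-style models at 16 kHz.
    def extract(wave: List[float]) -> List[List[float]]:
        n_frames = max(1, len(wave) // 320)
        return [[0.0] * dim for _ in range(n_frames)]
    return extract

# Registry of available embedders; swapping ContentVec for Spin is
# just a different entry here, the downstream pipeline is unchanged.
EMBEDDERS = {
    "contentvec": Embedder("contentvec", 768, make_dummy_extractor(768)),
    "spin": Embedder("spin", 768, make_dummy_extractor(768)),
}

def extract_features(wave: List[float], embedder_name: str = "spin"):
    return EMBEDDERS[embedder_name].extract(wave)
```

The point of the sketch is that Spin is a drop-in replacement at the feature-extraction stage; nothing downstream needs to change except that pretrains must match the new feature space, as discussed further down this thread.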
Alternatives Considered
Training a new ContentVec, as mentioned in other discussions, is much more difficult: it requires heavy computational resources and a large vocal dataset. There are certainly several different ContentVec pre-trains, but they do not seem sufficient to push past the current quality ceiling. Other alternatives could include research into a new speaker-information disentanglement algorithm for better overall performance and quality.
It worked now. Test with Spin embedder: https://voca.ro/1irb2JqIM9aY
Did you actually use the official Spin pre-trained model, or did you train your own?
I used the official Spin pre-trained model as a speaker embedding and fine-tuned the v2 pre-trained model (32k).
Alright, that's cool. Feel free to share what you did and some files; it could help with development.
Thanks, big man.
The author of the repository has prepared an updated SPIN checkpoint with 2048 clusters, optimized to enhance RVC model generalization. This version significantly expands upon the previous dataset:
Original: LibriSpeech Clean-100
Updated: LibriSpeech Clean-100 + Clean-360 (Total: ~450 hours)
Speakers: Increased from 251 to 1172
This broader speaker representation appears to accelerate model convergence, likely due to the higher diversity of speaker characteristics. In testing, even with a modest dataset (1h45m of target data), the model adapted quickly when initialized from this checkpoint.
This may represent the upper limit of improvement using clean LibriSpeech data, as the full clean set is now included. Additional data is unlikely to yield further significant gains.
Here is the resource link: https://huggingface.co/dr87/spin-for-rvc/
Oh, okay. I'll test it tomorrow.
I've posted the latest model here https://huggingface.co/Aznamir/spin/tree/main
For now it can be used as a custom embedder, but I would not recommend using it for small datasets: the existing pretrains are trained on the regular HuBERT and won't be able to adjust to the new phoneme encoding with just a fine-tune.
Most likely all pretrains would have to be re-trained in order to be able to handle a variety of content from speaking to singing.
Thank you for the contribution, appreciate it.
Finally, Spin pretrained models.
@AznamirWoW sorry to ping, but would you mind including a short README on Hugging Face about the different pretrain variants? It would be a huge help for people who are testing/experimenting.
https://huggingface.co/Aznamir/spin/tree/main
Currently there are 3 up-to-date sets: f032k/f040k/f48k_spin7-12_single.pth
Download them into the custom pretrain folder and pick them under training any time you use Spin for feature extraction. Everything else should be downloaded/selected automatically.
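As a sketch of how the selection could work, a small resolver can map the training sample rate to the matching Spin pretrain. The folder path is an assumption (it varies by Applio install), and the filenames below expand the `f032k/f040k/f48k_spin7-12_single.pth` shorthand from the comment above, which is itself an assumption about the exact names:

```python
import os

# Assumed custom-pretrain folder; the real location depends on the install.
CUSTOM_PRETRAIN_DIR = "logs/pretraineds/custom"

# Hypothetical expansion of the f032k/f040k/f48k shorthand above.
SPIN_PRETRAINS = {
    32000: "f032k_spin7-12_single.pth",
    40000: "f040k_spin7-12_single.pth",
    48000: "f48k_spin7-12_single.pth",
}

def spin_pretrain_path(sample_rate: int) -> str:
    """Return the Spin pretrain file matching the training sample rate."""
    try:
        fname = SPIN_PRETRAINS[sample_rate]
    except KeyError:
        raise ValueError(f"no Spin pretrain for {sample_rate} Hz")
    return os.path.join(CUSTOM_PRETRAIN_DIR, fname)
```

The key constraint it encodes: the G/D pretrain set and the training sample rate must agree, and unsupported rates should fail loudly rather than silently falling back to a HuBERT-trained pretrain.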
@AznamirWoW I'm confused. I manually installed the G and D pretrained models,
but you mentioned something about feature extraction; are you referring to the custom embedder? Which model should I use for the custom embedder?
If you are preparing your dataset with the "spin" embedder, then you need to use that custom Spin pretrain to train the model.