
Comparison to Vocos?

Open Ryu1845 opened this issue 2 years ago • 7 comments

Hi! I wanted to ask whether you know about Vocos and whether you compared HiFTNet against it, since it uses similar principles and gets similar results.

Ryu1845 avatar Sep 19 '23 16:09 Ryu1845

Thanks for letting me know about this work. I wasn't aware of it beforehand, so I didn't compare against it. I think it's still quite different from Vocos, because in this work we optimize quality over speed, while Vocos optimizes speed over quality.

In the paper, the author shows that Vocos is four times faster than iSTFTNet with performance comparable to BigVGAN-base (I believe the BigVGAN in the paper refers to BigVGAN-base, because BigVGAN has 114M parameters whereas the paper lists it as having only 14M), while our work is nearly twice as slow as iSTFTNet but significantly outperforms BigVGAN-base, with performance comparable to BigVGAN.

I tried Vocos myself, and perceptually it sounds slightly worse than HiFTNet, but it is indeed much, much faster. I think one big advantage of HiFTNet is that it works well for singing synthesis, while Vocos lags behind because it does not have hn-NSF. Overall, though, if you care more about speed, Vocos is definitely a much better choice.

Vocos: https://drive.google.com/file/d/1GTZaNlukv0jkNStPJ644oD1s2RJ2GEZW/view?usp=sharing
HiFTNet: https://drive.google.com/file/d/1Phu9Z3Q55L08uWd3RKw9q3rVT3DrczWe/view?usp=sharing
BigVGAN (not base): https://drive.google.com/file/d/1r-qYcRqk7Qt90Ik55msVlwKhyjcsL787/view?usp=sharing

yl4579 avatar Sep 19 '23 17:09 yl4579

I will leave this issue open in case someone is interested in comparing it to Vocos. For singing synthesis, someone could probably combine the two to make a fast, high-quality singing vocoder. I developed this vocoder primarily for singing voice conversion with SLMGAN, but there was no singing data to compare on, so I compared on LJSpeech and LibriTTS instead.

yl4579 avatar Sep 19 '23 18:09 yl4579

Thank you for your quick reply! This is a great comparison. I can definitely see your work being better for singing synthesis, considering it uses NSF. I'm looking forward to an eventual fast HQ singing vocoder!

Ryu1845 avatar Sep 19 '23 18:09 Ryu1845

I think it's a good idea. I'll try to combine the two and test the result against Vocos to see whether it is better while still bringing significant speed improvements. If it works well, I'll add it to the paper later.

yl4579 avatar Sep 19 '23 23:09 yl4579

I have tried to incorporate hn-NSF into Vocos, but the quality is worse than without it. I think it could be related to how the source is fed into the model (for example, taking an STFT of it before feeding it in). It doesn't seem to be a trivial task, so more experiments are needed. If anyone else has time, please take a look at it.

yl4579 avatar Oct 03 '23 19:10 yl4579
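
One reading of the suggestion above (taking an STFT of the source before feeding it in) is that a frame-rate model such as Vocos needs the sample-rate source turned into one feature vector per frame. The following is a minimal sketch of that conversion in plain Python, not the code used in the experiment described above. The names `stft_frames`, `n_fft`, and `hop_size` are hypothetical, and the zero padding and Hann window are assumptions.

```python
import cmath
import math

def stft_frames(signal, n_fft, hop_size):
    """Return a list of frames, each holding n_fft // 2 + 1 complex bins.

    Assumption: the signal is zero-padded by n_fft // 2 on both sides so that
    frame t is centred on sample t * hop_size.
    """
    # Periodic Hann window (an assumption; any analysis window could be used).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    pad = n_fft // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    num_frames = 1 + (len(padded) - n_fft) // hop_size
    frames = []
    for t in range(num_frames):
        start = t * hop_size
        chunk = [padded[start + n] * window[n] for n in range(n_fft)]
        # Naive DFT over the non-negative frequencies; fine for a small demo.
        bins = [
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ]
        frames.append(bins)
    return frames
```

If `hop_size` matches the hop of the mel-spectrogram, the number of frames lines up with the acoustic features up to edge padding, so the real and imaginary parts (or magnitude and phase) of each frame could be concatenated with the mel frame. Whether that is the right way to condition the model is exactly the open question raised in the comment.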

> I think one big advantage of HiFTNet is it works well for singing synthesis while Vocos lags behind because it does not have the hn-NSF.

The 22 kHz sample rate of the pretrained HiFTNet models is a bit low for my purpose. I think 32 kHz is what I would need. Vocos also only supports 24 kHz. Would you recommend retraining with different parameters or using some kind of upsampling model at the end? Speed is not so important in my case.

TechInterMezzo avatar Oct 21 '23 15:10 TechInterMezzo

@TechInterMezzo If speed is not a concern, I would recommend you just train an NSF-BigVGAN with the current setup (i.e., a pre-trained F0 network to extract F0). Basically you add NSF to BigVGAN with F0 extracted using a pre-trained F0 network on mel-spectrograms.

yl4579 avatar Oct 28 '23 04:10 yl4579
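
For readers unfamiliar with the NSF part of this suggestion, here is a rough sketch of a harmonic-plus-noise excitation built from a frame-level F0 contour, the kind of signal an NSF-style vocoder takes as its source. It is an illustration under stated assumptions, not HiFTNet's or BigVGAN's code: the function name, the parameter names, and the default values are hypothetical, F0 is passed in directly rather than predicted by a pre-trained network, and a plain sum of harmonics stands in for the learned merge used in real models.

```python
import math
import random

def harmonic_noise_source(f0_frames, hop_size, sample_rate,
                          num_harmonics=8, sine_amp=0.1, noise_std=0.003, seed=0):
    """Build a sample-rate excitation from frame-level F0 values (Hz, 0 = unvoiced).

    Voiced frames: sum of harmonics of F0 plus low-level Gaussian noise.
    Unvoiced frames: Gaussian noise only.
    """
    rng = random.Random(seed)
    source = []
    phase = 0.0  # running phase of the fundamental, in cycles
    for f0 in f0_frames:
        for _ in range(hop_size):  # hold each F0 value for one hop
            if f0 > 0:
                phase = (phase + f0 / sample_rate) % 1.0
                sample = 0.0
                for k in range(1, num_harmonics + 1):
                    if k * f0 < sample_rate / 2:  # skip harmonics above Nyquist
                        sample += math.sin(2 * math.pi * k * phase)
                sample = sine_amp * sample + rng.gauss(0.0, noise_std)
            else:
                # Assumed noise level for unvoiced regions.
                sample = rng.gauss(0.0, sine_amp / 3)
            source.append(sample)
    return source
```

Calling `harmonic_noise_source([0.0, 220.0, 220.0], hop_size=256, sample_rate=22050)` gives 768 samples: one hop of noise followed by two hops of a 220 Hz harmonic stack. For a higher target rate such as 32 kHz, `sample_rate` and `hop_size` would change together with the rest of the training configuration.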

Hi, which vocoder do you think has the best quality for 44100 Hz waveform output? Thank you!

bzp83 avatar Jun 13 '24 01:06 bzp83