MetaVoice-1B: fix degradation compared to Python version

Open vatsalaggarwal opened this issue 11 months ago • 7 comments

The MetaVoice-1B model in candle shows significant quality degradation compared to the Python version. I believe one of the main causes is that it uses a 64x smaller decoder model (instead of multiband diffusion and DeepFilterNet).

Multiband diffusion is a general-purpose, diffusion-based model that can decode Encodec tokens (Encodec is a neural audio codec that can model diverse audio, including speech and music). So there are additional benefits to having this in the candle codebase for any other LLMs in the audio/music/speech space.

DeepFilterNet is a powerful speech enhancement model, and so there are also additional benefits to having this within candle.

vatsalaggarwal, Mar 04 '24 14:03

@LaurentMazare does that seem right? Or are there other places where significant quality degradation could be coming from?

vatsalaggarwal, Mar 04 '24 14:03

Right, I think this might explain most of the difference. I've aligned the first model carefully with a temperature of 0 but not the second model, so there might be other discrepancies coming from there. Another difference is that speaker embeddings are not fully supported in candle at the moment, though hopefully this won't be too hard to add (I've started making the appropriate changes).

LaurentMazare, Mar 04 '24 15:03

Saw the note about the speaker embeddings in your README, that makes sense and, as you say, should be quick to fix! Ah, I got what you meant by "implementation discrepancies" now re: the second stage...

vatsalaggarwal, Mar 04 '24 16:03

Curious what the path is to get the quality working better?

Is it a known piece of work to do, or is more research needed before knowing?

I can poke around and dig more; I have been focused on other issues, but they seem to be fixed. I'm not sure what to look at, since I don't know whether this requires a known fix or needs more investigation.

Thanks!

groovybits, Mar 23 '24 17:03

Hey Chris, I would say it's a known piece of work... we'd have to change the decoder currently integrated into candle...

vatsalaggarwal, Mar 30 '24 15:03

Any updates on speaker embeddings support? I'd like to work on it if no one else is currently.

Catchawink, May 22 '24 03:05

> Any updates on speaker embeddings support? I'd like to work on it if no one else is currently.

I'm not looking at it at the moment; it would be great if you could give it a try!

LaurentMazare, May 22 '24 06:05