VoiceCraft
A few questions about the paper [Encodec; inference speed; model parameters]
Hi @jasonppy , great work and samples, thanks for sharing the code!
The introduction of causal masking for TTS is an elegant approach to contextualization. Bravo!
I'm curious about a few aspects of your work at the moment:
- Did you train Encodec as well? To my knowledge, pretrained parameters have already been released, but looking into your code, it seems that you trained it yourself. I wonder what the reason for this might be. A hypothesis: no released parameters for a 16 kHz sampling rate? (See the sketch after this list.)
- When it comes to inference, you mention that you run it multiple times. Can you share the inference speed for, say, a 10-second utterance on the 820M model?
- Is there any estimate of when the model parameters will be released?
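Regarding the 16 kHz hypothesis: as far as I know, the checkpoints shipped with the public `facebookresearch/encodec` package target 24 kHz and 48 kHz audio, so a 16 kHz model would have to be trained separately. A minimal sketch to check this, assuming the `encodec` pip package is installed:

```python
# Inspect the sampling rates of the publicly released Encodec checkpoints.
# This only illustrates that 24 kHz and 48 kHz models are shipped with the
# package, i.e. no 16 kHz checkpoint is available out of the box.
from encodec import EncodecModel

model_24k = EncodecModel.encodec_model_24khz()
model_48k = EncodecModel.encodec_model_48khz()

print(model_24k.sample_rate)  # 24000
print(model_48k.sample_rate)  # 48000
```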
Have a good one! Best, Taras
Thanks!
- Did you train Encodec as well? To my knowledge, pretrained parameters have already been released, but looking into your code, it seems that you trained it yourself. I wonder what the reason for this might be. A hypothesis: no released parameters for a 16 kHz sampling rate?
Yes, we trained Encodec as well. We will also open-source the trained Encodec model.
- When it comes to inference, you mention that you run it multiple times. Can you share the inference speed for, say, a 10-second utterance on the 820M model?
For the 830M model, generation is faster than real time for a 10-second utterance on an A40 GPU. More details will be added to the camera-ready paper.
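For reference, a rough way to check the real-time factor yourself once the weights are out (the `model`, `prompt`, and `generate` names below are hypothetical placeholders, not the actual VoiceCraft API):

```python
import time
import torch

# Hypothetical placeholders: `model`, `prompt`, and `target_text` stand in for
# the actual VoiceCraft objects once the weights are released.
utterance_duration_s = 10.0  # length of the target utterance in seconds

if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.time()
# audio = model.generate(prompt, target_text)  # hypothetical call
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.time() - start

# Real-time factor: values below 1.0 mean generation is faster than real time.
rtf = elapsed / utterance_duration_s
print(f"RTF: {rtf:.2f}")
```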
- Is there any estimate of when the model parameters will be released?
Model parameters will be released by the end of this month.
Thanks, looking forward to seeing more details!
Hi @jasonppy, can you explain why you chose to retrain Encodec instead of using the released model? Are 8 codebooks too many?