voicebox-pytorch
Training Example
The training example given seems to be missing the mask vector. In the paper, the input to the model was the audio, the mask, and the phoneme sequence (which was aligned to the audio in the previous implementation of this repo).
So where are the mask vectors and the phoneme sequence used during training?
Thank you, and much appreciation for all you have done.
Can you screenshot or paste the relevant section of the paper for said mask?
I'm introducing Spear-TTS conditioning, proven out in the SoundStorm repository, and bypassing the duration / phoneme / alignment stuff.
Alright, I will read up on Spear-TTS. Could you tell me what the 'cond' variable actually means with respect to an audio clip and its transcript?
And we might have to use a different TTS for other languages for the alignment.
Thank you
@YKoustubhRao thanks for the screenshot
i've decided to automatically manage the condition if you pass in the binary temporal mask described in section 3.2, as cond_mask. it will also be auto-generated during training. during inference, you would construct the mask to zero out the section you would like to infill
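To make that inference-time usage concrete, here is a minimal sketch in plain PyTorch of building a binary temporal mask and zeroing out the section to infill. Note this is illustrative only: the helper name, shapes, and the mask-to-feature step are assumptions for the example, not the repo's actual API (check the README for the real `cond_mask` usage).

```python
import torch

def infill_cond_mask(seq_len, mask_start, mask_end, batch=1):
    """Binary temporal mask: True marks frames to be generated (infilled),
    False marks frames kept as conditioning context."""
    mask = torch.zeros(batch, seq_len, dtype=torch.bool)
    mask[:, mask_start:mask_end] = True
    return mask

# keep frames 0-99 and 200-299 as context, regenerate frames 100-199
cond_mask = infill_cond_mask(seq_len=300, mask_start=100, mask_end=200)

# zero out the masked section of the conditioning features
cond = torch.randn(1, 300, 80)  # e.g. mel-spectrogram-like features (assumed shape)
cond = cond.masked_fill(cond_mask.unsqueeze(-1), 0.)
```

During training the analogous mask is generated automatically, so you only need to construct one yourself when choosing the infill region at inference.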
@YKoustubhRao i will get the phoneme / duration / aligner stuff finished by end of week along with some training code
Is there a pipeline for denoising and zero-shot TTS? @lucidrains
Hello lucidrains, can you share your training script and data preparation code to make it easier to try? Thanks in advance.
Any updates on this?
Same question.
ah, the code is all in there and @lucasnewman has already trained models successfully. i'll update the readme by end of week
Hello. Will the weights be released?
Thank you
Hey all, there's a small pretrained model available in this discussion thread: https://github.com/lucidrains/voicebox-pytorch/discussions/29#discussioncomment-7732769
All the training code is in the repo and I put the details for the training hyperparams in the thread, so training your own model should be as straightforward as instantiating the models, dataset, and trainer and calling train() -- if you're having issues, report back and I can try to help.
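For anyone unsure what a training step does under the hood, here is a small, self-contained sketch of the conditional flow-matching objective that Voicebox trains with, in plain PyTorch. The tiny network and shapes are stand-ins for illustration; the repo's actual Trainer and ConditionalFlowMatcherWrapper handle all of this for you.

```python
import torch
import torch.nn as nn

# toy "velocity field" network standing in for the real transformer;
# input is the interpolated sample x_t concatenated with the time t
model = nn.Sequential(nn.Linear(65, 64), nn.SiLU(), nn.Linear(64, 64))

def flow_matching_step(x1, optimizer):
    """One conditional flow-matching training step.
    x1: batch of target features, shape (batch, dim)."""
    b, _ = x1.shape
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(b, 1)                      # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                # linear interpolation path
    target_velocity = x1 - x0                 # derivative of the path w.r.t. t
    pred = model(torch.cat([xt, t], dim=-1))  # predict velocity given (x_t, t)
    loss = nn.functional.mse_loss(pred, target_velocity)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = flow_matching_step(torch.randn(8, 64), opt)
```

At inference, the learned velocity field is integrated with an ODE solver from noise to data; the Trainer in the repo wraps the loop above with batching, logging, and checkpointing.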
@lucasnewman Thanks for your hyperparams and pretrained model. It achieves acceptable results with a batch size of 32 and 100k steps on a 4090 GPU.
Hey, can you send us sound samples?
@shigabeev, @lucasnewman has some voice samples in the repo; you should be able to reproduce the same results. If you still need samples, let me know and I might be able to send you some.
Yeah, I found his trained model on HF, and it sounds pretty good. However, I wasn't able to figure out how to run it in text-conditioned mode (TTS). Can you show me how to do it? Or could you send some of your TTS audio samples?
@lucidrains I see that the ConditionalFlowMatcherWrapper class currently lacks support for a duration predictor. If you've already worked on this, would it be possible to add it? I'd really appreciate it! Thanks!
@iishapandey hey Isha, thanks for your interest
i would recommend that you take a look at this follow-up research, where they best Voicebox with a simpler scheme
there i include a duration predictor