voicebox-pytorch

Training Example

Open YKoustubhRao opened this issue 1 year ago • 19 comments

The training example seems to be missing the mask vector. In the paper, the inputs to the model are the audio, the mask, and the phoneme sequence (which was aligned to the audio in the previous implementation of this repo).

So where are the mask vectors and the phoneme sequence used in the training?

Thank You and great appreciation for all you have done.

YKoustubhRao avatar Sep 25 '23 23:09 YKoustubhRao
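
For readers following along, here is a minimal sketch of what "a phoneme sequence aligned to the audio" can look like in practice: each phoneme is repeated for as many frames as it lasts, so the transcript ends up with one entry per audio frame. The function name `expand_to_frames` and the example durations are hypothetical, chosen only for illustration; this is not code from the repo or the paper.

```python
def expand_to_frames(phonemes, durations):
    """Repeat each phoneme by its duration (in frames).

    Hypothetical helper for illustration only. `durations[i]` is assumed
    to be the number of audio frames covered by `phonemes[i]`.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    aligned = []
    for phoneme, num_frames in zip(phonemes, durations):
        aligned.extend([phoneme] * num_frames)
    return aligned


# Made-up durations: the result has one phoneme label per audio frame.
aligned = expand_to_frames(["HH", "AH", "L", "OW"], [2, 3, 1, 4])
```

The aligned list has the same length as the total number of frames, which is what lets it be paired frame by frame with the audio features and the mask.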

can you screenshot or paste the relevant section of the paper for said mask?

lucidrains avatar Sep 25 '23 23:09 lucidrains

I'm introducing spear tts conditioning, proven out in the soundstorm repository, and bypassing duration, phoneme, alignment stuff.

lucidrains avatar Sep 25 '23 23:09 lucidrains

image.

Alright, I will read up on Spear TTS. Could you tell me what the 'cond' variable actually means with respect to an audio clip and its transcript?

And we might have to use a different TTS for other languages for the alignment.

Thank You

YKoustubhRao avatar Sep 26 '23 00:09 YKoustubhRao

image

YKoustubhRao avatar Sep 26 '23 00:09 YKoustubhRao

@YKoustubhRao thanks for the screenshot

i've decided to automatically manage the condition if you pass in the binary temporal mask, as they describe in section 3.2, as cond_mask. it will also be auto-generated during training. during inference, you would construct the condition so as to zero out the section you would like to infill

lucidrains avatar Sep 26 '23 00:09 lucidrains
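
To make the masking idea above concrete, here is a rough sketch in plain Python, under stated assumptions. It builds a binary temporal mask over frames (True marks frames to infill) and zeroes out those frames to form the condition. The helper names `random_span_mask` and `zero_out_masked`, and the default span fractions, are hypothetical and are not the repo's API; in the actual library these would be tensors and the mask would be passed as `cond_mask`.

```python
import random


def random_span_mask(num_frames, min_frac=0.7, max_frac=1.0, rng=None):
    """Return a list of bools with one contiguous True span.

    Hypothetical training-time helper: the span length is drawn as a
    fraction of the sequence (the default fractions are illustrative).
    """
    rng = rng or random.Random()
    frac = rng.uniform(min_frac, max_frac)
    span = min(num_frames, max(1, round(num_frames * frac)))
    start = rng.randint(0, num_frames - span)
    return [start <= i < start + span for i in range(num_frames)]


def zero_out_masked(frames, mask):
    """Zero the frames where mask is True and keep the rest as context."""
    return [[0.0] * len(frame) if masked else list(frame)
            for frame, masked in zip(frames, mask)]


# Inference-style usage: hand-pick frames 2 and 3 as the region to infill.
frames = [[1.0, 2.0] for _ in range(6)]
infill = [2 <= i < 4 for i in range(6)]
cond = zero_out_masked(frames, infill)
```

During training the mask would be sampled randomly, as in `random_span_mask`; at inference you choose it yourself so that only the section you want regenerated is zeroed.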

@YKoustubhRao i will get the phoneme / duration / aligner stuff finished by end of week along with some training code

lucidrains avatar Sep 26 '23 00:09 lucidrains

Is there a pipeline for denoising and zero shot tts? @lucidrains

YKoustubhRao avatar Oct 05 '23 09:10 YKoustubhRao

Hello lucidrains, can you share your training script and data preparation code to make it easier to try? Thanks in advance.

blldd avatar Oct 07 '23 09:10 blldd

Any updates on this?

kdcyberdude avatar Nov 11 '23 12:11 kdcyberdude

Same question.

nrailg avatar Dec 13 '23 14:12 nrailg

ah, the code is all in there and @lucasnewman has already trained models successfully. i'll update the readme by end of week

lucidrains avatar Dec 13 '23 14:12 lucidrains

Hello. Will the weights be released?

Thank you

Subarasheese avatar Dec 22 '23 07:12 Subarasheese

Hey all, there's a small pretrained model available in this discussion thread: https://github.com/lucidrains/voicebox-pytorch/discussions/29#discussioncomment-7732769

All the training code is in the repo and I put the details for the training hyperparams in the thread, so training your own model should be as straightforward as instantiating the models, dataset, and trainer and calling train() -- if you're having issues, report back and I can try to help.

lucasnewman avatar Dec 22 '23 14:12 lucasnewman

@lucasnewman Thanks for your hyperparams and pretrained model. It can achieve acceptable results with a batch size of 32 and 100k steps on a 4090 GPU.

clcarwin avatar Dec 29 '23 04:12 clcarwin

Hey, can you send us sound samples?

shigabeev avatar Jan 16 '24 11:01 shigabeev

@shigabeev, @lucasnewman has some voice samples in the repo. You should be able to reproduce the same results. If you still need samples, let me know; I might be able to send you some.

wassimseif avatar Jan 18 '24 11:01 wassimseif

Yeah, I found his trained model on HF, and it sounds pretty good. However, I wasn't able to figure out how to run it in text-conditioned mode (TTS). Can you show me how to do it? Or can you just send some of your audio samples with TTS?

shigabeev avatar Jan 18 '24 11:01 shigabeev

@lucidrains I see that the ConditionalFlowMatcherWrapper class currently lacks support for a Duration Predictor. If you've already worked on this, would it be possible to add it? I'd really appreciate it! Thanks!!!!

iishapandey avatar Sep 13 '24 11:09 iishapandey

@iishapandey hey Isha, thanks for your interest

i would recommend that you take a look at this follow up research, where they best Voicebox with a simpler scheme

there i include a duration predictor

lucidrains avatar Sep 13 '24 14:09 lucidrains