
LoRA Fine Tuning

Open danablend opened this issue 11 months ago • 36 comments

This is just the first draft so we can start building this feature.

  • Added dataloader.py, which loads data for training
  • Added train.py, with the current training loop
  • Added lora.py, for LoRA wrapper of the stage 1 Transformer
  • Added dummy_dataset folder with 25 data samples to work with when testing (VCTK-->p311)
  • Commented out the initial inference code when stage 1 model is built.

There is no batch processing in the training loop currently (I was getting a dimension mismatch in the KVCache.update function; probably not that difficult to solve).

The dataloader works fine, but everything else requires work. This is just a dirty initial draft so we can start working on this thing together! :-)

Hope to hear some insights! I have time to put into this feature, so any pointers would be great! Am not super well-versed in AI, so bear with me.

danablend avatar Mar 04 '24 15:03 danablend

I'm going to try getting a very basic thing to run. Currently, there are a couple issues:

  • LoRA parameter sizes don't match the token embeddings and output layers, which causes a tensor dimension mismatch error. Also, I'm not sure these are the two layers we actually want to add LoRAs to. We'll probably need to test a lot, including figuring out what the rank value should be.
  • def _train_infer_stage_one breaks

Can make it clean once we've got a basic loop running successfully
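For context, the wrapper is roughly along these lines, adapted from nanoGPT-LoRA (a minimal sketch; LoRALinear, rank and alpha here are illustrative names rather than the exact lora.py API):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (B A) x * scale."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0, dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weight stays frozen

        self.lora_a = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init => no-op at start
        nn.init.normal_(self.lora_a, std=0.02)
        self.scaling = alpha / rank
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.dropout(x) @ self.lora_a.T @ self.lora_b.T  # (..., out_features)
        return self.base(x) + delta * self.scaling
```

Swapping a layer is then just model.some_linear = LoRALinear(model.some_linear, rank=8), and only lora_a / lora_b end up in the optimiser.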

danablend avatar Mar 04 '24 15:03 danablend

I'm going to try getting a very basic thing to run. Currently, there are a couple issues:

LoRA parameter sizes don't match the token embeddings and output layers, which causes a tensor dimension mismatch error. Also, I'm not sure these are the two layers we actually want to add LoRAs to. We'll probably need to test a lot, including figuring out what the rank value should be. def _train_infer_stage_one breaks Can make it clean once we've got a basic loop running successfully

Got you. Are you not currently having to load all the models onto GPU VRAM though? Given the constraint of 16GB of VRAM, I would probably:

  • only work with the first stage (this should reduce memory usage significantly as the diffusion model is ~5GB itself)
  • swap out the Adam optimiser for an SGD optimiser (just to get things running - this should reduce memory usage significantly; see the sketch after this list)
  • finetune only the first layer, ignore the token embeddings/logit output layer
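To be concrete about the optimiser swap (just a sketch; the learning rate is a placeholder):

```python
import torch
from torch import nn


def build_optimizer(model: nn.Module, lr: float = 1e-3) -> torch.optim.Optimizer:
    # Adam keeps two extra state tensors per trainable parameter (exp_avg, exp_avg_sq);
    # plain SGD with momentum=0 keeps none, which is where the memory saving comes from.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=0.0)
```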

vatsalaggarwal avatar Mar 04 '24 16:03 vatsalaggarwal

I'm going to try getting a very basic thing to run. Currently, there are a couple issues: LoRA parameter sizes don't match the token embeddings and output layers, which causes a tensor dimension mismatch error. Also, I'm not sure these are the two layers we actually want to add LoRAs to. We'll probably need to test a lot, including figuring out what the rank value should be. def _train_infer_stage_one breaks Can make it clean once we've got a basic loop running successfully

Got you. Are you not currently having to load all the models onto GPU VRAM though? Given the constraint of 16GB of VRAM, I would probably:

  • only work with the first stage (this should reduce memory usage significantly as the diffusion model is ~5GB itself)
  • swap out the Adam optimiser for an SGD optimiser (just to get things running - this should reduce memory usage significantly)
  • finetune only the first layer, ignore the token embeddings/logit output layer

Hey @vatsalaggarwal, these are all very helpful insights - much appreciated!

I will implement the points you mentioned:

  • Only work with first stage
  • Modify dataloader to return the first two hierarchies of the encodec tokens
  • Swap over to SGD optimizer to save on memory
  • Only fine tune the first layer of the first stage to adapt to new speakers (the use case)

Makes sense to just work with stage 1 with cross entropy loss, I just wasn't able to get there - so thanks for that insight.

Should be able to commit these changes later today or tomorrow!

danablend avatar Mar 04 '24 18:03 danablend

Makes sense to just work with stage 1 with cross entropy loss, I just wasn't able to get there - so thanks for that insight.

Not sure what you mean, but does https://github.com/karpathy/nanoGPT/blob/master/model.py#L184 help?
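The pattern at that line is essentially the following (paraphrased from nanoGPT, not our exact code):

```python
import torch
import torch.nn.functional as F


def next_token_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (B, T, vocab_size) from the stage 1 transformer,
    # targets: (B, T) token ids, with -1 at positions that shouldn't contribute to the loss.
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # (B*T, vocab_size)
        targets.view(-1),                  # (B*T,)
        ignore_index=-1,
    )
```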

vatsalaggarwal avatar Mar 05 '24 22:03 vatsalaggarwal

Makes sense to just work with stage 1 with cross entropy loss, I just wasn't able to get there - so thanks for that insight.

Not sure what you mean, but does https://github.com/karpathy/nanoGPT/blob/master/model.py#L184 help?

It massively helped - I've not worked with transformer networks before (only diffusion & GANs), so studying nanoGPT after you sent it today was very helpful!

Am just making a few more changes for today, then I'm pushing an update :-)

danablend avatar Mar 06 '24 00:03 danablend

Just committed - here's an overview:

  • Switched from Adam to SGD optimizer
  • Modified DataLoader to return first two Encodec token hierarchies as a flattened interleaved tensor (let me know if that looks ok to you?)
  • Modified LoRA wrapper to only fine tune speaker_cond_pos layer
    • In nanoGPT-LoRA they fine tune the causal attention layer (https://github.com/danielgrittner/nanoGPT-LoRA/blob/master/model.py#L128). Would it be worth trying something similar with the attention layer here?
  • Modified training loop
    • Forward pass entire batch at a time
    • Wasn't able to format input_pos as (B, T) due to tensor dimension mismatch in KVCache. Unsure if this causes downstream effects or if that's fine?
    • The loss function call causes an error right now. We need to match the GT labels with the model probs. I need some direction here, because I'm not quite sure what the "unit" of the model's raw output is, or what format the labels and model outputs need to have to correctly calculate the loss. I assumed the model would output something like a probability distribution of shape (B, S, V), B=batch_size, S=gen_encodec_tokens_count, V=vocab_size. But it's (B, T, V), with T=text_prompt_token_count. Not sure exactly how to get the data into the right format here for loss calculation, but the remainder of this implementation should be straightforward once this works.

EDIT1: After sleeping on it, I think it is an issue with the way the GT Encodec tokens are extracted and prepared in the DataLoader which is causing the problem, and not the format of the model output.

EDIT2: Was wondering if you have more detailed documentation about the model architecture / diagrams to help further understanding?

danablend avatar Mar 06 '24 01:03 danablend

I might start to understand a little bit here.

Would the idea be to generate a single token at a random location for each batch by giving the model: (prompt) + (random number of GT encodec tokens) as the input?

So we would concat (B, P) and (B, T_random) prompt and encodec tokens respectively, and then use that to generate (B, V) tensor logits for B tokens per batch?

Then input_pos = torch.arange(0, P+T_rand) for a given T_rand ∈ [0, T_max - 1].

This would then give us prediction tensor for the batch (B, V), for which we would calculate the loss.

Then the GT labels would be (B, 1), where we take (P+T_rand) on the 2nd dimension from the GT encodec tokens (the next GT token in the sequence that the model is supposed to predict)?

Is that the right intuition?

If so, would the GT Encodec labels be determined by the raw output of the EncodecModel.encode function, or do we need to run it through the first stage adapter to get the correct vocab indices? If you're not sure, I'll see if I can find out.

Again, thanks a lot for pointing me in the right direction, it's very helpful given my limited knowledge!

danablend avatar Mar 06 '24 10:03 danablend

I've modified the training loop to be as I described above, and it seems to work (although I'm not 100% sure that the GT Encodec indices are correctly determined).

Somewhere I have made a mistake that causes tensors to be retained in the computational graph across batches, which leads to: "RuntimeError: Trying to backward through the graph a second time" on the 2nd iteration. Maybe you would be able to see where I made this mistake @vatsalaggarwal?

Let me know any thoughts! :-)

Script can be launched with mixed precision using accelerate now:

  • accelerate launch train.py --mixed-precision bfloat16 for bfloat16
  • accelerate launch train.py --mixed-precision float16 for float16

danablend avatar Mar 06 '24 12:03 danablend

Still some work to do in the dataloader to ensure proper windows (aka blocks in nanoGPT) of X, Y data are prepared for good training. Right now useless data might be used due to randomly selecting padded zero values. This can be fixed in the dataloader, or I might end up creating a get_batch function similar to that of nanoGPT.

Haven't been able to solve the error which occurs when calling .backward for the 2nd time: "RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed)."

I'm sure it's a very obvious error I've made, but I have not been able to isolate it.

danablend avatar Mar 06 '24 18:03 danablend

I think I've isolated the backward error (see above) to the caching mechanism in the model. Will work on a solution today and then we should be able to have something training

EDIT: It is definitely the caching mechanism. I just got it to run when clearing and detaching K,V tensors in KVCache between iterations.
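For reference, the workaround looks roughly like this (a sketch; it assumes the cache keeps its tensors under k_cache / v_cache attributes as in gpt-fast-style code, the actual names may differ). I call it between iterations, right after the optimiser step.

```python
import torch
from torch import nn


def reset_kv_caches(model: nn.Module) -> None:
    """Detach and zero cached K/V tensors so the previous batch's graph isn't kept alive."""
    for module in model.modules():
        k = getattr(module, "k_cache", None)
        v = getattr(module, "v_cache", None)
        if isinstance(k, torch.Tensor) and isinstance(v, torch.Tensor):
            module.k_cache = k.detach().zero_()
            module.v_cache = v.detach().zero_()
```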

danablend avatar Mar 07 '24 08:03 danablend

in the middle of finishing something, haven't had time to look at this, will do it soon, sorry!

vatsalaggarwal avatar Mar 07 '24 08:03 vatsalaggarwal

in the middle of finishing something, haven't had time to look at this, will do it soon, sorry!

Completely understand, no pressure! I'm just posting updates for whenever you have the time :-)

danablend avatar Mar 07 '24 08:03 danablend

the updates are super helpful though!

vatsalaggarwal avatar Mar 07 '24 08:03 vatsalaggarwal

I have the model training the LoRA layers now, but the data preparation process is currently garbage and I'm probably also calculating the loss with unit mismatch between the prompt input, inference output, and GT labels.

Will play with this for a bit, but I might need some help correctly interpreting the "units" of the variables here so we can prepare them correctly for the prompt as well as for the loss function.

Gonna draw some inspiration from nanoGPT and improve the data preparation, and then we're getting close to running some tests.

danablend avatar Mar 07 '24 09:03 danablend

For this input (b, t, vocab_size), would b be the 2 predicted hierarchies of encodec tokens and not the batch size?

b is batch size, t is timesteps, vocab_size is vocab_size... the hierarchies are flattened and interleaved into one for the first stage model

My intuition is wrong here, and the output of the model is (2, S, V) where I would expect (B, 2, S, V)

does my previous comment help?

I tried this, but it causes dimension mismatch in KVCache and Attention

I will work on making KVCache and Attention compatible with a (B, S) input_pos, but you know much more about this than me, so maybe you could share your insight on this?

KVCaching isn't relevant during training, that should be switched off...
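i.e. during training you just push the whole (B, S) sequence through with a causal mask and never call KVCache.update... conceptually something like this (illustrative, not our exact attention code):

```python
import torch
import torch.nn.functional as F


def attention_for_training(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (B, n_heads, S, head_dim) computed over the full training sequence.
    # is_causal=True applies the same masking that incremental decoding gets from the cache,
    # but keeps no state between iterations, so there's nothing to detach.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```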

vatsalaggarwal avatar Mar 07 '24 16:03 vatsalaggarwal

I think I've isolated the backward error (see above) to the caching mechanism in the model.

this shouldn't be used during training

Haven't been able to solve the error which occurs when calling .backward for the 2nd time

Do you mean for the second iteration? Have you zeroed gradients?

I might start to understand a little bit here.

Would the idea be to generate a single token at a random location for each batch by giving the model: (prompt) + (random number of GT encodec tokens) as the input?

So we would concat (B, P) and (B, T_random) prompt and encodec tokens respectively, and then use that to generate (B, V) tensor logits for B tokens per batch?

Then input_pos = torch.arange(0, P+T_rand) for a given T_rand ∈ [0, T_max - 1].

This would then give us prediction tensor for the batch (B, V), for which we would calculate the loss.

Then the GT labels would be (B, 1), where we take (P+T_rand) on the 2nd dimension from the GT encodec tokens (the next GT token in the sequence that the model is supposed to predict)?

Is that the right intuition?

Didn't get this... best thing might be to work through the NanoGPT training loop... but rough idea is to create a row of (B, S) which contains all the text and audio tokens concatenated together. Then, you apply next-token pred. This does the right thing because of causal masking.
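In other words, per utterance (shapes only; illustrative):

```python
import torch


def make_example(text_tokens: torch.Tensor, audio_tokens: torch.Tensor):
    # One row = all text tokens followed by all audio tokens for that utterance.
    # The input is the row minus its last token, the target is the row shifted left by one;
    # causal masking means position t only attends to positions <= t, so this is
    # ordinary next-token prediction over the whole row.
    seq = torch.cat([text_tokens, audio_tokens])
    x, y = seq[:-1], seq[1:]
    return x, y  # both 1-D; stack rows to get (B, S) for a batch
```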

vatsalaggarwal avatar Mar 07 '24 16:03 vatsalaggarwal

I think I've isolated the backward error (see above) to the caching mechanism in the model.

this shouldn't be used during training

Haven't been able to solve the error which occurs when calling .backward for the 2nd time

Do you mean for the second iteration? Have you zeroed gradients?

I might start to understand a little bit here. Would the idea be to generate a single token at a random location for each batch by giving the model: (prompt) + (random number of GT encodec tokens) as the input? So we would concat (B, P) and (B, T_random) prompt and encodec tokens respectively, and then use that to generate (B, V) tensor logits for B tokens per batch? Then input_pos = torch.arange(0, P+T_rand) for a given T_rand ∈ [0, T_max - 1]. This would then give us prediction tensor for the batch (B, V), for which we would calculate the loss. Then the GT labels would be (B, 1), where we take (P+T_rand) on the 2nd dimension from the GT encodec tokens (the next GT token in the sequence that the model is supposed to predict)? Is that the right intuition?

Didn't get this... best thing might be to work through the NanoGPT training loop... but rough idea is to create a row of (B, S) which contains all the text and audio tokens concatenated together. Then, you apply next-token pred. This does the right thing because of causal masking.

Thanks for explaining, that makes sense. I've implemented it, similar to NanoGPT. Pushing momentarily.

danablend avatar Mar 07 '24 23:03 danablend

The LoRA layer (1st layer in model) is 98k trainable parameters. Will try training now to validate the current code.

danablend avatar Mar 08 '24 19:03 danablend

Currently sweeping learning rate and LoRA rank, alpha, dropout. Graphs look like this so far: [screenshot: sweep loss curves]

I'm unsure whether the data is fed into the model properly @vatsalaggarwal. I measure very high loss on the frozen foundational model (loss 8 - 12) when evaluating the model, but the audio sounds just fine. During training of LoRA layers I can reduce loss down to 4-5, indicating to me that the LoRAs work but that the data is formatted differently than how it was formatted when training the foundational model.

The data is prepared like in nanoGPT but with audio tokens added after the text tokens:

  1. Loop through all text & wav files of dataset
  2. Tokenize text with BPE
  3. Extract Encodec tokens from the wav with the pretrained EncodecModel (bandwidth=6.0 kbps), taking only the first 2 hierarchies (first 2 indices) of the predicted Encodec tokens. Then add 1024 to every token of the 2nd hierarchy so the hierarchies stay separated once flattened and interleaved (i.e. every 2nd element, starting from the 2nd idx). This is similar to how the predicted tokens came out when I debugged the demo with app.py.
  4. Sequence = <tokenized text + encodec tokens + EOS> concatenated. Here, I have tried EOS=1024 and EOS=2048, but it didn't make a big difference to the loss measurements. I also tried appending EOS between the tokenized text and the encodec tokens, but this also made little difference in the loss evaluation on the foundational model.
  5. Concat the Sequence onto the FULL dataset tensor (one huge 1-dimensional tensor with EOS between each data point), like in nanoGPT (see the sketch after this list).
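The per-utterance preparation in code is roughly the following (a sketch of steps 1-5; prepare_sequence is a made-up helper name, the BPE tokenizer call is left out, and the mono/resampling handling is simplified, but the Encodec calls are the facebookresearch/encodec API):

```python
import torch
import torchaudio
from encodec import EncodecModel


def prepare_sequence(
    text_tokens: torch.Tensor,   # (T_text,) LongTensor from the BPE tokenizer (steps 1-2)
    wav_path: str,
    codec: EncodecModel,         # EncodecModel.encodec_model_24khz() with set_target_bandwidth(6.0)
    eos_token: int = 2048,
) -> torch.Tensor:
    # Step 3: Encodec tokens, first two hierarchies only, 2nd hierarchy offset by 1024.
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(0, keepdim=True)                                    # force mono
    wav = torchaudio.functional.resample(wav, sr, codec.sample_rate)
    with torch.no_grad():
        frames = codec.encode(wav.unsqueeze(0))                        # list of (codes, scale)
    codes = torch.cat([c for c, _ in frames], dim=-1)[0]               # (n_q, T_audio)
    h1, h2 = codes[0], codes[1] + 1024
    audio_tokens = torch.stack([h1, h2], dim=1).flatten()              # h1[0], h2[0], h1[1], h2[1], ...
    # Steps 4-5: <text + audio + EOS>, ready to be concatenated onto the flat dataset tensor.
    return torch.cat([text_tokens, audio_tokens, torch.tensor([eos_token])])
```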

During Training:

  1. A sliding window of block_size length randomly selects a window from the full dataset tensor, feeds it into the model and performs next-token pred.
  2. Loss is calculated with Cross Entropy on the raw logits, and the target idxs are obtained by shifting the input window +1 to the right on our data to get the sequence that the model is supposed to predict, like in nanoGPT.

  • input_pos is always set to torch.arange(0, block_size), since we are sliding a block_size window over the data to get the input (a get_batch-style sketch is below)
  • block_size = 2048
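The batch selection is basically nanoGPT's get_batch (sketch; data is the flat dataset tensor described above):

```python
import torch


def get_batch(data: torch.Tensor, block_size: int = 2048, batch_size: int = 4):
    # Pick batch_size random windows of block_size tokens from the flat dataset tensor;
    # targets are the same windows shifted one token to the right.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])            # (B, block_size)
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])    # (B, block_size)
    input_pos = torch.arange(0, block_size)                            # same for every window
    return x, y, input_pos
```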

Is this the same way that the foundational model was trained @vatsalaggarwal? I would just expect the loss to start out very low, given its ability to generalize extremely well. Were there any nuances when you trained the model that this doesn't account for?

danablend avatar Mar 11 '24 13:03 danablend

I think @lucapericlp is close to turning this into a working solution (without LoRA), so might be better to add LoRA to that... hopefully should be out by EOD.

The way you mentioned works for training a text-only model, but we had some trouble training text+audio models that way... so the data formats etc were slightly different; hopefully the upcoming PR will clarify!

vatsalaggarwal avatar Mar 12 '24 11:03 vatsalaggarwal

I think @lucapericlp is close to turning this into a working solution (without LoRA), so might be better to add LoRA to that... hopefully should be out by EOD.

The way you mentioned works for training a text-only model, but we had some trouble training text+audio models that way... so the data formats etc were slightly different; hopefully the upcoming PR will clarify!

That is great to hear! I am very interested to see how the solution looks.

Should quickly be able to follow up with added LoRAs to @lucapericlp's solution once pushed. Can run some sweeps to find us the best configuration as well :-)

Thanks!

danablend avatar Mar 12 '24 13:03 danablend

I think @lucapericlp is close to turning this into a working solution (without LoRA), so might be better to add LoRA to that... hopefully should be out by EOD.

The way you mentioned works for training a text-only model, but we had some trouble training text+audio models that way... so the data formats etc were slightly different; hopefully the upcoming PR will clarify!

Eagerly awaiting this as well! Was fun reading the progress here too @danablend :)

makorihi avatar Mar 12 '24 17:03 makorihi

@danablend / @makorihi check out https://github.com/metavoiceio/metavoice-src/pull/93#pullrequestreview-1934338050

vatsalaggarwal avatar Mar 13 '24 14:03 vatsalaggarwal

@danablend / @makorihi check out #93 (review)

Have just been reading through it! I'll get it up on my system and add LoRAs to this as soon as possible, hopefully today or tomorrow!

This is great - cheers @lucapericlp @vatsalaggarwal.

danablend avatar Mar 13 '24 14:03 danablend

@danablend I think the main thing to keep in mind is: each row is composed of "text tokens | audio tokens | padding"... I think you were taking arbitrary segments of these (per NanoGPT) as that's what they do in the text-LLM trainings, but it doesn't work so great for text->speech training..
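Per utterance that's roughly (sketch; pad_token=2048 as in the data prep code, and padded positions are masked out of the loss with -1 so cross_entropy(ignore_index=-1) skips them):

```python
import torch


def make_row(
    text_tokens: torch.Tensor,
    audio_tokens: torch.Tensor,
    block_size: int = 2048,
    pad_token: int = 2048,
):
    # One utterance per row: "text tokens | audio tokens | padding".
    # No window ever straddles two utterances, unlike the nanoGPT-style sliding window.
    seq = torch.cat([text_tokens, audio_tokens])[: block_size + 1]
    pad = torch.full((block_size + 1 - len(seq),), pad_token, dtype=seq.dtype)
    seq = torch.cat([seq, pad])
    x, y = seq[:-1].clone(), seq[1:].clone()
    y[y == pad_token] = -1   # don't compute loss on padding
    return x, y              # each (block_size,)
```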

vatsalaggarwal avatar Mar 13 '24 14:03 vatsalaggarwal

@vatsalaggarwal

@danablend I think the main thing to keep in mind is: each row is composed of "text tokens | audio tokens | padding"... I think you were taking arbitrary segments of these (per NanoGPT) as that's what they do in the text-LLM trainings, but it doesn't work so great for text->speech training..

~~looking at the README in the other PR, I see~~

audio_files|captions
./data/audio.wav|./data/caption.txt

~~but you mention there's a | padding necessary in the finetune dataset as well? or is that post-ingestion within the system~~

ah, sorry, I misunderstood 😅

makorihi avatar Mar 13 '24 14:03 makorihi

@vatsalaggarwal

@danablend I think the main thing to keep in mind is: each row is composed of "text tokens | audio tokens | padding"... I think you were taking arbitrary segments of these (per NanoGPT) as that's what they do in the text-LLM trainings, but it doesn't work so great for text->speech training..

looking at the README in the other PR, I see

audio_files|captions
./data/audio.wav|./data/caption.txt

but you mention there's a | padding necessary in the finetune dataset as well? or is that post-ingestion within the system

I believe padding is appended in the data preparation step here. Padding token is 2048 as set here

danablend avatar Mar 13 '24 14:03 danablend

I have gotten LoRAs to train based off of @lucapericlp's awesome work.

Gonna clean it up and prepare for review. Probably going to happen over the weekend.

danablend avatar Mar 14 '24 22:03 danablend

I have gotten LoRAs to train based off of @lucapericlp's awesome work.

Gonna clean it up and prepare for review. Probably going to happen in the weekend.

NICE! what sort of losses are you getting, and audio?

Did you use https://github.com/huggingface/peft, or did you wrap your own as in this PR in the end?

vatsalaggarwal avatar Mar 14 '24 22:03 vatsalaggarwal

I have gotten LoRAs to train based off of @lucapericlp's awesome work. Gonna clean it up and prepare for review. Probably going to happen in the weekend.

NICE! what sort of losses are you getting, and audio?

Did you use https://github.com/huggingface/peft, or did you wrap your own as in this PR in the end?

I've seen losses down at around 0.42 with 15m parameters, just on the last block's attention layer. I haven't actually written the code that loads the model with added LoRA layers yet, so I'm going to get some audio samples once that's there.

I took the LoRA from earlier, which is an adaptation of the one from nanoGPT-LoRA. Would you prefer if I used the PEFT library from HF? Could probably change it without too much hassle. It's not much extra code, now that the fine-tuning & data loading have been cracked.
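With peft it would look roughly like this (sketch, assuming model is the loaded stage 1 transformer; the target_modules names are placeholders and would need to match the actual attention projection module names):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["wqkv", "wo"],  # placeholders for the stage 1 attention projections
)
model = get_peft_model(model, lora_config)  # freezes base weights, injects trainable A/B matrices
model.print_trainable_parameters()

# After training, model.merge_and_unload() folds the LoRA deltas back into the base weights
# for inference without the wrapper.
```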

danablend avatar Mar 15 '24 11:03 danablend