How to add a new decoder layer after the GPT is created with the ::new call?

cutoken opened this issue on Jun 12, 2023 · 18 comments

hi @keyvank, let's say I want to add a new decoder layer (the kind that gets constructed inside the 0..num_layers loop) at run time, after the gpt::new() call. How do I go about it? As I understand it, you are just pushing the computations one by one, each with an incremented tensor id, so adding a layer at a later point in time means incrementing the ids of all subsequent layers as well (for example, adding one more decoder layer along with all its sub-layers like attention means incrementing the vocab out tensor and the other variables created outside the for loop)?

Also, why keep the computations in a BTreeMap when in practice it's used more like a Vec? We aren't even using the id under which each computation is stored (please correct me if I missed something :) )

cutoken avatar Jun 12 '23 05:06 cutoken

Good question. I have done this manually, but it's tricky. You have to disallow loading tensors coming after the last layer in your new model; this way you can migrate your old weights to your new model.

Computations are kept in a BTreeMap because not all tensors are the result of a computation (e.g. input or parameter tensors), and we also need to process them in sorted order.

keyvank avatar Jun 12 '23 09:06 keyvank
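
To make the BTreeMap point concrete, here is a minimal sketch with made-up types (not femtoGPT's actual definitions): the key space is sparse because input and parameter tensors have ids but no computation attached, and ordered iteration guarantees dependencies are evaluated before their consumers.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified types for illustration only.
type TensorId = usize;

struct Computation {
    inputs: Vec<TensorId>, // tensors this computation reads from
}

struct Graph {
    // Only *computed* tensors appear here; input/parameter tensors own
    // ids but have no entry, so the key space is sparse.
    computations: BTreeMap<TensorId, Computation>,
}

impl Graph {
    fn forward(&self) {
        // BTreeMap iterates in ascending key order, so every computation
        // sees its inputs (smaller ids) already evaluated.
        for (id, comp) in &self.computations {
            let _ = (id, &comp.inputs); // evaluate `comp`, store result under `id`
        }
    }
}

fn main() {
    let g = Graph { computations: BTreeMap::new() };
    g.forward();
}
```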

@cutoken Please check the latest commit: https://github.com/keyvank/femtoGPT/commit/9d9db580fd6a09d8ddc0b9e1557f353337e17010

keyvank avatar Jun 12 '23 10:06 keyvank

Yes, saw that :) So if I'm understanding this correctly: to add a new layer, I stop the training, set the optimizer to false, increase the number of layers, and restart the training?

cutoken avatar Jun 12 '23 10:06 cutoken

@cutoken Yes, you can turn the optimizer on again after it has saved the training data. (But it will start from step 0, which is maybe not very efficient!)

keyvank avatar Jun 12 '23 10:06 keyvank

It might start from step 0, but will the weights and biases of layer 1 and the other intermediate components from the previous training run stay as-is, or will they be reset to random?

cutoken avatar Jun 12 '23 10:06 cutoken

New layers are random; old layers keep their old weights.

keyvank avatar Jun 12 '23 10:06 keyvank

Got it. Now one more ask along similar lines: is there a way to not run the backward pass on a particular layer? Since backward passes are so costly, I want to exclude the earlier layers from training. Ideal would be some kind of layer-number list I can pass to the optimizer so that it just ignores the computations for those layers.

cutoken avatar Jun 12 '23 10:06 cutoken

You can't do it for an arbitrary "particular layer", but you can run the backward pass only from the last layer down to a particular layer.

keyvank avatar Jun 12 '23 10:06 keyvank
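
The general idea (a rough sketch with hypothetical names, not femtoGPT's actual API) is to walk the computations in reverse and stop once the tensor ids fall below the first trainable layer:

```rust
use std::collections::BTreeMap;

// Run the backward pass only from the last computation down to `cutoff_id`.
// Everything with a smaller id belongs to frozen, earlier layers; since
// gradients only flow backwards through decreasing ids, we can stop there.
fn backward_until<T>(computations: &BTreeMap<usize, T>, cutoff_id: usize) {
    for (id, comp) in computations.iter().rev() {
        if *id < cutoff_id {
            break; // frozen region reached, skip the rest
        }
        let _ = comp; // ...compute gradients and accumulate into inputs...
    }
}

fn main() {
    let computations: BTreeMap<usize, &str> = BTreeMap::new();
    backward_until(&computations, 100);
}
```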

Ya, that is what I actually want. For example, I started training with 2 layers and trained them to death :D (loss not reducing anymore), then restarted my training with 4 layers. Now I want the original layers 0 and 1 to skip the backward pass, while the new layers 2 and 3 (and their sub-components like self-attention heads) are trained normally. How do I achieve something like that?

cutoken avatar Jun 12 '23 10:06 cutoken

@cutoken Pushed something that should be useful for you. But be aware, this might confuse the Adam optimizer (since the gradients of the excluded layers will be zero).

keyvank avatar Jun 12 '23 11:06 keyvank
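
As a toy illustration of why all-zero gradients can interact oddly with Adam (standard Adam formulas, not femtoGPT code, and only under the assumption that the moment estimates are nonzero, e.g. carried over from earlier steps): the first-moment estimate decays slowly, so a weight can keep drifting for a while even though its gradient is exactly zero.

```rust
// Toy, self-contained Adam step with the usual hyperparameters.
fn adam_step(w: &mut f32, g: f32, m: &mut f32, v: &mut f32, t: i32) {
    let (lr, b1, b2, eps) = (1e-3_f32, 0.9_f32, 0.999_f32, 1e-8_f32);
    *m = b1 * *m + (1.0 - b1) * g;       // first moment (momentum)
    *v = b2 * *v + (1.0 - b2) * g * g;   // second moment
    let m_hat = *m / (1.0 - b1.powi(t)); // bias correction
    let v_hat = *v / (1.0 - b2.powi(t));
    *w -= lr * m_hat / (v_hat.sqrt() + eps);
}

fn main() {
    // Pretend this weight still has leftover moment estimates.
    let (mut w, mut m, mut v) = (0.5_f32, 0.2_f32, 0.04_f32);
    for t in 1..=5 {
        adam_step(&mut w, 0.0, &mut m, &mut v, t); // gradient is zero
        println!("step {t}: w = {w}"); // w still drifts away from 0.5
    }
}
```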

Saw your commit. But do the computations match the layers? :thinking: I mean, is the number of computations done in the backward pass always equal to the number of decoder layers? Only if they match would the commit work, I think @keyvank

cutoken avatar Jun 12 '23 11:06 cutoken

Okay, that zero-grad problem can be solved if we just clone the last layer. It's as good as any random value anyway :)

cutoken avatar Jun 12 '23 11:06 cutoken

@cutoken No, it's not. You have to calculate how many computations are done from your new layers up to the last computation.

keyvank avatar Jun 12 '23 11:06 keyvank

I think this is the formula:

n = 3 + ((10 * num_heads) + 12) * num_new_layers

where head_size = embedding_degree / num_heads

keyvank avatar Jun 12 '23 11:06 keyvank
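
Wrapped as a tiny helper (just the arithmetic of the formula above, with a made-up function name), so you can plug in your own settings:

```rust
// n = 3 + ((10 * num_heads) + 12) * num_new_layers
fn backward_computation_count(num_heads: usize, num_new_layers: usize) -> usize {
    3 + ((10 * num_heads) + 12) * num_new_layers
}

fn main() {
    // e.g. 4 heads and 2 newly added layers:
    // 3 + (10 * 4 + 12) * 2 = 3 + 52 * 2 = 107
    println!("{}", backward_computation_count(4, 2));
}
```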

Got it. I'm thinking of just storing the layer number in each computation during creation. That way there's no need for any calculation and it would work reliably. So the call function would take the layer number and layer type; if the layer type is decoder and the layer number is below the start layer/limit, we do nothing.

cutoken avatar Jun 12 '23 11:06 cutoken
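
A rough sketch of that proposal (hypothetical types, not femtoGPT's API): tag each computation with its layer at creation time, and have the backward pass skip decoder computations below a freeze limit.

```rust
#[derive(Clone, Copy, PartialEq)]
enum LayerKind {
    Decoder,
    Other,
}

struct TaggedComputation {
    layer: usize,    // which layer created this computation
    kind: LayerKind, // decoder layer vs. everything else
    // ... the actual op ...
}

fn is_frozen(c: &TaggedComputation, first_trainable_layer: usize) -> bool {
    // Only decoder computations below the limit are skipped during backward.
    c.kind == LayerKind::Decoder && c.layer < first_trainable_layer
}

fn main() {
    let c = TaggedComputation { layer: 1, kind: LayerKind::Decoder };
    assert!(is_frozen(&c, 2)); // layers 0 and 1 frozen, 2 and 3 trainable
    println!("layer {} is frozen", c.layer);
}
```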

Hmmmm, this could work, but I want this library to be general-purpose, and what you are proposing requires adding GPT logic into the Graph logic. You can just forget about the calculations and use a number like 200; it doesn't need to be accurate hah

keyvank avatar Jun 12 '23 11:06 keyvank

Yes. This doesn't make much sense unless we expose it as a feature of the library. For now, if I get it working I'll keep it as an add-on in my branch. My guess is that fine-tuning would be faster with frozen layers, but I won't know until I try the experiments :D I'll keep you posted on the results. If it is indeed faster, we can consider it for the main branch.

cutoken avatar Jun 12 '23 11:06 cutoken

Update on experimenting with this one:

  1. It does work in saving time.
  2. The savings, however, aren't going to be significant unless the model is deep.
  3. It has negligible impact on training quality; it will be even better once we remove the need for learning embeddings (through SentencePiece) and positional encodings (through sine and cosine) at the input layer.

cutoken avatar Jun 14 '23 07:06 cutoken