femtoGPT
How to add a new decoder layer after the GPT is created with the ::new call?
Hi @keyvank, let's say I want to add a new decoder layer (the one that gets constructed as part of the 0..num_layers loop) at run time, after the gpt::new() call. How do I go about it? As I understand it, you are just pushing the computations one by one with incrementing tensor ids, so adding a layer at a later point in time would also mean incrementing the ids for the subsequent layers (for example, adding one more decoder layer, along with all its sub-layers like attention, means incrementing the vocab-out and other variables outside the for loop?).
Also, why keep the computations in a BTreeMap when in practice it's being used more like a Vec? We aren't even using the id against which each computation is stored (please correct me if I missed something :) ).
Good question. I have done this manually, but it's tricky. You have to disallow loading tensors that come after the last layer in your new model; this way you can migrate your old weights to your new model.
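For concreteness, here's a minimal sketch of that migration idea, assuming a checkpoint is just tensor data keyed by sequential tensor ids. The names (`migrate_weights`, `TensorData`, `last_old_id`) are illustrative, not femtoGPT's actual API:

```rust
use std::collections::BTreeMap;

/// Illustrative stand-in for a saved tensor (femtoGPT's real type differs).
type TensorData = Vec<f32>;

/// Hypothetical helper: copy old checkpoint tensors into a freshly built,
/// deeper model, but only up to the last tensor id that existed before the
/// new layers were appended. Everything past `last_old_id` keeps its fresh
/// random initialization.
fn migrate_weights(
    old_checkpoint: &BTreeMap<usize, TensorData>,
    new_params: &mut BTreeMap<usize, TensorData>,
    last_old_id: usize,
) {
    for (&id, data) in old_checkpoint {
        if id > last_old_id {
            // Tensors after the boundary belong to the layers that moved or
            // were replaced; skip them so they stay randomly initialized.
            break;
        }
        if let Some(slot) = new_params.get_mut(&id) {
            *slot = data.clone();
        }
    }
}

fn main() {
    let old: BTreeMap<usize, TensorData> =
        BTreeMap::from([(0, vec![0.1]), (1, vec![0.2]), (2, vec![0.3])]);
    let mut fresh: BTreeMap<usize, TensorData> =
        BTreeMap::from([(0, vec![0.0]), (1, vec![0.0]), (2, vec![0.0]), (3, vec![9.9])]);
    migrate_weights(&old, &mut fresh, 1);
    assert_eq!(fresh[&0], vec![0.1]); // old weight carried over
    assert_eq!(fresh[&3], vec![9.9]); // new-layer tensor untouched
}
```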
Computation is a BTreeMap because not all tensors are the result of a computation (e.g. input or parameter tensors), and we also need to process them in sorted order.
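A rough illustration of why the sorted keys matter (the `Computation` enum here is a toy stand-in, not femtoGPT's real type):

```rust
use std::collections::BTreeMap;

// Tensor ids are handed out sequentially, but only some of them are produced
// by a computation; inputs and parameters get an id with no computation
// attached, so a Vec indexed by tensor id would not line up.
#[derive(Debug)]
enum Computation {
    Add(usize, usize),    // ids of the operand tensors
    MatMul(usize, usize),
}

fn main() {
    // Keyed by the id of the *output* tensor. Ids 0..=2 are inputs/params,
    // so they never appear as keys here.
    let mut computations: BTreeMap<usize, Computation> = BTreeMap::new();
    computations.insert(3, Computation::MatMul(0, 1));
    computations.insert(4, Computation::Add(3, 2));

    // A forward pass must evaluate computations in ascending tensor-id order
    // so operands are ready before they are used; BTreeMap gives that order.
    for (out_id, comp) in &computations {
        println!("compute tensor {out_id} from {comp:?}");
    }
}
```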
@cutoken Please check the latest commit: https://github.com/keyvank/femtoGPT/commit/9d9db580fd6a09d8ddc0b9e1557f353337e17010
Yes, saw that :) So if I'm understanding this correctly, to add a new layer I stop the training, set the optimizer flag to false, increase the number of layers, and restart the training?
@cutoken Yes, you can turn the optimizer on again after it has saved the training data. (But it will start from step 0, which is maybe not very efficient!)
It might start from step 0, but will the weights, biases, and other intermediate components of layer 1 from the earlier training run stay as they are, or will they be reset to random?
New layers are random; old layers keep their old weights.
Got it. Now one more question along similar lines: is there a way to not run the backward pass on a particular layer? Since backward passes are so costly, I want to exclude the earlier layers from training. Ideally I could pass some kind of layer-number list to the optimizer so that it just ignores the computations for those layers.
You can't do it for a "particular layer", but you can do it from the last layer back to a particular layer.
Yeah, that is actually what I want. For example, I started training with 2 layers and trained them to death :D (loss not reducing anymore), then restarted training with 4 layers. Now I want the original layers 0 and 1 to skip the backward pass, while the new layers 2 and 3 (and their sub-components like the self-attention heads etc.) are trained normally. How do I achieve something like that?
@cutoken Pushed something that should be useful for you. But be aware, this might confuse the Adam optimizer (since the gradients of the other layers will be zero).
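For readers following along, here's a minimal sketch of the general idea of truncating the backward pass to the last K computations; the names and types are illustrative, not the API from the commit above:

```rust
use std::collections::BTreeMap;

// Because computations are stored in ascending tensor-id order, walking the
// map in reverse and stopping after `limit` entries leaves every earlier
// layer's gradient at zero, i.e. effectively frozen.
fn backward_last_k(
    computations: &BTreeMap<usize, &'static str>,
    grads: &mut BTreeMap<usize, f32>,
    limit: usize,
) {
    for (i, (&out_id, op)) in computations.iter().rev().enumerate() {
        if i >= limit {
            break; // earlier computations are skipped: their grads stay 0
        }
        // Stand-in for the real chain-rule step for `op`.
        *grads.entry(out_id).or_insert(0.0) += 1.0;
        let _ = op;
    }
}

fn main() {
    let comps: BTreeMap<usize, &'static str> =
        BTreeMap::from([(3, "matmul"), (4, "add"), (5, "softmax")]);
    let mut grads = BTreeMap::new();
    backward_last_k(&comps, &mut grads, 2);
    assert!(grads.get(&3).is_none()); // oldest computation untouched
}
```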
Saw your commit. But do the computations match the layers? :thinking: I mean, is the number of computations done in the backward pass always equal to the number of decoder layers? Only if it is would the commit work, I think @keyvank.
Okay, that zero-grad problem can be solved if we just clone the last layer. It's as good as any random value anyway :)
@cutoken No, it's not. You have to calculate how many computations are done from your new layers up to the last computation.
I think this is the formula:
`n = 3 + ((10 * num_heads) + 12) * num_new_layers`
where `head_size = embedding_degree / num_heads`
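Written out in code (the constants come straight from the formula above; double-check them against your femtoGPT version before relying on the result):

```rust
// Number of computations contributed by the new layers plus the final few
// computations after the decoder stack, per the formula quoted above.
fn backward_limit(num_heads: usize, num_new_layers: usize) -> usize {
    3 + (10 * num_heads + 12) * num_new_layers
}

fn main() {
    // e.g. 4 heads, 2 freshly added layers
    assert_eq!(backward_limit(4, 2), 3 + (10 * 4 + 12) * 2); // = 107
    println!("{}", backward_limit(4, 2));
}
```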
Got it. I'm thinking of just storing the layer number in the computation during creation. That way there's no need for any calculation and it would work reliably. So the call function would take the layer number and the layer type; if the layer type is decoder and the layer number is below the start-layer limit, we do nothing.
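A rough sketch of what that tagging could look like (hypothetical types; not femtoGPT's actual Graph or call signature):

```rust
// Each computation records which decoder layer it belongs to, and the
// backward pass skips anything below a "first trainable layer" threshold.
#[derive(Clone, Copy)]
enum LayerKind {
    Decoder(usize), // decoder layer index
    Other,          // embeddings, final head, etc.
}

struct TaggedComputation {
    kind: LayerKind,
    // ... the actual op and operand ids would live here ...
}

fn should_backprop(comp: &TaggedComputation, first_trainable_layer: usize) -> bool {
    match comp.kind {
        LayerKind::Decoder(l) => l >= first_trainable_layer,
        LayerKind::Other => true,
    }
}

fn main() {
    let frozen = TaggedComputation { kind: LayerKind::Decoder(0) };
    let trained = TaggedComputation { kind: LayerKind::Decoder(2) };
    let head = TaggedComputation { kind: LayerKind::Other };
    assert!(!should_backprop(&frozen, 2));
    assert!(should_backprop(&trained, 2));
    assert!(should_backprop(&head, 2));
}
```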
Hmmmm, this could work, but I want this library to be general-purpose, and what you are proposing means adding GPT logic into the Graph logic. You can just forget about the calculations and use a number like 200; it doesn't need to be accurate hah
Yes, this doesn't make much sense unless we expose it as functionality in the library. For now, if I get it working, I'll keep it as an add-on in my branch. My guess is that fine-tuning would be faster with frozen layers, but I won't know unless I try the experiments :D. I'll keep you posted on the results. If it is indeed faster, we can consider it for the main branch.
Update on experimenting with this one:
- It does save time
- The savings, however, aren't going to be high unless the model is deep
- It has negligible impact on training quality; it will be even better once we remove the need for learning embeddings (via sentencepiece) and positional encodings (via sine and cosine) at the input layer.