lit-llama
Adapter small fix
Hi there 👋
As @carmocca mentioned in PR #352, a few code changes need to be made:
- Change `self.n_embd` --> `C`, since this value is extracted from the shape of the input variable `x` right at the beginning of the `forward` method (see the sketch after this list).
- Prettify the reshaping of the prefix.
- This one is a biggie: `vocab_size` --> `padded_vocab_size`, to align it with `lit_llama/model.py`. I assume checkpoints won't go south after this, since it is just an expansion in size for better performance (I believe up to 25%). With shrinkage it would be a whole other story.
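For the first item, here is a minimal sketch of the pattern (illustrative only, not the actual `lit_llama/adapter.py` code; the module and parameter names are assumptions): `C` is read from the shape of `x` at the top of `forward`, so `self.n_embd` is redundant inside the method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Sketch of a causal self-attention block; names are illustrative."""

    def __init__(self, n_embd: int, n_head: int) -> None:
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.c_proj = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # B: batch size, T: sequence length, C: embedding size.
        # C comes from the input shape, so self.n_embd is not needed here.
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, C)
        return self.c_proj(y)
```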
Despite seeing all the green lights for merging, please don't merge it just yet. Tomorrow I want to check how the weights are copied, to be extra sure. Right now I don't feel confident that all the adapter weights will be copied without issues and that the model will behave as expected.
Thanks. Before we land this, I'd like to run the finetuning to make sure it is still training as expected. I'll do that in the next day or so.
I don't have a GPU (yeah, I know 😄), so I want to apologize in advance for any stupid questions/suggestions. Basically, the problem is that I wasn't able to test my suspicions against the checkpoints for this repo.
- Everything should work fine simply because, as you can see from the `open_llama_7b_700bt_preview` config, the `vocab_size` is 32k, which is a multiple of 64 (32k / 64 = 500).
- But of course, if `vocab_size` != `padded_vocab_size`, then loading the pretrained weights should fail: https://github.com/Lightning-AI/lit-llama/blob/99695716396eed07245367348a5382e73fad8834/finetune/adapter.py#L94. `load_state_dict` will not try to fill the first `n` elements out of `m` (`n` < `m`, where `n` is the size of the pretrained weights and `m` is the new size). What do I mean: a) for the embeddings, if the old size was 100 and we now pad the size up to 128, then we can simply fill the first 100 rows of the embedding table and it will be fine, because this number (100) is defined by the tokenizer, so elements beyond the max tokenizer index will never be used. The remaining 28 rows can therefore be initialized with any numbers. b) Almost the same is true for the `lm_head`: the only difference is that the remaining weights (28 rows) need to be initialized with zeros, so that the logits for these non-existent tokens are 0, the post-softmax probabilities are also 0, and these 28 tokens are never sampled. But, big but, `load_state_dict` doesn't do this as far as I can see (a sketch of such a partial copy is below). With pretrained weights it's fine, but if someone trained a model from scratch and these changes are then introduced, the old checkpoints become useless.
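To illustrate the point: `load_state_dict` requires shapes to match exactly, so loading a checkpoint whose vocab dimension is smaller than `padded_vocab_size` would need an explicit partial copy. A hypothetical sketch (not code from the repo; the helper name is made up):

```python
import torch


def pad_vocab_weight(pretrained: torch.Tensor, padded_vocab_size: int,
                     zero_init_extra: bool) -> torch.Tensor:
    """Copy the first n rows of a pretrained (n, n_embd) weight into a
    (padded_vocab_size, n_embd) tensor; load_state_dict will not do this
    for you because it requires shapes to match exactly."""
    n, n_embd = pretrained.shape
    assert n <= padded_vocab_size
    if zero_init_extra:
        # for lm_head: extra rows give logits of 0 for tokens the tokenizer can never emit
        new_weight = torch.zeros(padded_vocab_size, n_embd)
    else:
        # for the embedding table: extra rows are never indexed, so any init works
        new_weight = torch.randn(padded_vocab_size, n_embd) * 0.02
    new_weight[:n] = pretrained
    return new_weight


# Example mirroring the 100 -> 128 case above.
vocab_size, n_embd, multiple = 100, 32, 64
padded_vocab_size = ((vocab_size + multiple - 1) // multiple) * multiple  # 128
old_lm_head = torch.randn(vocab_size, n_embd)
new_lm_head = pad_vocab_weight(old_lm_head, padded_vocab_size, zero_init_extra=True)
print(new_lm_head.shape)                     # torch.Size([128, 32])
print(new_lm_head[vocab_size:].abs().sum())  # tensor(0.) -- the extra 28 rows are zeros
```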
I suspect you already knew this or have discussed it, but I wanted to mention it nevertheless.
By the way: padding up to the nearest multiple of 64 is, in my opinion, useful only for the `lm_head`. With the embeddings it's basically an indexing operation, so I don't see how we can gain performance from it.
In the nanoGPT repo it was done for both the embeddings and the `lm_head` because of weight tying --> the weights are shared --> the same shape is needed during the init process.
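For reference, weight tying makes the embedding table and the `lm_head` weight a single shared tensor, which is why both need the same (padded) vocab dimension. A rough sketch of the pattern (illustrative, not copied from nanoGPT):

```python
import torch.nn as nn

padded_vocab_size = 50_304  # e.g. GPT-2's 50_257 padded up to a multiple of 64
n_embd = 768

wte = nn.Embedding(padded_vocab_size, n_embd)               # token embedding table
lm_head = nn.Linear(n_embd, padded_vocab_size, bias=False)  # output projection
lm_head.weight = wte.weight  # weight tying: one shared (vocab, n_embd) tensor
```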
Hello @awaelchli
> Before we land this, I'd like to run the finetuning to make sure it is still training as expected.
Any luck with this?