lit-llama
Adapter small fix
Hi there 👋
As @carmocca mentioned in PR #352, a few code changes need to be made:
- Change `self.n_embd` --> `C`, since this value is extracted from the shape of the input variable `x` right at the beginning of the `forward` method (see the sketch after this list).
- Prettify the reshaping of the prefix.
- This one is a biggie: `vocab_size` --> `padded_vocab_size`, to align it with `lit_llama/model.py`. I assume checkpoints won't go south after this, since it is just an expansion in size for better performance (I believe up to 25%). With shrinkage it would be a whole other story.
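For the first item, here is a minimal sketch of the pattern (illustrative only, not the actual `lit_llama/adapter.py` code; the module and parameter names are assumptions): `C` is read from the shape of `x` at the top of `forward`, so `self.n_embd` is redundant inside the method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Sketch of a causal self-attention block; names are illustrative."""

    def __init__(self, n_embd: int, n_head: int) -> None:
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.c_proj = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # B: batch size, T: sequence length, C: embedding size.
        # C comes from the input shape, so self.n_embd is not needed here.
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, C)
        return self.c_proj(y)
```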
Despite seeing all the green lights for merging, please don't merge it just yet. Tomorrow I want to check how the weights are copied, to be extra sure. Right now I don't feel confident that all the adapter weights will be copied without issues and that the model will behave as expected.
Thanks. Before we land this, I'd like to run the finetuning to make sure it is still training as expected. I'll do that in the next day or so.
I don't have a GPU (yeah, I know 😄), so I want to apologize in advance for any stupid questions/suggestions. Basically, the problem is that I wasn't able to test my suspicions against the checkpoints for this repo.
- Everything should work fine simply because, as you can see from the `open_llama_7b_700bt_preview` config, the `vocab_size` is 32k, which is a multiple of 64 (32k / 64 = 500).
- But of course, if `vocab_size` != `padded_vocab_size`, then loading the pretrained weights should fail: https://github.com/Lightning-AI/lit-llama/blob/99695716396eed07245367348a5382e73fad8834/finetune/adapter.py#L94. `load_state_dict` will not try to fill the first `n` elements out of `m` (`n` < `m`, where `n` is the size of the pretrained weights and `m` is the new size). What do I mean: a) for the embeddings, if the old size was 100 and we now pad the size up to 128, then we can simply fill the first 100 rows of the embedding table and it will be fine, because this number (100) is defined by the tokenizer, so elements beyond the max tokenizer index will never be used. The remaining 28 rows can therefore be initialized with any numbers. b) Almost the same is true for the `lm_head`: the only difference is that the remaining weights (28 rows) need to be initialized with zeros, so that the logits for these non-existent tokens are 0, the post-softmax probabilities are also 0, and these 28 tokens are never sampled. But, big but, `load_state_dict` doesn't do this as far as I can see (a sketch of such a partial copy is below). With pretrained weights it's fine, but if someone trained a model from scratch and these changes are then introduced, the old checkpoints become useless.
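To illustrate the point: `load_state_dict` requires shapes to match exactly, so loading a checkpoint whose vocab dimension is smaller than `padded_vocab_size` would need an explicit partial copy. A hypothetical sketch (not code from the repo; the helper name is made up):

```python
import torch


def pad_vocab_weight(pretrained: torch.Tensor, padded_vocab_size: int,
                     zero_init_extra: bool) -> torch.Tensor:
    """Copy the first n rows of a pretrained (n, n_embd) weight into a
    (padded_vocab_size, n_embd) tensor; load_state_dict will not do this
    for you because it requires shapes to match exactly."""
    n, n_embd = pretrained.shape
    assert n <= padded_vocab_size
    if zero_init_extra:
        # for lm_head: extra rows give logits of 0 for tokens the tokenizer can never emit
        new_weight = torch.zeros(padded_vocab_size, n_embd)
    else:
        # for the embedding table: extra rows are never indexed, so any init works
        new_weight = torch.randn(padded_vocab_size, n_embd) * 0.02
    new_weight[:n] = pretrained
    return new_weight


# Example mirroring the 100 -> 128 case above.
vocab_size, n_embd, multiple = 100, 32, 64
padded_vocab_size = ((vocab_size + multiple - 1) // multiple) * multiple  # 128
old_lm_head = torch.randn(vocab_size, n_embd)
new_lm_head = pad_vocab_weight(old_lm_head, padded_vocab_size, zero_init_extra=True)
print(new_lm_head.shape)                     # torch.Size([128, 32])
print(new_lm_head[vocab_size:].abs().sum())  # tensor(0.) -- the extra 28 rows are zeros
```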
I suspect you already knew this or have discussed it, but I wanted to mention it nevertheless.
By the way: padding up to the nearest multiple of 64 is, in my opinion, useful only for the `lm_head`. With the embeddings it's basically an indexing operation, so I don't see how we can gain performance from it.
In the nanoGPT repo it was done for both the embeddings and the `lm_head` because of weight tying --> the weights are shared --> the same shape is needed during the init process.
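For reference, weight tying makes the embedding table and the `lm_head` weight a single shared tensor, which is why both need the same (padded) vocab dimension. A rough sketch of the pattern (illustrative, not copied from nanoGPT):

```python
import torch.nn as nn

padded_vocab_size = 50_304  # e.g. GPT-2's 50_257 padded up to a multiple of 64
n_embd = 768

wte = nn.Embedding(padded_vocab_size, n_embd)               # token embedding table
lm_head = nn.Linear(n_embd, padded_vocab_size, bias=False)  # output projection
lm_head.weight = wte.weight  # weight tying: one shared (vocab, n_embd) tensor
```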
Hello @awaelchli
> Before we land this, I'd like to run the finetuning to make sure it is still training as expected.
Any luck with this?