Custom 4k context length support and converting the model config to a Hugging Face-supported config file
Thanks for your brilliant work!
I would like to train a lit-gpt model with a context length of 4096. I want to confirm that the only thing I need to do is modify the chunk_size key (often 2048 by default) in the config file.
Moreover, is there support for converting the model config (lit-config.json) to the config.json that Hugging Face expects? I found that during the HF transformers from_pretrained() call, I was required to specify the model_type key, and I used 'llama'.
During inference, I got this warning:
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at ./out/tiny_LLaMA_1b_4k_3epoch_8gpus_warpup0_lr1e-4_grad1_config_save_test and are newly initialized: ['model.layers.25.mlp.gate_proj.weight', 'model.layers.26.mlp.up_proj.weight', 'model.layers.26.self_attn.v_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.29.post_attention_layernorm.weight', 'model.layers.31.mlp.down_proj.weight', 'model.layers.24.self_attn.o_proj.weight', 'model.layers.27.mlp.up_proj.weight', 'model.layers.26.self_attn.q_proj.weight', 'model.layers.29.self_attn.q_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.24.input_layernorm.weight', 'model.layers.23.input_layernorm.weight', 'model.layers.24.self_attn.k_proj.weight', 'model.layers.28.post_attention_layernorm.weight', 'model.layers.25.mlp.up_proj.weight', 'model.layers.26.mlp.gate_proj.weight', 'model.layers.23.mlp.up_proj.weight', 'model.layers.23.mlp.down_proj.weight', 'model.layers.30.post_attention_layernorm.weight', 'model.layers.24.mlp.down_proj.weight', 'model.layers.24.self_attn.q_proj.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.25.input_layernorm.weight', 'model.layers.22.mlp.up_proj.weight', 'model.layers.31.input_layernorm.weight', 'model.layers.23.post_attention_layernorm.weight', 'model.layers.31.self_attn.q_proj.weight', 'model.layers.31.self_attn.o_proj.weight', 'model.layers.24.post_attention_layernorm.weight', 'model.layers.25.self_attn.o_proj.weight', 'model.layers.25.self_attn.k_proj.weight', 'model.layers.23.mlp.gate_proj.weight', 'model.layers.27.mlp.down_proj.weight', 'model.layers.27.mlp.gate_proj.weight', 'model.layers.28.input_layernorm.weight', 'model.layers.26.self_attn.k_proj.weight', 'model.layers.30.mlp.up_proj.weight', 'model.layers.28.mlp.down_proj.weight', 'model.layers.30.self_attn.o_proj.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.27.self_attn.k_proj.weight', 'model.layers.28.self_attn.k_proj.weight', 'model.layers.29.self_attn.o_proj.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.30.input_layernorm.weight', 'model.layers.30.self_attn.v_proj.weight', 'model.layers.30.self_attn.q_proj.weight', 'model.layers.26.self_attn.o_proj.weight', 'model.layers.24.mlp.gate_proj.weight', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.28.mlp.up_proj.weight', 'model.layers.30.mlp.down_proj.weight', 'model.layers.29.mlp.gate_proj.weight', 'model.layers.29.self_attn.v_proj.weight', 'model.layers.30.self_attn.k_proj.weight', 'model.layers.25.self_attn.v_proj.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.25.self_attn.q_proj.weight', 'model.layers.30.mlp.gate_proj.weight', 'model.layers.22.mlp.gate_proj.weight', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.29.mlp.up_proj.weight', 'model.layers.28.self_attn.q_proj.weight', 'model.layers.24.mlp.up_proj.weight', 'model.layers.31.mlp.up_proj.weight', 'model.layers.28.mlp.gate_proj.weight', 'model.layers.28.self_attn.o_proj.weight', 'model.layers.22.post_attention_layernorm.weight', 'model.layers.29.self_attn.k_proj.weight', 'model.layers.27.self_attn.q_proj.weight', 'model.layers.31.self_attn.v_proj.weight', 'model.layers.28.self_attn.v_proj.weight', 'model.layers.29.input_layernorm.weight', 'model.layers.31.post_attention_layernorm.weight', 'model.layers.29.mlp.down_proj.weight', 'model.layers.31.self_attn.k_proj.weight', 'model.layers.27.self_attn.v_proj.weight', 'model.layers.27.input_layernorm.weight', 'model.layers.22.mlp.down_proj.weight', 'model.layers.25.post_attention_layernorm.weight', 'model.layers.26.mlp.down_proj.weight', 'model.layers.24.self_attn.v_proj.weight', 'model.layers.31.mlp.gate_proj.weight', 'model.layers.22.self_attn.o_proj.weight', 'model.layers.27.self_attn.o_proj.weight', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.25.mlp.down_proj.weight', 'model.layers.22.input_layernorm.weight', 'model.layers.27.post_attention_layernorm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at ./out/tiny_LLaMA_1b_4k_math_textbooks_markdown_3epoch_8gpus_warpup0_lr1e-4_grad1_config_save_test and are newly initialized because the shapes did not match:
- lm_head.weight: found shape torch.Size([32000, 2048]) in the checkpoint and torch.Size([32000, 4096]) in the model instantiated
- model.embed_tokens.weight: found shape torch.Size([32000, 2048]) in the checkpoint and torch.Size([32000, 4096]) in the model instantiated
- model.layers.0.input_layernorm.weight: found shape torch.Size([2048]) in the checkpoint and torch.Size([4096]) in the model instantiated
- model.layers.0.self_attn.q_proj.weight: found shape torch.Size([2048, 2048]) in the checkpoint and torch.Size([4096, 4096]) in the model instantiated
- model.layers.0.self_attn.k_proj.weight: found shape torch.Size([256, 2048]) in the checkpoint and torch.Size([4096, 4096]) in the model instantiated
- model.layers.0.self_attn.v_proj.weight: found shape torch.Size([256, 2048]) in the checkpoint and torch.Size([4096, 4096]) in the model instantiated
- model.layers.0.self_attn.o_proj.weight: found shape torch.Size([2048, 2048]) in the checkpoint and torch.Size([4096, 4096]) in the model instantiated
- model.layers.0.post_attention_layernorm.weight: found shape torch.Size([2048]) in the checkpoint and torch.Size([4096]) in the model instantiated
- model.layers.0.mlp.gate_proj.weight: found shape torch.Size([5632, 2048]) in the checkpoint and torch.Size([5632, 4096]) in the model instantiated
- model.layers.0.mlp.up_proj.weight: found shape torch.Size([5632, 2048]) in the checkpoint and torch.Size([5632, 4096]) in the model instantiated
...
- model.layers.21.self_attn.o_proj.weight: found shape torch.Size([2048, 2048]) in the checkpoint and torch.Size([4096, 4096]) in the model instantiated
- model.layers.21.post_attention_layernorm.weight: found shape torch.Size([2048]) in the checkpoint and torch.Size([4096]) in the model instantiated
- model.layers.21.mlp.gate_proj.weight: found shape torch.Size([5632, 2048]) in the checkpoint and torch.Size([5632, 4096]) in the model instantiated
- model.layers.21.mlp.up_proj.weight: found shape torch.Size([5632, 2048]) in the checkpoint and torch.Size([5632, 4096]) in the model instantiated
- model.layers.21.mlp.down_proj.weight: found shape torch.Size([2048, 5632]) in the checkpoint and torch.Size([4096, 5632]) in the model instantiated
- model.norm.weight: found shape torch.Size([2048]) in the checkpoint and torch.Size([4096]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Here are my lit-gpt model config and the saved config file used for HF transformers, respectively.
org="StatNLP-research",
name="tiny_LLaMA_1b_4k",
block_size=4096,
vocab_size=32000,
padding_multiple=64,
n_layer=22,
n_head=32,
n_embd=2048,
rotary_percentage=1.0,
parallel_residual=False,
bias=False,
_norm_class="FusedRMSNorm",
norm_eps=1e-5,  # Llama 2 uses 1e-5; Llama 1 uses 1e-6
_mlp_class="LLaMAMLP",
intermediate_size=5632,
n_query_groups=4,
{
"name": "tiny_LLaMA_1b_4k",
"model_type": "llama",
"block_size": 4096,
"max_position_embeddings": 4096,
"vocab_size": 32000,
"padding_multiple": 64,
"padded_vocab_size": 32000,
"n_layer": 22,
"n_head": 32,
"n_embd": 2048,
"rotary_percentage": 1.0,
"parallel_residual": false,
"bias": false,
"n_query_groups": 4,
"shared_attention_norm": false,
"_norm_class": "FusedRMSNorm",
"norm_eps": 1e-05,
"_mlp_class": "LLaMAMLP",
"intermediate_size": 5632,
"condense_ratio": 1
}
Looking forward to your reply. Thanks in advance.
I would like to train a lit-gpt model with a context length of 4096. I want to confirm that the only thing I need to do is modify the chunk_size key (often 2048 by default) in the config file.
Yes!
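For a quick sanity check, something like the following confirms the context length is actually picked up from your config entry. This is only a minimal sketch: it assumes the tiny_LLaMA_1b_4k entry you pasted above is registered in lit_gpt/config.py, and uses the standard lit-gpt Config.from_name accessor.

from lit_gpt.config import Config

# Sketch: verify that the registered config carries the 4k context length.
# In lit-gpt, block_size is the model's context length; the data-packing
# chunk size is typically derived from it, so keep the two consistent.
config = Config.from_name("tiny_LLaMA_1b_4k")
print(config.block_size)  # expected: 4096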
Moreover, is there support for converting the model config (lit-config.json) to the config.json that Hugging Face expects? I found that during the HF transformers from_pretrained() call, I was required to specify the model_type key, and I used 'llama'.
No, it has to be converted manually. HF's config files are freeform, so an automatic conversion wouldn't be reliable for all checkpoints.
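For example, something along these lines can serve as a starting point. This is only a sketch, not an official converter: the keys on the left are the standard HF LlamaConfig field names, the paths are placeholders, and you should double-check every value against your checkpoint. Any key HF does not find in config.json falls back to the LlamaConfig defaults (hidden_size=4096, num_hidden_layers=32, ...), which is exactly what produces the "newly initialized" and shape-mismatch warnings you pasted above, since your checkpoint has n_embd=2048 and n_layer=22.

import json
from pathlib import Path

# Sketch: map a saved lit-gpt config to a hand-written HF config.json.
# Paths are placeholders; adjust them to your output directories.
lit_config = json.loads(Path("out/tiny_LLaMA_1b_4k/lit_config.json").read_text())

hf_config = {
    "architectures": ["LlamaForCausalLM"],
    "model_type": "llama",
    "vocab_size": lit_config["padded_vocab_size"],
    "hidden_size": lit_config["n_embd"],                   # 2048 here, not the 4096 default
    "num_hidden_layers": lit_config["n_layer"],            # 22 here, not the 32 default
    "num_attention_heads": lit_config["n_head"],
    "num_key_value_heads": lit_config["n_query_groups"],   # grouped-query attention
    "intermediate_size": lit_config["intermediate_size"],
    "max_position_embeddings": lit_config["block_size"],
    "rms_norm_eps": lit_config["norm_eps"],
    "hidden_act": "silu",
}

Path("out/tiny_LLaMA_1b_4k_hf/config.json").write_text(json.dumps(hf_config, indent=2))

With hidden_size and num_hidden_layers set explicitly, from_pretrained should stop re-initializing layers 22-31 and stop complaining about 2048-vs-4096 shapes.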
Are you doing the HF conversion with https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/convert_lit_checkpoint.py?
@SinclairCoder, has your problem been resolved?