
Add OLMo: 1B & 7B

rasbt opened this issue 1 year ago • 14 comments

Adds the popular and fully open-source OLMo models by Allen AI.

  • [x] Implement model download
  • [x] Test tokenizer
  • [x] Implement HF checkpoint conversion
  • [x] Clean up HF checkpoint conversion
  • [x] Make sure to use the right layer normalization
  • [x] Make sure generate.py produces reasonable outputs
  • [ ] Update download and finetuning docs
  • [ ] Test pretraining
  • [ ] Test finetuning
    • [ ] Full finetuning
    • [ ] LoRA
    • [ ] Adapter
  • [x] Add tests
  • [x] Update README

Fixes #925

rasbt commented on Feb 12 '24

I'm a bit stuck with the conversion and would appreciate your advice and ideas @carmocca or @Andrei-Aksionov!

So, here are 3 special things about Olmo:

  1. They used weight tying like in GPT-2: the wte weight is reused as the output projection weight. In the tensors they saved on the Hub, though, they simply duplicated that tensor, so there shouldn't be any action required. When loading the model in HuggingFace, I checked that olmo.model.transformer.wte.weight and olmo.model.transformer.ff_out.weight contain the same tensor. That should be all good here.

  2. They use a non-parametric LayerNorm, i.e., their LayerNorm doesn't have the scale (weight) and shift (bias) parameters. To avoid code changes just for this model, my workaround is to use ones and zeros so that these parameters have no effect (see the sanity-check sketch right after this list):

        state_dict[f"transformer.h.{l}.norm_1.weight"] = torch.ones(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_2.weight"] = torch.ones(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_1.bias"] = torch.zeros(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_2.bias"] = torch.zeros(config.n_embd)
  3. The problem is that I'm missing weights ...
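
As a sanity check for point 2, here's a minimal standalone sketch (plain PyTorch, not the conversion code) confirming that a parametric LayerNorm with weight set to ones and bias set to zeros matches a non-parametric LayerNorm numerically:

    import torch

    n_embd = 2048
    x = torch.randn(4, n_embd)

    # OLMo-style LayerNorm: no learnable scale/shift
    ln_plain = torch.nn.LayerNorm(n_embd, elementwise_affine=False)

    # Parametric LayerNorm with the workaround values from above
    ln_param = torch.nn.LayerNorm(n_embd)
    with torch.no_grad():
        ln_param.weight.fill_(1.0)
        ln_param.bias.fill_(0.0)

    assert torch.allclose(ln_param(x), ln_plain(x), atol=1e-6)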

The HF version looks like this, which is confusing because, as far as I can tell, the ops are not applied in this "sequential" order:

OLMoForCausalLM(
  (model): Olmo(
    (transformer): ModuleDict(
      (wte): Embedding(50304, 2048)
      (emb_drop): Dropout(p=0.0, inplace=False)
      (ln_f): LayerNorm()
      (blocks): ModuleList(
        (0-15): 16 x OlmoSequentialBlock(
          (dropout): Dropout(p=0.0, inplace=False)
          (act): SwiGLU()
          (attn_out): Linear(in_features=2048, out_features=2048, bias=False)
          (ff_out): Linear(in_features=8192, out_features=2048, bias=False)
          (rotary_emb): RotaryEmbedding()
          (attn_norm): LayerNorm()
          (ff_norm): LayerNorm()
          (att_proj): Linear(in_features=2048, out_features=6144, bias=False)
          (ff_proj): Linear(in_features=2048, out_features=16384, bias=False)
        )
      )
      (ff_out): Embedding(50304, 2048)
    )
  )
)

Unless I'm wrong, I think what happens is that ff_proj is a placeholder for the MLP FC1 and FC2 layers, i.e., the first half is FC1 and the second half is FC2. It's kind of confusing though.

What I'm thinking is that we have to split the fc weights during conversion, which would save us from writing custom code in the GPT model class:

    weight_map = {
        "model.transformer.wte.weight": "transformer.wte.weight",
        "model.transformer.ff_out.weight": "lm_head.weight",
        "model.transformer.blocks.{}.attn_out.weight": "transformer.h.{}.attn.proj.weight",
        "model.transformer.blocks.{}.ff_proj.weight": "transformer.h.{}.mlp.fc_1.weight", # split into fc1 and fc2
        "model.transformer.blocks.{}.att_proj.weight": "transformer.h.{}.attn.attn.weight",
        "model.transformer.blocks.{}.ff_out.weight": "transformer.h.{}.mlp.proj.weight",
    }
...

    for l in range(config.n_layer):
        # split point is the MLP hidden size (intermediate_size, 8192 for the 1B), not n_embd
        state_dict[f"transformer.h.{l}.mlp.fc_2.weight"] = state_dict[f"transformer.h.{l}.mlp.fc_1.weight"][config.intermediate_size:]
        state_dict[f"transformer.h.{l}.mlp.fc_1.weight"] = state_dict[f"transformer.h.{l}.mlp.fc_1.weight"][:config.intermediate_size]

Is this somehow possible with the lit-gpt SaveProxyTensor?
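
For reference, here's a quick standalone shape check of the split I have in mind (plain torch, sizes taken from the 1B dump above; which half should become fc_1 vs. fc_2 still needs double-checking):

    import torch

    n_embd, intermediate_size = 2048, 8192
    ff_proj_weight = torch.empty(2 * intermediate_size, n_embd)  # (16384, 2048), as in the 1B model

    first_half, second_half = ff_proj_weight.chunk(2, dim=0)
    # each half has the shape expected for fc_1 / fc_2 and for ff_out's input
    assert first_half.shape == second_half.shape == (intermediate_size, n_embd)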

rasbt commented on Feb 13 '24

Hey! All your suggestions make sense to me. You should be able to split the combined ff linear as you suggest, especially if load_param has been called already. We also manipulate the qkv linears for llama2 checkpoints in a similar way.

However, note that your workarounds will only work for inference. During training, wte and ff_out will not be tied and the layernorm parameters won't be frozen.
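
For context, weight tying during training simply means the embedding and the output projection share one Parameter, so gradients from both paths accumulate in the same tensor. A minimal sketch in plain PyTorch (not lit-gpt code; the module names are stand-ins):

    import torch

    vocab_size, n_embd = 50304, 2048
    wte = torch.nn.Embedding(vocab_size, n_embd)
    lm_head = torch.nn.Linear(n_embd, vocab_size, bias=False)

    lm_head.weight = wte.weight  # tied: a single shared parameter
    assert lm_head.weight.data_ptr() == wte.weight.data_ptr()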

carmocca commented on Feb 13 '24

Hello @rasbt

Looks like you are correct. I just wanted to add a couple of things that I've noticed while reviewing their code. For posterity, so to speak.

I also don't like it when layers are not initialized in the order they are executed. Lit-GPT does it too: first we create lm_head and only then the transformer layers 🙃.

So the order of execution is as follows:

# Attention
1. attn_norm
2. att_proj (2048, 6144) <- combined QKV
3. rotary_emb
4. attn_out (2048, 2048)
5. dropout
# MLP
1. ff_norm
2. ff_proj (2048, 16384) <- combined [fc_2, fc_1] / [up, gate] in LLaMA notation
3. act
4. ff_out (8192, 2048)
5. dropout
  1. Yes, they use weight_tying. It's configurable and they decided to use it. And yes, it won't work during training. Although it's not difficult to add if more models use it.
  2. Their LayerNorm class supports weight and bias parameters, but it's controlled by the config. It looks like they turned off .weight and .bias per config.
  3. This deserves a bit more explanation. In the LLaMAMLP class we have fc_1, fc_2, and proj. During the forward pass we apply fc_1 and fc_2 to the input separately: https://github.com/Lightning-AI/lit-gpt/blob/f5d68065ff621fc2cc190c05dcc4ab2cda1d1f57/lit_gpt/model.py#L286-L290

OLMo has only two layers: ff_proj and ff_out. They took an approach similar to a combined QKV matrix and created an ff_proj layer that does this matmul in one go. But the place where they split the result is, I would say, unexpected: in the activation function:

def forward(self, x: torch.Tensor) -> torch.Tensor:
    x, gate = x.chunk(2, dim=-1)
    return F.silu(gate) * x

and then they apply ff_out to it.

It's important to note how they split and then apply the activation function to one chunk. That means: ff_proj == [fc_2, fc_1] and ff_out == proj.
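
Here's a small numerical check of that claim (a standalone sketch with toy sizes, not lit-gpt code): splitting ff_proj's output inside the activation is equivalent to a LLaMAMLP-style fc_1/fc_2 pair, where fc_2 takes the first half of ff_proj's weight and fc_1 the second half.

    import torch
    import torch.nn.functional as F

    d, hidden = 8, 16  # toy stand-ins for n_embd and intermediate_size
    ff_proj = torch.nn.Linear(d, 2 * hidden, bias=False)
    ff_out = torch.nn.Linear(hidden, d, bias=False)
    x = torch.randn(2, d)

    # OLMo order: ff_proj -> chunk inside the activation -> silu(gate) * x -> ff_out
    h = ff_proj(x)
    a, gate = h.chunk(2, dim=-1)
    olmo_out = ff_out(F.silu(gate) * a)

    # LLaMAMLP-style equivalent: fc_2 is the first half of ff_proj's weight, fc_1 the second half
    fc_2_weight, fc_1_weight = ff_proj.weight.chunk(2, dim=0)
    llama_out = ff_out(F.silu(x @ fc_1_weight.T) * (x @ fc_2_weight.T))

    assert torch.allclose(olmo_out, llama_out, atol=1e-6)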

Andrei-Aksionov commented on Feb 14 '24

Thanks so much for the feedback @carmocca and @Andrei-Aksionov, this was super helpful! After more tinkering, I went with a custom OLMoMLP (analogous to LLaMAMLP) because I thought this was easier than the other workarounds -- both from an implementation perspective and for code readability in the future.

The weights load ok now, but for some reason, the results are garbage. E.g., for

python generate/base.py --checkpoint_dir ./checkpoints/allenai/OLMo-1b/

What food do llamas eat?lerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslers

And

python generate/base.py --checkpoint_dir ./checkpoints/allenai/OLMo-7b/

What food do llamas eat? nic except ' Up Area has , climate * new area even county bun dressingall Bul Index millions Di withdrawal intent except / bun ID tonnes approve welcome St/ regimes health ng est African worse Multiple; p; ques Up ( IL'' Area / p

rasbt commented on Feb 14 '24

Yes, they use weight_tying. It's configurable and they decided to use it. And yes, it won't work during training. Although it's not difficult to add if more models use it.

Actually, upon further inspection, they only use weight tying for the 1B model (https://huggingface.co/allenai/OLMo-1B/blob/main/config.json#L42), not for the 7B model (https://huggingface.co/allenai/OLMo-7B/blob/main/config.json#L42). I adjusted the code accordingly. Still not working well though.
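
For anyone who wants to double-check, the flag can be read straight from the config files (a small sketch assuming the key is named weight_tying, as in the linked config.json lines):

    import json
    from huggingface_hub import hf_hub_download

    for repo in ("allenai/OLMo-1B", "allenai/OLMo-7B"):
        with open(hf_hub_download(repo, "config.json")) as f:
            print(repo, json.load(f).get("weight_tying"))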

rasbt commented on Feb 14 '24

I would strongly prefer that we don't add this new MLP class.

To debug the output, you'll have to inspect the activations for both models layer by layer to see where they diverge.

carmocca commented on Feb 14 '24

I would strongly prefer that we don't add this new MLP class.

Ok! Maybe let's leave it in there until we get it to work, and then we can refactor it into one of the existing classes somehow.

rasbt commented on Feb 14 '24

Just to add a note about pinpointing the difference. With Carlos's help, we found that the difference currently is in how the QKV matrix is split into queries, keys, and values.

https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/model.py#L195-L202

and

https://github.com/allenai/OLMo/blob/main/olmo/model.py#L687 https://github.com/allenai/OLMo/blob/main/olmo/model.py#L559-L571

In Lit-GPT, the Q, K, and V are interleaved (to also support MQA) whereas in OLMo, QKV are not interleaved.

We could potentially accommodate OLMo in Lit-GPT if we apply the steps used for Llama in the conversion script, but in reverse: https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/convert_hf_checkpoint.py#L182-L186
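
Roughly, the conversion would have to reorder OLMo's fused QKV weight from an [all Q | all K | all V] layout into the per-head interleaved [q_i, k_i, v_i] layout. A toy sketch of the idea (assuming plain multi-head attention, i.e. n_query_groups == n_head; this is not the actual conversion code):

    import torch

    n_head, head_size = 4, 8
    n_embd = n_head * head_size
    qkv = torch.randn(3 * n_embd, n_embd)  # OLMo-style fused att_proj weight

    q, k, v = qkv.split(n_embd, dim=0)
    q = q.view(n_head, head_size, n_embd)
    k = k.view(n_head, head_size, n_embd)
    v = v.view(n_head, head_size, n_embd)

    # rows become [q_0, k_0, v_0, q_1, k_1, v_1, ...]
    interleaved = torch.cat((q, k, v), dim=1).reshape(3 * n_embd, n_embd)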

rasbt commented on Feb 14 '24

Hi there! What is the status of OLMo support? I see that this PR relates to https://github.com/Lightning-AI/litgpt/pull/1013 regarding the qkv implementation.

jwkirchenbauer commented on Jun 29 '24

Hello @jwkirchenbauer. I think #1341 is architecturally similar to OLMo, so after it is merged it should be much easier to implement OLMo. Sometime next week, maybe. It depends on when that PR is merged; plus, I guess Gemma 2 is the priority right now. But I'll do my best to implement it as soon as possible since there is still interest 😉.

#1013 is a breaking change, so it might take time for reviewers to do their part.

Andrei-Aksionov commented on Jun 29 '24

If this PR gets revived sometime, we should check out the qkv_reassemble function from #1341.

rasbt commented on Jul 01 '24

A tale of a PR that rose from the ashes. 🙂

I tested both models; they generate normal-looking text, though they cannot answer a question. As I understand it, these aren't finetuned models, so that's most likely why.

@rasbt I didn't try to finetune it, but it looks like there are no significant architectural changes, so it should be ok.

cc @jwkirchenbauer Finally made time.

Andrei-Aksionov commented on Jul 28 '24

Wow thanks for resurrecting it and pushing it forward!

rasbt commented on Jul 29 '24

It looks like the instruct variant expects a prompt in a specific format:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-Instruct-hf")

message = [{"role": "user", "content": "{prompt}"}]
print(repr(tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)))
# '<|endoftext|><|user|>\n{prompt}\n<|assistant|>\n'

Andrei-Aksionov commented on Jul 30 '24