
Conversion to ggml format

Open H4dr1en opened this issue 2 years ago • 8 comments

Could you provide a script to convert a model from the Lit-LLaMA format back to the original format, so that it can be used in llama.cpp? The Lit-LLaMA format is not supported by llama.cpp.

The /scripts/convert_hf_checkpoint.py script renames some layers to transformer.* and reshapes others (turning [Q1, K1, V1, Q2, K2, V2, ...] into [Q1, Q2, ..., K1, K2, ..., V1, V2, ...]). This breaks the conversion with llama.cpp.
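For reference, the reshape itself looks straightforward to undo; a rough sketch (assuming a fused weight of shape (3*n_embd, n_embd) whose rows are grouped as all Q heads, then all K heads, then all V heads, and ignoring any other permutation the script may apply):

import torch

def interleave_qkv(w: torch.Tensor, n_head: int) -> torch.Tensor:
    # rows of w are grouped as [Q1..Qh, K1..Kh, V1..Vh]; return the per-head
    # interleaved layout [Q1, K1, V1, Q2, K2, V2, ...]
    n_embd = w.shape[1]
    head_size = n_embd // n_head
    w = w.view(3, n_head, head_size, n_embd)   # (qkv group, head, rows per head, n_embd)
    w = w.transpose(0, 1).contiguous()         # (head, qkv group, rows per head, n_embd)
    return w.view(3 * n_embd, n_embd)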

H4dr1en avatar May 24 '23 11:05 H4dr1en

Another option would be a conversion to the HF format (already requested in https://github.com/Lightning-AI/lit-llama/issues/150), since the ggml conversion already supports it: https://github.com/ggerganov/llama.cpp/blob/ac7876ac20124a15a44fd6317721ff1aa2538806/convert.py#L594

carmocca avatar May 24 '23 15:05 carmocca

Yes, that would work as well. Why not use that format in the first place (i.e., why introduce a format specific to this repo)?

H4dr1en avatar May 25 '23 14:05 H4dr1en

The format is defined by the nn.Module definition. Since we provide our own implementation, the keys are different.
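As a toy illustration (the attribute names below are just for illustration, not the real lit-llama modules), state_dict keys come directly from the attribute names in the module definition, so two modules computing the same thing produce different checkpoint keys:

import torch.nn as nn

class MetaStyle(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.tok_embeddings = nn.Embedding(8, 4)  # original-style attribute name

class LitStyle(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.wte = nn.Embedding(8, 4)  # lit-llama-style attribute name

print(list(MetaStyle().state_dict()))  # ['tok_embeddings.weight']
print(list(LitStyle().state_dict()))   # ['wte.weight']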

carmocca avatar May 25 '23 15:05 carmocca

It is fairly easy to convert the weights to a format that will work with llama.cpp.

Just do the exact opposite of what this script does: https://github.com/Lightning-AI/lit-llama/blob/main/scripts/convert_checkpoint.py

In my case, I fine-tuned with lit-llama LoRA, merged the weights, converted them back, quantized with llama.cpp, and it works like a charm :)

sanjarbek16 avatar May 26 '23 01:05 sanjarbek16

Hi @sanjarbek16, thanks for sharing your experience. I would like to do something similar to what you have done. Would you mind sharing the script you used to convert it back? Thanks in advance!

joaopalotti avatar May 29 '23 15:05 joaopalotti

import gc
import torch
from pathlib import Path
from typing import Dict

def reverse_convert_state_dict(state_dict: Dict[str, torch.Tensor], dtype: torch.dtype = torch.float32) -> Dict[str, torch.Tensor]:
    reversed_dict = {}
    # the embedding, LM head and final norm map one-to-one; only the key names change
    reversed_dict["tok_embeddings.weight"] = state_dict["transformer.wte.weight"].to(dtype)
    reversed_dict["output.weight"] = state_dict["lm_head.weight"].to(dtype)
    reversed_dict["norm.weight"] = state_dict["transformer.ln_f.scale"].to(dtype)

    for layer_idx in sorted(set(k.split(".")[2] for k in state_dict if k.startswith("transformer.h"))):
        # attention: lit-llama fuses Q, K and V into a single c_attn weight stacked along dim 0,
        # so splitting it into equal thirds recovers wq, wk and wv
        c_attn_weight = state_dict[f"transformer.h.{layer_idx}.attn.c_attn.weight"].to(dtype)
        c_attn_len = c_attn_weight.shape[0] // 3
        reversed_dict[f"layers.{layer_idx}.attention.wq.weight"] = c_attn_weight[:c_attn_len]
        reversed_dict[f"layers.{layer_idx}.attention.wk.weight"] = c_attn_weight[c_attn_len:2 * c_attn_len]
        reversed_dict[f"layers.{layer_idx}.attention.wv.weight"] = c_attn_weight[2 * c_attn_len:]

        reversed_dict[f"layers.{layer_idx}.attention.wo.weight"] = state_dict[
            f"transformer.h.{layer_idx}.attn.c_proj.weight"
        ].to(dtype)
        # mlp
        reversed_dict[f"layers.{layer_idx}.feed_forward.w1.weight"] = state_dict[
            f"transformer.h.{layer_idx}.mlp.c_fc1.weight"
        ].to(dtype)
        reversed_dict[f"layers.{layer_idx}.feed_forward.w2.weight"] = state_dict[
            f"transformer.h.{layer_idx}.mlp.c_proj.weight"
        ].to(dtype)
        reversed_dict[f"layers.{layer_idx}.feed_forward.w3.weight"] = state_dict[
            f"transformer.h.{layer_idx}.mlp.c_fc2.weight"
        ].to(dtype)
        # rms norm
        reversed_dict[f"layers.{layer_idx}.attention_norm.weight"] = state_dict[f"transformer.h.{layer_idx}.rms_1.scale"].to(dtype)
        reversed_dict[f"layers.{layer_idx}.ffn_norm.weight"] = state_dict[f"transformer.h.{layer_idx}.rms_2.scale"].to(dtype)
    return reversed_dict

def reverse_meta_weights_for_nano_model(
    *,
    input_dir: Path = Path("checkpoints/merged"),
    output_dir: Path = Path("checkpoints/merged/reversed_model/"),
    model_size: str = "7B",
    dtype: str = "float32",
) -> None:
    # input_dir = input_dir / model_size
    # output_dir = output_dir / model_size
    output_dir.mkdir(parents=True, exist_ok=True)

    dt = getattr(torch, dtype, None)
    if not isinstance(dt, torch.dtype):
        raise ValueError(f"{dtype} is not a valid dtype.")
    dtype = dt

    # Load the lit-llama checkpoint; despite the parameter name, input_dir must point to the merged .pth file
    converted_checkpoint = torch.load(input_dir, map_location="cpu")

    # Reverse the conversion
    reversed_checkpoint = reverse_convert_state_dict(converted_checkpoint, dtype=dtype)

    # Save the reversed checkpoint
    torch.save(reversed_checkpoint, output_dir / "consolidated.00.pth")

    # del converted_checkpoint
    # del reversed_checkpoint
    gc.collect()

if __name__ == "__main__":
    from jsonargparse import CLI

    CLI(reverse_meta_weights_for_nano_model)

The above code worked for me. It does the exact opposite of convert_checkpoint.py. If you change dtype to float16, the resulting file will be the same size as the original LLaMA weights.
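If you prefer calling it from Python instead of through the jsonargparse CLI, something like this should work (the paths are placeholders; input_dir must point to the merged checkpoint file):

from pathlib import Path

# assuming the snippet above is saved as reverse_convert.py next to this file
from reverse_convert import reverse_meta_weights_for_nano_model

reverse_meta_weights_for_nano_model(
    input_dir=Path("checkpoints/merged/lit-llama-merged.pth"),  # placeholder: merged lit-llama checkpoint
    output_dir=Path("checkpoints/merged/reversed_model"),
    dtype="float16",
)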

sanjarbek16 avatar May 30 '23 05:05 sanjarbek16

Thank you very much @sanjarbek16, worked pretty well here as well! 👍

joaopalotti avatar May 30 '23 21:05 joaopalotti

Going to try this on the weekend, you are a life saver @sanjarbek16 !!

ExcPoint avatar Sep 08 '23 23:09 ExcPoint