
Model Config settings for Llama-based architectures

Open jamesoneill12 opened this issue 1 year ago • 8 comments

Hi there,

Thanks for creating this repo. I wanted to know what the config should be for Llama-2-7b-chat-hf, given that it's the below for the gpt and opt architectures:

    "gpt": {
        "path_to_blocks": ["transformer", "h"],
        "child_ref_in_parent_forward": ["transformer", "block"],
    },
    "opt": {
        "path_to_blocks": ["model", "decoder", "layers"],
        "child_ref_in_parent_forward": ["model.decoder", "decoder", "decoder_layer"],
    }

I think it's something close to

    "llama": {
        "path_to_blocks": ["model", "layers"],
        "child_ref_in_parent_forward": ["model", "decoder_layer"], 
    }

but I'm running into the following error:

File "/GPTFast/Helpers/Class/add_str_as_func.py", line 9, in add_str_as_func func_code = compile(complete_func_str, "", "exec") File "", line 19 input_pos: Optional[torch.Tensor] = None

So the parsing of the code string seems to be getting incorrectly matched at "decoder_layer". Any help getting this to work on the Llama architectures with this code would be appreciated.
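
For debugging, a small hypothetical helper (not part of GPTFast, just a sketch) that dumps the generated function source with line numbers before compiling it could help pinpoint the line that trips the SyntaxError:

    # Hypothetical debugging helper, not GPTFast code: list the generated source
    # with line numbers, then compile it the same way add_str_as_func does, so a
    # SyntaxError can be traced back to the exact generated line.
    def compile_with_listing(func_str: str, filename: str = "<generated>"):
        for lineno, line in enumerate(func_str.splitlines(), start=1):
            print(f"{lineno:3d}: {line}")
        return compile(func_str, filename, "exec")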

jamesoneill12 avatar Apr 05 '24 12:04 jamesoneill12

Hey James, Llama actually already supports static key-value caching natively within transformers. Will put up a fix in the next few days so that models with static key-value caching natively enabled can also integrate into GPTFast.
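
In the meantime, here is a rough sketch of what the native path looks like (assuming a recent transformers release, roughly v4.38+; the exact API may differ by version, and this is not GPTFast code):

    # Rough sketch, assuming transformers ~v4.38+ where the static KV cache is
    # exposed natively for Llama; exact flags may vary by version.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # A statically shaped KV cache avoids recompilation when paired with torch.compile.
    model.generation_config.cache_implementation = "static"
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

    inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(out[0], skip_special_tokens=True))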

MDK8888 avatar Apr 05 '24 17:04 MDK8888

Oh that's awesome! Not completely related, but I've noticed meta-llama/LlamaGuard-7b is super fast out of the box for guardrailing (0.09-0.13 s inference for 100 max new tokens with an input length of 400 tokens, single sample, A100 80GB GPU, bfloat16 dtype), but I'm not seeing the same on other Llama architectures such as Llama-2-7b-chat-hf. Do you know if some of the Llama architectures have some inference optimization behind the scenes apart from KV caching?
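
For context, this is roughly how I'm timing it (a sketch; the prompt here is placeholder text and the generation settings are assumptions):

    # Rough timing sketch for the numbers above; prompt is placeholder text and
    # generation settings are assumptions, not an exact reproduction.
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/LlamaGuard-7b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="cuda"
    )

    prompt = "some input text " * 100  # padded out to roughly 400 tokens
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=400).to("cuda")

    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=100)
    torch.cuda.synchronize()
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{time.perf_counter() - start:.2f}s for {new_tokens} new tokens")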

jamesoneill12 avatar Apr 05 '24 21:04 jamesoneill12

Hey, apologies for the late response! That is very interesting indeed; I would have to investigate how LlamaGuard-7b works under the hood to answer :)

MDK8888 avatar Apr 06 '24 22:04 MDK8888

No problem! That would be great actually, even if it's supported in Transformers.

jamesoneill12 avatar Apr 07 '24 21:04 jamesoneill12

Hey James, this week is incredibly busy for me. I will do my best to have a new branch with the fixes up this weekend, if not, early next week.

MDK8888 avatar Apr 10 '24 23:04 MDK8888

No problem at all, can't wait for the release!

jamesoneill12 avatar Apr 12 '24 09:04 jamesoneill12

Hey James, I just pushed up my changes on the branch LlamaIntegration. The example for how it works with TinyLlama is under Examples.llama, but I don't have the GPU bandwidth to test on larger models. Let me know if my changes work with the specific Llama model that you had in mind, and I'll fix it asap if not. Thanks once again for pointing this out to me :)

MDK8888 avatar Apr 15 '24 00:04 MDK8888

Fantastic @MDK8888 !! Can't wait to try this out, I'll let you know if there's anything to report on the larger Llama-based architectures.

jamesoneill12 avatar Apr 15 '24 14:04 jamesoneill12