
Supporting Llama-2 70B param

Open • AmineDiro opened this issue 11 months ago • 6 comments

Hello! First of all, I want to thank you for this amazing crate! It is truly a joy to work on LLMs using Rust 😄.

I recently wrote an API that serves Llama-2 models using this crate.

I have an issue serving the Llama2-70B-GGML model. The 70B Llama-2 model uses grouped query attention (GQA). llama.cpp handles this by specifying the n_gqa param in the model hyperparameters, which feels a little bit hacky 🤔

I would love to work on adding support for n_gqa in this crate. I think it can be added to the Llama model:

/// LLaMA [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))
#[derive(Debug, Default, PartialEq, Eq, Clone, Copy)]
pub struct Hyperparameters {
    // ....
    pub n_head_kv: usize,
}

Here, n_head_kv = n_head / n_gqa (with n_head % n_gqa == 0), and the n_gqa parameter would be passed in ModelParameters as an Option 🤔
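For illustration, a minimal sketch of how that derivation could look at load time (the function name, error handling, and example values are assumptions for this sketch, not the crate's actual API):

// Sketch: derive the number of KV heads from the query-head count and an
// optional user-supplied n_gqa. Names here are illustrative only.
fn derive_n_head_kv(n_head: usize, n_gqa: Option<usize>) -> Result<usize, String> {
    match n_gqa {
        // No override: ordinary multi-head attention, one KV head per query head.
        None => Ok(n_head),
        Some(gqa) => {
            if gqa == 0 || n_head % gqa != 0 {
                return Err(format!("n_head ({n_head}) must be divisible by n_gqa ({gqa})"));
            }
            // Grouped-query attention: every n_gqa query heads share one KV head.
            Ok(n_head / gqa)
        }
    }
}

fn main() {
    // Llama-2 70B: 64 query heads with n_gqa = 8 gives 8 KV heads.
    assert_eq!(derive_n_head_kv(64, Some(8)), Ok(8));
    // Smaller Llama models keep n_head_kv == n_head.
    assert_eq!(derive_n_head_kv(32, None), Ok(32));
}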

Thank you for your help!

AmineDiro • Aug 11 '23, 18:08

If you want, you could create a PR containing these changes. The only thing I don't like about it is that ModelParameters will then contain the n_gqa parameter, which is only used by Llama models and no other architectures, but I guess until GGUF is finalized this is the best way to implement it 🤔
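For illustration, the shape being discussed might look roughly like this until GGUF carries the value in the file itself (a sketch only; the field name and struct contents are assumptions, not the crate's actual definition):

// Sketch: ModelParameters temporarily carrying a Llama-only knob.
#[derive(Debug, Default)]
pub struct ModelParameters {
    // ... existing, architecture-agnostic fields elided ...
    /// Grouping factor for grouped-query attention; only read by the Llama loader.
    pub n_gqa: Option<usize>,
}

fn main() {
    // Callers loading Llama-2 70B would opt in explicitly; everyone else keeps the default.
    let params = ModelParameters { n_gqa: Some(8), ..Default::default() };
    println!("{params:?}");
}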

LLukas22 • Aug 12 '23, 12:08

Yes, I've been considering implementing this but have held off because of the need to parameterise the model just for something that should hopefully be removed in a few weeks.

We actually had a mechanism for this exact purpose before, but I removed it after the model format was updated with a breaking change that removed the specific parameter.

I don't think I want to reintroduce it again - especially as we should never need to do this again - but I would like to support 70B at some point. I'm not sure. Maybe a bodge function on the Llama model that lets you set n_head_kv after the fact?
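For illustration, such a bodge could be as small as a setter that validates the grouping before overriding the loaded value (a sketch only; the struct layout and method name are assumptions):

// Sketch: override the number of KV heads on an already-loaded Llama model.
struct LlamaHyperparameters {
    n_head: usize,
    n_head_kv: usize,
}

struct Llama {
    hyperparameters: LlamaHyperparameters,
}

impl Llama {
    /// Hypothetical bodge: force a KV-head count after the fact, checking that
    /// it evenly divides the query heads.
    fn set_n_head_kv(&mut self, n_head_kv: usize) {
        assert!(
            n_head_kv > 0 && self.hyperparameters.n_head % n_head_kv == 0,
            "n_head must be a multiple of n_head_kv"
        );
        self.hyperparameters.n_head_kv = n_head_kv;
    }
}

fn main() {
    let mut model = Llama {
        hyperparameters: LlamaHyperparameters { n_head: 64, n_head_kv: 64 },
    };
    // Llama-2 70B: 64 query heads grouped into 8 KV heads.
    model.set_n_head_kv(8);
    assert_eq!(model.hyperparameters.n_head_kv, 8);
}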

philpax • Aug 12 '23, 23:08

> I don't think I want to reintroduce it again - especially as we should never need to do this again - but I would like to support 70B at some point. I'm not sure. Maybe a bodge function on the Llama model that lets you set n_head_kv after the fact?

Thanks for the reply! I 100% agree that adding it to ModelParameters feels hacky.

I think that setting it after the fact might be an issue, because you need n_head_kv at load time. Is there a heuristic we could use at load time for Llama GGML models to determine that we need GQA (for example, the size of the file)? Then we could set it to a fixed value (n_gqa = 8 for Llama-2 models) and have it only in the Hyperparameters struct of Llama 🤔
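For illustration, a file-size heuristic along those lines might look like this (a sketch only: the threshold, file name, and the reading of "size of the file" as on-disk size are all assumptions):

use std::fs;
use std::io;
use std::path::Path;

// Sketch: guess whether a pre-GGUF Llama GGML file is a 70B model purely from
// its on-disk size, and if so assume the fixed Llama-2 grouping of n_gqa = 8.
fn guess_n_gqa(model_path: &Path) -> io::Result<Option<usize>> {
    // Assumed threshold: quantized 70B files are tens of GiB, well above the
    // smaller models. This is a heuristic, not a guarantee.
    const SEVENTY_B_THRESHOLD_BYTES: u64 = 25 * 1024 * 1024 * 1024;
    let size = fs::metadata(model_path)?.len();
    Ok(if size > SEVENTY_B_THRESHOLD_BYTES {
        Some(8) // fixed grouping for Llama-2 70B
    } else {
        None // smaller models: plain multi-head attention
    })
}

fn main() {
    let path = Path::new("llama-2-70b.ggmlv3.q4_0.bin"); // hypothetical file name
    match guess_n_gqa(path) {
        Ok(Some(n_gqa)) => println!("assuming grouped-query attention, n_gqa = {n_gqa}"),
        Ok(None) => println!("assuming standard multi-head attention"),
        Err(e) => println!("could not inspect model file: {e}"),
    }
}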

AmineDiro • Aug 14 '23, 15:08

In my opinion we should just hack it in for now; GGUF seems to be nearly ready, meaning when we implement it we can clean up the implementation. The most important thing is to get grouped query attention working in the inference code.
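For context on what that involves, the core of GQA at inference time is that each KV head is shared by a group of query heads; a minimal sketch of that mapping (independent of this crate's actual tensor code):

// Sketch: which KV head a given query head attends over under grouped-query
// attention, where n_head query heads share n_head_kv KV heads.
fn kv_head_for_query_head(query_head: usize, n_head: usize, n_head_kv: usize) -> usize {
    assert!(n_head % n_head_kv == 0, "n_head must be a multiple of n_head_kv");
    let group_size = n_head / n_head_kv; // == n_gqa
    query_head / group_size
}

fn main() {
    // Llama-2 70B: 64 query heads share 8 KV heads, so query heads 0..8 all
    // read KV head 0, heads 8..16 read KV head 1, and so on.
    assert_eq!(kv_head_for_query_head(0, 64, 8), 0);
    assert_eq!(kv_head_for_query_head(15, 64, 8), 1);
    assert_eq!(kv_head_for_query_head(63, 64, 8), 7);
}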

LLukas22 • Aug 14 '23, 17:08

Can this be closed since #409 was merged? Is full support for Llama 2 available now?

liamwh • Oct 28 '23, 14:10

It should work, but it's not ideal, especially as GGUF support hasn't landed yet. It should be OK if you find a 70B model prior to GGUF and pass in the right number of GQA heads.

I'm leaving this open in the meantime as I think we should only really close it when it requires no additional work to use over the smaller models.

philpax • Oct 28 '23, 17:10