GGUF file format specification

Open philpax opened this issue 2 years ago • 28 comments

Closes #220.

Rendered: https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md

Defines a complete specification for the proposed GGUF file format, which should generically describe models to be loaded by any compatible executor.

This is a first draft, so there's still some work that needs to be done - I need to fill in the TODOs and clarify a few things. If you have any suggestions for what should go in the TODOs, please let me know!

Changes from the version in the issue include:

  • changing of several of the key-value pairs, including splitting them out into per-architecture key-values
  • decoupling tensor info from tensor data, and aligning both
  • moving the embedded vocabulary into the metadata, so that it is no longer special-cased

philpax avatar Jun 25 '23 22:06 philpax

general question, is this format only for LLMs ? what about vision stuff and multiple models in one file? eg. https://github.com/monatis/clip.cpp does that.

Nope! LLMs are just the use-case I'm familiar with. We should describe whisper.cpp here and discuss standardising the others too (this is the first I've heard of clip.cpp, that's really cool). Do you have links to other GGML-using projects that aren't LLMs?

philpax avatar Jun 26 '23 08:06 philpax

I'm afraid defining a closed set of metadata vocabulary might be a restrictive design that hinders the speed of innovation in the GGML community. My suggestion would be to define a format for encoding freeform key-value pairs:

One possible way might be

ggml_magic
number_of_pairs
[
    (key_length, key, value_type, value)
    ...
]

value_type can be used to indicate whether the value is an integer (e.g., value_type = 0) or, when value_type > 0, the length of a string. Then we can easily define a function that extracts metadata from a given file. This is only a morning idea, but the point is that we need to define the format, not the content.
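
(For illustration, a minimal sketch of how such a freeform header could be parsed; the little-endian byte order, 4-byte magic, and uint32 field widths are assumptions made for this sketch, not part of any spec.)

    import struct

    def read_freeform_metadata(path):
        """Parse the hypothetical (key_length, key, value_type, value) layout above.

        Assumed encoding: little-endian, 4-byte magic, uint32 counts/lengths,
        value_type == 0 -> the value is a uint32 integer,
        value_type >  0 -> the value is a UTF-8 string of value_type bytes.
        """
        metadata = {}
        with open(path, "rb") as f:
            magic = f.read(4)                                   # ggml_magic
            (n_pairs,) = struct.unpack("<I", f.read(4))         # number_of_pairs
            for _ in range(n_pairs):
                (key_len,) = struct.unpack("<I", f.read(4))     # key_length
                key = f.read(key_len).decode("utf-8")           # key
                (value_type,) = struct.unpack("<I", f.read(4))  # value_type
                if value_type == 0:                             # integer value
                    (value,) = struct.unpack("<I", f.read(4))
                else:                                           # string of length value_type
                    value = f.read(value_type).decode("utf-8")
                metadata[key] = value
        return magic, metadata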

Almost anything can be reduced to this type of key-value pair. If needed, we can extend it to a nested structure as well, but I believe that the metadata keys should be open and no model-specific metadata should be defined.

The GGML manifesto states that "The AI models are improving at a very high rate and it is important to stay on top of it", and I think we must not define such keys in order to stay on top of improvements in AI.

monatis avatar Jun 26 '23 09:06 monatis

I'm afraid defining a closed set of metadata vocabulary might be a restrictive design that hinders the speed of innovation in the GGML community. My suggestion would be to define a format for encoding freeform key-value pairs:

One possible way might be

ggml_magic
number_of_pairs
[
    (key_length, key, value_type, value)
    ...
]

value_type can be used to indicate whether the value is an integer (e.g., value_type = 0) or, when value_type > 0, the length of a string. Then we can easily define a function that extracts metadata from a given file. This is only a morning idea, but the point is that we need to define the format, not the content.

Almost anything can be reduced to this type of key-value pair. If needed, we can extend it to a nested structure as well, but I believe that the metadata keys should be open and no model-specific metadata should be defined.

The GGML manifesto states that "The AI models are improving at a very high rate and it is important to stay on top of it", and I think we must not define such keys in order to stay on top of improvements in AI.

Yes, that's what's described in the spec. It's not a closed set; the keys that are specified are standardized and guaranteed to always share the same meaning, but users can extend it with their own as required to serve their needs. Ideally, the more popular KVs would end up being standardized as well.

philpax avatar Jun 26 '23 10:06 philpax

Do you have links to other GGML-using projects that aren't LLMs?

check out the README :) https://github.com/ggerganov/ggml#updates

Green-Sky avatar Jun 26 '23 10:06 Green-Sky

I've addressed the review comments 👍

Just asking the people behind each implementation: can you suggest metadata values that should be standardized, if any?

  • @ggerganov: whisper.cpp
  • @monatis: clip.cpp
  • @saharNooby: rwkv.cpp
  • @PABannier: biogpt.cpp/encodec.cpp
  • @skeskinen: bert.cpp

philpax avatar Jun 26 '23 23:06 philpax

How does this spec relate to LoRA (ggla)? I don't see it mentioned anywhere. @slaren

Green-Sky avatar Jun 27 '23 10:06 Green-Sky

How does this spec relate to LoRA (ggla)? I don't see it mentioned anywhere. @slaren

Good spot, I actually noticed that this morning and hadn't updated it. What should it look like? I imagine that you want it to

  • match an existing model exactly, so that it can't be misapplied
  • be marked as a LoRA

Maybe a subset of the fields of the original LLM, with a general.lora = true field?

philpax avatar Jun 27 '23 12:06 philpax

The LoRA files are very simple currently: just a tiny header with a few parameters and a bunch of tensors. I think it should work fine with the way this is designed.

The only parameters stored in the header currently are the rank and alpha values of the LoRA. This is not enough to support every type of LoRA, so I wouldn't bother with defining this in a very detailed way for now; we can look into it later.

slaren avatar Jun 27 '23 14:06 slaren

What is the difference between max_seq_len and context_length? Aren't both the maximum usable/recommended context length?

klosax avatar Jun 27 '23 18:06 klosax

I suggest use of special key-values to identify special tokens:

tokenizer.bos_token_id Beginning of sequence marker
tokenizer.eos_token_id End of sequence marker
tokenizer.unk_token_id Unknown token
tokenizer.sep_token_id Separator token
tokenizer.pad_token_id Padding token
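
(For illustration, a minimal sketch of how an executor might consume these keys once a file's key-value pairs have been parsed; the metadata dict and helper below are assumptions, and only the key names come from the suggestion above.)

    # metadata: key-value pairs already parsed from the file, e.g.
    # {"tokenizer.bos_token_id": 1, "tokenizer.eos_token_id": 2, ...}
    SPECIAL_TOKEN_KEYS = {
        "bos": "tokenizer.bos_token_id",  # beginning of sequence
        "eos": "tokenizer.eos_token_id",  # end of sequence
        "unk": "tokenizer.unk_token_id",  # unknown token
        "sep": "tokenizer.sep_token_id",  # separator
        "pad": "tokenizer.pad_token_id",  # padding
    }

    def special_tokens(metadata):
        """Return whichever special token IDs the model actually defines."""
        return {name: metadata[key]
                for name, key in SPECIAL_TOKEN_KEYS.items()
                if key in metadata}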

klosax avatar Jun 27 '23 19:06 klosax

What is the difference between max_seq_len and context_length? Aren't both the maximum usable/recommended context length?

There is no difference; I suppose it just came into existence because the Falcon implementation was derived from MPT/Replit, which also uses this naming.

jploski avatar Jun 27 '23 20:06 jploski

Updated with latest round of feedback.

philpax avatar Jun 27 '23 21:06 philpax

Updated with latest round of feedback.

Note that @saharNooby and I are the maintainer and a contributor (respectively) of the popular RWKV inference library RWKV.cpp, so the parameters we proposed are indeed the ones needed to run inference with the model properly. You could add them without much trouble.

LoganDark avatar Jun 27 '23 22:06 LoganDark

Updated with latest round of feedback.

Note that @saharNooby and I are the maintainer and a contributor (respectively) of the popular RWKV inference library RWKV.cpp, so the parameters we proposed are indeed the ones needed to run inference with the model properly. You could add them without much trouble.

Oh, no, I know this; I was just giving you two an opportunity to agree on what the names of those fields should be before I wrote anything up.

philpax avatar Jun 27 '23 22:06 philpax

I suggest use of special key-values to identify special tokens:

tokenizer.bos_token_id Beginning of sequence marker
tokenizer.eos_token_id End of sequence marker
tokenizer.unk_token_id Unknown token
tokenizer.sep_token_id Separator token
tokenizer.pad_token_id Padding token

Some models have special tokens for separating two (or more) sides of a chat conversation—OpenAI is one example of a company that trains models like this, in an attempt to disallow the "user" from performing prompt injections, by giving the "system" higher authority. How would this be represented?

LoganDark avatar Jun 28 '23 23:06 LoganDark

I suggest use of special key-values to identify special tokens: tokenizer.bos_token_id Beginning of sequence marker tokenizer.eos_token_id End of sequence marker tokenizer.unk_token_id Unknown token tokenizer.sep_token_id Separator token tokenizer.pad_token_id Padding token

Some models have special tokens for separating two (or more) sides of a chat conversation—OpenAI is one example of a company that trains models like this, in an attempt to disallow the "user" from performing prompt injections, by giving the "system" higher authority. How would this be represented?

That could potentially be something for the prompting section, which I've left undefined for now as I need to see what the current breadth of prompting strategies is.

philpax avatar Jun 29 '23 07:06 philpax

I suggest use of special key-values to identify special tokens: tokenizer.bos_token_id Beginning of sequence marker tokenizer.eos_token_id End of sequence marker tokenizer.unk_token_id Unknown token tokenizer.sep_token_id Separator token tokenizer.pad_token_id Padding token

Some models have special tokens for separating two (or more) sides of a chat conversation—OpenAI is one example of a company that trains models like this, in an attempt to disallow the "user" from performing prompt injections, by giving the "system" higher authority. How would this be represented?

That could potentially be something for the prompting section, which I've left undefined for now as I need to see what the current breadth of prompting strategies is.

Those would still have associated tokens, like delimiters at the very least. Those token IDs would be useful to know for models that do this. I'm not aware of any to compare with, other than the fact that OpenAI uses something like

<|role_name|>system<|role_message|>This is a system message.<|role_end|><|role_name|>user<|role_message|>hello, ChatGPT!<|role_end|><|endoftext|>

LoganDark avatar Jun 29 '23 12:06 LoganDark

I suggest use of special key-values to identify special tokens: tokenizer.bos_token_id Beginning of sequence marker tokenizer.eos_token_id End of sequence marker tokenizer.unk_token_id Unknown token tokenizer.sep_token_id Separator token tokenizer.pad_token_id Padding token

Some models have special tokens for separating two (or more) sides of a chat conversation—OpenAI is one example of a company that trains models like this, in an attempt to disallow the "user" from performing prompt injections, by giving the "system" higher authority. How would this be represented?

That could potentially be something for the prompting section, which I've left undefined for now as I need to see what the current breadth of prompting strategies is.

Those would still have associated tokens, like delimiters at the very least. Those token IDs would be useful to know for models that do this. I'm not aware of any to compare with, other than the fact that OpenAI uses something like

<|role_name|>system<|role_message|>This is a system message.<|role_end|><|role_name|>user<|role_message|>hello, ChatGPT!<|role_end|><|endoftext|>

Aye, I was thinking something like

mpt.prompting.type = "conversational_system"
mpt.prompting.conversational_system.role_name_token_id = "<|role_name|>"
# ...

but I'm reticent to nail anything down now, because the space is still very much being explored. The thinking here is that you'd keep the tokens near the place where they're relevant.

I'm thinking that might be something we revisit in some time, just so we can wait for things to shake out a bit and see what kind of standardisation makes sense.

philpax avatar Jun 29 '23 23:06 philpax

Why not use a less cryptic key naming?

[llm].hidden_size --> [llm].embedding_length
[llm].n_ff --> [llm].feedforward_length

[llm].n_layers --> [llm].num_layers
[llm].attention.n_heads --> [llm].attention.num_heads
[llm].rope.n_dims --> [llm].rope.num_dims

or, even better, change n_ and num_ to _count:

[llm].n_layers --> [llm].layer_count
[llm].attention.n_heads --> [llm].attention.head_count
[llm].rope.n_dims --> [llm].rope.dimension_count

klosax avatar Jun 30 '23 08:06 klosax

Why not use a less cryptic key naming?

Heavy +1 for @klosax's suggestion. I guess the only reason for the original key names is to keep them consistent with the llama.cpp code and the general naming style in ggml.

saharNooby avatar Jun 30 '23 09:06 saharNooby

Why not use a less cryptic key naming?

Heavy +1 for @klosax suggestion. I guess the only reason for original key names is to keep them consistent with llama.cpp code and general naming style in ggml.

Which in turn depends on the naming style in config.json / the HuggingFace transformers library. I would suggest not inventing new names (however descriptive) for concepts that have already been named in a different way previously. If possible, we don't want to pollute the world with n different names for the same thing (which is unfortunately already happening in HuggingFace transformer implementations, but at least we could try not to add to the misery). (I also like short names - will take n_dims versus dimension_count any day.)

jploski avatar Jun 30 '23 09:06 jploski

Why not use a less cryptic key naming?

[llm].hidden_size --> [llm].embedding_length
[llm].n_ff --> [llm].feedforward_length

[llm].n_layers --> [llm].num_layers
[llm].attention.n_heads --> [llm].attention.num_heads
[llm].rope.n_dims --> [llm].rope.num_dims

or, even better, change n_ and num_ to _count:

[llm].n_layers --> [llm].layer_count
[llm].attention.n_heads --> [llm].attention.head_count
[llm].rope.n_dims --> [llm].rope.dimension_count

I personally prefer this (see my original proposal in #220), but people have requested compatibility with the existing GGML naming, and there's no consistency in the Hugging Face transformer keys, either. Let's see what others think about establishing our own less-cryptic naming.

(I also like short names - will take n_dims versus dimension_count any day.)

I generally try to match variable name length to the size of their scope / how long they'll be around. These models could potentially be around for months or for years - I would personally (not necessarily as the author of this PR) prefer something that's self-evident that developers can then pull from and name as they like in their own code.

philpax avatar Jun 30 '23 10:06 philpax

(I also like short names - will take n_dims versus dimension_count any day.)

I generally try to match variable name length to the size of their scope / how long they'll be around. These models could potentially be around for months or for years - I would personally (not necessarily as the author of this PR) prefer something that's self-evident that developers can then pull from and name as they like in their own code.

Yes, that is a valid consideration.

Another possibility would be to have short "backward-compatible" names accompanied by longer descriptions; kind of like how you can have short command-line parameter names and a --help option explaining what they mean. You also have to keep in mind the target audience - people with experience with GGML or Python implementations may prefer something like "n_dims", while inexperienced users may prefer more enlightening/educational names. However, given the nature of the subject, it is doubtful whether just spelling out a name is enough for anyone who doesn't already know what is meant to gain an understanding. So it might be that the only goal should be to choose names that are not confused with each other (and leave the rest to longer descriptions / documentation).

jploski avatar Jun 30 '23 11:06 jploski

HuggingFace transformers actually named n_ff intermediate_size. Not sure which would be better.

Strong +1 for @klosax's naming scheme from me.

Suggestion: don't use a bare n_ prefix for number variables; use at least num_, since there are multiple things starting with an n (like normalization).

Green-Sky avatar Jun 30 '23 11:06 Green-Sky

I tend to prefer _count instead of num_, as in gguf_header_t:

    uint32_t tensor_count;
    uint32_t metadata_kv_count;

and gguf_tensor_info_t:

    uint32_t n_dimensions; --> uint32_t dimension_count;
    uint32_t dimensions[n_dimensions]; --> uint32_t dimensions[dimension_count];
    uint32_t n_elements; --> uint32_t element_count;

klosax avatar Jul 02 '23 14:07 klosax

More descriptive: [llm].rope.scale --> [llm].rope.context_scale

klosax avatar Jul 02 '23 14:07 klosax

@philpax

Parameters for whisper.cpp:

    whisper.encoder.n_ctx: u32             // (a.k.a. `whisper.encoder.context_length`)
    whisper.encoder.n_embd: u32            // (a.k.a. `whisper.encoder.hidden_size`)
    whisper.encoder.n_layers: u32
    whisper.encoder.n_mels: u32
    whisper.encoder.attn.n_heads: u32      // (a.k.a. `whisper.encoder.attention.n_heads`)
    whisper.decoder.n_ctx: u32             // (a.k.a. `whisper.decoder.context_length`)
    whisper.decoder.n_embd: u32            // (a.k.a. `whisper.decoder.hidden_size`)
    whisper.decoder.n_layers: u32
    whisper.decoder.attn.n_heads: u32      // (a.k.a. `whisper.decoder.attention.n_heads`)

Regarding the naming convention (i.e. more descriptive vs. shorter, etc.) - it will be difficult to find a consensus on the best naming strategy. I don't mind either way we choose, though my preference is as shown above.

ggerganov avatar Jul 02 '23 17:07 ggerganov

OK, I'll apply the changes discussed soon. I'm going to pick the longer names but include the shorter names in their descriptions, so that people can port code. My reasoning is that if the names are descriptive, the reader will be less likely to need to read the description to know what a field is at a glance.

philpax avatar Jul 02 '23 23:07 philpax

Sorry about the delay, have been busy. I've updated the spec to reflect our discussion, and to include whisper.cpp. I think it's in a pretty good place now.

I think we'll need the following to implement/validate the spec:

  • Python script to convert a Hugging Face model to a valid GGUF
  • Code to load a GGUF in ggml and use it with the existing implementations

Luckily, @klosax already did these for v1 of the spec! Hopefully, we can just update this code and we should be good to go. (klosax, is that something you'd be interested in doing?)
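
(As a rough illustration of what the write path of such a conversion script could look like, here is a toy Python sketch; the magic, version, type tags, and field widths below are simplifying assumptions, and the rendered gguf.md linked above remains the authoritative reference for the actual layout, value types, and alignment rules.)

    import struct

    def write_string(f, s):
        data = s.encode("utf-8")
        f.write(struct.pack("<I", len(data)))  # assumed: uint32 length prefix
        f.write(data)

    def write_gguf_like(path, metadata, tensors):
        """Toy writer for a GGUF-style container (illustrative only).

        Supports only uint32 and string metadata values, and writes no
        tensor info or tensor data; see the spec for the real format.
        """
        with open(path, "wb") as f:
            f.write(b"GGUF")                           # magic
            f.write(struct.pack("<I", 1))              # version (assumed)
            f.write(struct.pack("<I", len(tensors)))   # tensor_count
            f.write(struct.pack("<I", len(metadata)))  # metadata_kv_count
            for key, value in metadata.items():
                write_string(f, key)
                if isinstance(value, int):
                    f.write(struct.pack("<I", 0))      # assumed type tag: uint32
                    f.write(struct.pack("<I", value))
                else:
                    f.write(struct.pack("<I", 1))      # assumed type tag: string
                    write_string(f, str(value))
            # Tensor info and aligned tensor data would follow here.

A Hugging Face conversion script would then mostly be a matter of mapping config.json entries onto the standardized keys and appending the tensor info and data.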


From llm's side, we'll implement GGUF support soon-ish, but we should probably have a common set of test models first.

We could write the migration program proposed in the spec as we have access to a greater range of functionality, which could also potentially be used as a source of models. Not sure if that should live with ggml/llama.cpp, though. (It's easy enough for us to provide binaries for the major platforms.)

philpax avatar Jul 09 '23 22:07 philpax

Luckily, @klosax already did these for v1 of the spec! Hopefully, we can just update this code and we should be good to go.

I think this should be implemented in llama.cpp. Other models could be added to llama.cpp. My implementation only has basic examples and does not have all the great features of llama.cpp.

klosax avatar Jul 11 '23 17:07 klosax