GGUF file format specification
Closes #220.
Rendered: https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md
Defines a complete specification for the proposed GGUF file format, which should generically describe models to be loaded by any compatible executor.
This is a first draft, so there's still some work that needs to be done - I need to fill in the TODOs and clarify a few things. If you have any suggestions for what should go in the TODOs, please let me know!
Changes from the version in the issue include:
- changing of several of the key-value pairs, including splitting them out into per-architecture key-values
- decoupling tensor info from tensor data, and aligning both
- moving the embedded vocabulary into the metadata, so that it is no longer special-cased
General question: is this format only for LLMs? What about vision stuff and multiple models in one file? e.g. https://github.com/monatis/clip.cpp does that.
Nope! LLMs are just the use-case I'm familiar with. We should describe whisper.cpp here and discuss standardising the others too (this is the first I've heard of clip.cpp, that's really cool). Do you have links to other GGML-using projects that aren't LLMs?
I'm afraid defining a closed set of metadata vocabulary might be a restricting design that hinders the speed of innovation in the GGML community. My suggestion would be to define a certain format to encode freeform key-value pairs:
One possible way might be
ggml_magic
number_of_pairs
[
(key_length, key, value_type, value)
...
]
value_type can be used to indicate whether it's an integer (e.g., value_type=0) or the length of a string if value_type > 0. Then we can define a function that extracts metadata from a given file easily. This is only a morning idea, but the whole point is that we need to define the format, not the content.
Almost anything can be reduced to key-value pairs of this type. If needed, we can extend it to a nested structure as well, but I believe that the metadata keys should be open and no model-specific metadata should be defined.
The GGML manifesto states that "The AI models are improving at a very high rate and it is important to stay on top of it." and I think we must not define such keys in order to stay on top of improvements in AI.
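For concreteness, here is a minimal sketch of how such a freeform layout could be read, assuming little-endian uint32 fields and the (key_length, key, value_type, value) layout sketched above; the function name and exact field widths are my own assumptions, not part of any proposal:

```python
import struct

def read_freeform_metadata(path):
    """Hypothetical reader for the freeform layout sketched above (not GGUF itself).

    Assumes little-endian uint32 fields, value_type == 0 meaning a uint32 integer
    value, and value_type > 0 meaning a UTF-8 string of that many bytes.
    """
    pairs = {}
    with open(path, "rb") as f:
        magic, number_of_pairs = struct.unpack("<II", f.read(8))
        for _ in range(number_of_pairs):
            (key_length,) = struct.unpack("<I", f.read(4))
            key = f.read(key_length).decode("utf-8")
            (value_type,) = struct.unpack("<I", f.read(4))
            if value_type == 0:
                (value,) = struct.unpack("<I", f.read(4))   # integer value
            else:
                value = f.read(value_type).decode("utf-8")  # string of value_type bytes
            pairs[key] = value
    return magic, pairs
```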
Yes, that's what's described in the spec. It's not a closed set; the keys that are specified are standardized and guaranteed to always share the same meaning, but users can extend it with their own as required to serve their needs. Ideally, the more popular KVs would end up being standardized as well.
Do you have links to other GGML-using projects that aren't LLMs?
check out the README :) https://github.com/ggerganov/ggml#updates
I've addressed the review comments 👍
Just asking the people behind each implementation: can you suggest metadata values that should be standardized, if any?
- @ggerganov: whisper.cpp
- @monatis: clip.cpp
- @saharNooby: rwkv.cpp
- @PABannier: biogpt.cpp/encodec.cpp
- @skeskinen: bert.cpp
How does this spec relate to LoRA (ggla)? I don't see it mentioned anywhere. @slaren
Good spot, I actually noticed that this morning and hadn't updated it. What should it look like? I imagine that you want it to
- match an existing model exactly, so that it can't be misapplied
- be marked as a LoRA
Maybe a subset of the fields of the original LLM, with a general.lora = true field?
The LoRA files are very simple currently, it's just a tiny header with a few parameters and a bunch of tensors. I think it should work fine with the way this is designed currently.
The only parameters stored in the header currently are the rank and alpha values of the LoRA. This is not enough to support every type of LoRA, so I wouldn't bother with defining this in a very detailed way for now, we can look into it later.
What is the difference between max_seq_len and context_length? Aren't both the maximum usable/recommended context length?
I suggest use of special key-values to identify special tokens:
- tokenizer.bos_token_id: Beginning of sequence marker
- tokenizer.eos_token_id: End of sequence marker
- tokenizer.unk_token_id: Unknown token
- tokenizer.sep_token_id: Separator token
- tokenizer.pad_token_id: Padding token
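For illustration, a loader could pull these out of already-parsed metadata and simply skip the ones a model doesn't define (a rough sketch; the metadata dict and function name are hypothetical, the keys are the ones suggested above):

```python
SPECIAL_TOKEN_KEYS = {
    "bos": "tokenizer.bos_token_id",  # beginning of sequence marker
    "eos": "tokenizer.eos_token_id",  # end of sequence marker
    "unk": "tokenizer.unk_token_id",  # unknown token
    "sep": "tokenizer.sep_token_id",  # separator token
    "pad": "tokenizer.pad_token_id",  # padding token
}

def special_tokens(metadata: dict) -> dict:
    # Return only the special tokens this model actually defines.
    return {name: metadata[key] for name, key in SPECIAL_TOKEN_KEYS.items() if key in metadata}
```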
What is the difference between max_seq_len and context_length? Aren't both the maximum usable/recommended context length?
There is no difference; I suppose it just came into existence because the Falcon implementation was derived from MPT/Replit, which also uses this naming.
Updated with latest round of feedback.
Note that @saharNooby and myself are the maintainer and a contributor (respectively) of a popular RWKV inference library, RWKV.cpp, so the parameters we proposed are indeed the ones needed to properly run inference with the model. You could add them without much trouble.
Oh, no, I know this; I was just giving you two an opportunity to agree on what the names of those fields should be before I wrote anything up.
I suggest use of special key-values to identify special tokens: tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.unk_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id
Some models have special tokens for separating two (or more) sides of a chat conversation—OpenAI is one example of a company that trains models like this, in an attempt to disallow the "user" from performing prompt injections, by giving the "system" higher authority. How would this be represented?
That could potentially be something for the prompting section, which I've left undefined for now as I need to see what the current breadth of prompting strategies is.
Those would still have associated tokens, like delimiters at the very least. Those token IDs would be useful to know for models which do this. I'm not aware of any to compare with, other than the fact that OpenAI uses something like
<|role_name|>system<|role_message|>This is a system message.<|role_end|><|role_name|>user<|role_message|>hello, ChatGPT!<|role_end|><|endoftext|>
Aye, I was thinking something like
mpt.prompting.type = "conversational_system"
mpt.prompting.conversational_system.role_name_token_id = "<|role_name|>"
# ...
but I'm hesitant to nail anything down now, because the space is still very much being explored. The thinking here is that you'd keep the tokens near the place where they're relevant.
I'm thinking that might be something we revisit in some time, just so we can wait for things to shake out a bit and see what kind of standardisation makes sense.
Why not use a less cryptic key naming?
[llm].hidden_size --> [llm].embedding_length
[llm].n_ff --> [llm].feedforward_length
[llm].n_layers --> [llm].num_layers
[llm].attention.n_heads --> [llm].attention.num_heads
[llm].rope.n_dims --> [llm].rope.num_dims
or even better change n_ and num_ to _count
[llm].n_layers --> [llm].layer_count
[llm].attention.n_heads --> [llm].attention.head_count
[llm].rope.n_dims --> [llm].rope.dimension_count
Why not use a less cryptic key naming?
Heavy +1 for @klosax's suggestion. I guess the only reason for the original key names is to keep them consistent with llama.cpp code and the general naming style in ggml.
Which in turn depends on the naming style in config.json / the HuggingFace transformers library. I would suggest not inventing new names (however descriptive) for concepts that have already been named in a different way previously. If possible, we don't want to pollute the world with n different names for the same thing (which is unfortunately already happening in HuggingFace transformer implementations, but at least we could try not to add to the misery). (I also like short names - will take n_dims versus dimension_count any day.)
Why not use a less cryptic key naming?
I personally prefer this (see my original proposal in #220), but people have requested compatibility with the existing GGML naming, and there's no consistency in the Hugging Face transformer keys, either. Let's see what others think about establishing our own less-cryptic naming.
(I also like short names - will take n_dims versus dimension_count any day.)
I generally try to match variable name length to the size of its scope / how long it'll be around. These models could potentially be around for months or for years - I would personally (not necessarily as the author of this PR) prefer something that's self-evident, which developers can then pull from and name as they like in their own code.
Yes, that is a valid consideration.
Another possibility would be to have short "backward-compatible" names accompanied by longer descriptions; kind of like you can have short command-line parameter names and a --help option explaining what they mean. You also have to keep in mind the target audience - people with experience with GGML or Python implementations may prefer something like "n_dims", while inexperienced users may prefer more enlightening/educational names. However, given the nature of the subject, it is doubtful whether just spelling out a name explains enough for anyone who doesn't already know what is meant to gain an understanding. So it might be that the only goal should be to have names that are not easily confused with each other (and leave the rest to longer descriptions / documentation).
Hugging Face transformers actually named n_ff intermediate_size. Not sure what would be better.
Strong +1 for @klosax's naming scheme from me.
Suggestion: don't use a single n_ for number vars, use at least num_, since there are multiple things starting with an n (like normalization).
I tend to prefer _count instead of num_ as in gguf_header_t:
uint32_t tensor_count;
uint32_t metadata_kv_count;
gguf_tensor_info_t:
uint32_t n_dimensions; --> uint32_t dimension_count;
uint32_t dimensions[n_dimensions]; --> uint32_t dimensions[dimension_count];
uint32_t n_elements; --> uint32_t element_count;
More descriptive:
[llm].rope.scale --> [llm].rope.context_scale
@philpax
Parameters for whisper.cpp:
whisper.encoder.n_ctx: u32 // (a.k.a. `whisper.encoder.context_length`)
whisper.encoder.n_embd: u32 // (a.k.a. `whisper.encoder.hidden_size`)
whisper.encoder.n_layers: u32
whisper.encoder.n_mels: u32
whisper.encoder.attn.n_heads: u32 // (a.k.a. `whisper.encoder.attention.n_heads`)
whisper.decoder.n_ctx: u32 // (a.k.a. `whisper.decoder.context_length`)
whisper.decoder.n_embd: u32 // (a.k.a. `whisper.decoder.hidden_size`)
whisper.decoder.n_layers: u32
whisper.decoder.attn.n_heads: u32 // (a.k.a. `whisper.decoder.attention.n_heads`)
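For illustration, filled-in values for these keys might look like the following (a sketch using the names proposed above; the numbers are the hyperparameters of the tiny Whisper model and are only an example, not part of the spec):

```python
# Hypothetical metadata for whisper.cpp, using the key names proposed above.
# Values correspond to the "tiny" Whisper model.
whisper_tiny_metadata = {
    "whisper.encoder.n_ctx": 1500,
    "whisper.encoder.n_embd": 384,
    "whisper.encoder.n_layers": 4,
    "whisper.encoder.n_mels": 80,
    "whisper.encoder.attn.n_heads": 6,
    "whisper.decoder.n_ctx": 448,
    "whisper.decoder.n_embd": 384,
    "whisper.decoder.n_layers": 4,
    "whisper.decoder.attn.n_heads": 6,
}
```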
Regarding the naming convention (i.e. more descriptive or shorter, etc) - it will be difficult to find a consensus about the best naming strategy. I don't mind either way we choose, though my preference is as shown above.
OK, I'll apply the changes discussed soon. I'm going to pick the longer names but include the shorter names in their description, so that people can port code. My reasoning is that if the names are descriptive, the reader will be less likely to need to read the description to know what it is at a glance.
Sorry about the delay, have been busy. I've updated the spec to reflect our discussion, and to include whisper.cpp. I think it's in a pretty good place now.
I think we'll need the following to implement/validate the spec:
- Python script to convert a Hugging Face model to a valid GGUF
- Code to load a GGUF in ggml and use it with the existing implementations
Luckily, @klosax already did these for v1 of the spec! Hopefully, we can just update this code and we should be good to go. (klosax, is that something you'd be interested in doing?)
From llm's side, we'll implement GGUF support soon-ish, but we should probably have a common set of test models first.
We could write the migration program proposed in the spec as we have access to a greater range of functionality, which could also potentially be used as a source of models. Not sure if that should live with ggml/llama.cpp, though. (It's easy enough for us to provide binaries for the major platforms.)
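As a starting point for the conversion script mentioned above, the metadata half could be little more than a mapping from config.json names to the keys discussed in this thread (a sketch under the less-cryptic naming from above; the mapping table and function are hypothetical and llama-centric, and the tensor data / binary GGUF writer are out of scope here):

```python
import json

# Hypothetical mapping from Hugging Face config.json keys (llama-style) to the
# per-architecture keys discussed above.
HF_TO_GGUF_KEYS = {
    "hidden_size": "embedding_length",
    "intermediate_size": "feedforward_length",   # a.k.a. n_ff
    "num_hidden_layers": "layer_count",
    "num_attention_heads": "attention.head_count",
    "max_position_embeddings": "context_length",
}

def hf_config_to_gguf_metadata(config_path: str, arch: str) -> dict:
    """Map a Hugging Face config.json to '[llm].*' metadata keys for architecture `arch`."""
    with open(config_path) as f:
        config = json.load(f)
    metadata = {}
    for hf_key, gguf_key in HF_TO_GGUF_KEYS.items():
        if hf_key in config:
            metadata[f"{arch}.{gguf_key}"] = config[hf_key]
    return metadata

# e.g. hf_config_to_gguf_metadata("config.json", "llama")
# -> {"llama.embedding_length": 4096, "llama.layer_count": 32, ...}
```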
Luckily, @klosax already did these for v1 of the spec! Hopefully, we can just update this code and we should be good to go.
I think this should be implemented in llama.cpp. Other models could be added to llama.cpp. My implementation only has basic examples and does not have all the great features of llama.cpp.