
Extend ggml format to include a description of the model.

Open darxkies opened this issue 1 year ago • 14 comments

On Hugging Face there are many files called ggml-model-f16.bin or similar. Once downloaded, the user can rename them, and the information about their origin is lost. Updating the file becomes difficult when the origin is unknown. It would be easier to extend the ggml format so that the creator can embed a description of the model when generating it using 'quantize'.

darxkies avatar May 23 '23 16:05 darxkies

not a fan of this idea. Not only would it break all prior formats for little reason again, it would also be unnecessary padding for those who don't need such information. And how large would you make it? GGML is a packed format. It's not like JSON where you can define arbitrary new fields of arbitrary lengths.

My suggestion instead would be to include such metadata either in the file name, or as an accompanying .txt file.

LostRuins avatar May 23 '23 16:05 LostRuins

I am not very familiar with the structure of the ggml files. I only know that it is a binary format and that it is really very compact.

One way to solve it, without having to change the format entirely, is to just append the text at the end of the ggml file. If the file is longer than it should be, based on the header, then the rest of the file is either ignored or treated as "meta" data. Just an idea.
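Roughly, the append-at-the-end idea could look like this in Python (a sketch, not the real ggml layout — it assumes the loader can compute the expected payload size from the header, so anything past that offset is optional metadata):

```python
import json

def append_metadata(path, meta: dict):
    # Append a JSON blob after the end of the model payload.
    with open(path, "ab") as f:
        f.write(json.dumps(meta).encode("utf-8"))

def read_metadata(path, expected_size: int):
    # Anything past the size implied by the header is treated as metadata.
    with open(path, "rb") as f:
        f.seek(expected_size)
        tail = f.read()
    return json.loads(tail) if tail else None
```

A loader that doesn't know about the trailing blob would simply stop reading at the expected size and ignore it.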

darxkies avatar May 23 '23 16:05 darxkies

That "blob" at the end of the ggml file could also be used to describe the different prompt formats, stop sequences, and so on, that the model supports. ggml wouldn't have to know how to interpret the blob; it would just have to read it and pass it on.

darxkies avatar May 23 '23 17:05 darxkies

I like this idea, as there are so many different LLaMA-family models with different prompts. We already have the vocab embedded; for sure, we can add other information like prompts, a description, even license information.

howard0su avatar May 26 '23 06:05 howard0su

not a fan of this idea. Not only would it break all prior formats for little reason again, it would also be unnecessary padding for those who don't need such information.

I'm a little surprised by your position on this - it's impossible to know what model architecture you're dealing with, or how it should be configured, using the current GGML format, because all the contextual information is lost.

You can't figure out if you have a LLaMA or a GPT-NeoX model without hacks, because the only things you can use to identify that (the tensor names) are located after the hyperparameters, which you need to know the model architecture to read. You can scan through the file for the tensor names as strings, but that's brittle and it's unnecessary.
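The brittle hack being described amounts to sniffing the raw bytes for telltale tensor names. A sketch of why it works at all (the marker strings here are illustrative, not an authoritative list):

```python
# Map a characteristic tensor-name substring to an architecture guess.
# This is exactly the fragile scan the current format forces on loaders.
ARCH_MARKERS = {
    b"gpt_neox.layers": "gptneox",
    b"layers.0.attention.wq.weight": "llama",
}

def guess_architecture(blob: bytes):
    for marker, arch in ARCH_MARKERS.items():
        if marker in blob:
            return arch
    return None  # unknown -- the file carries no architecture field
```

It breaks as soon as a model renames a tensor or a new architecture reuses a name, which is the point: identification shouldn't depend on string scanning.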

I'd happily take another format break (as long as it's managed correctly with a migration path!) if it allows all GGML consumers (including koboldcpp, llm, etc.) to run any arbitrary supported model, and lets future models extend the information they include without requiring a format break. (Imagine if the inclusion of use_parallel_residual in the GPT-NeoX architecture's hyperparameters didn't create an incompatible variant of the format!)

There's an issue on the GGML repo about this: https://github.com/ggerganov/ggml/issues/147

As for padding... the models are already gigabytes. Adding a few kilobytes is unnoticeable.

And how large would you make it? GGML is a packed format. It's not like JSON where you can define arbitrary new fields of arbitrary lengths.

Just embedding JSON would be an easy fix to this, but that would require a JSON reader/writer to be available which I assume is out of scope.

What I'd propose as an intermediary solution is a very simple binary key value format, where the key is stored as a string with len (same as the tensors), and the value is stored as (type tag, value). These k/v pairs are unordered, so they can be present and in any order, and new hyperparameters can be added without requiring a breaking format change.

Model authors can then use this mechanism to include additional information about source/license/prompting, if they're so inclined.

My suggestion instead would be to include such metadata either in the file name, or as an accompanying .txt file.

That requires model creators to be consistent with filenames and to maintain a filename schema, which is a comparatively large ask of a community compared to just embedding the relevant information in the model itself.

Text files can also be very easily lost, and would need to be structured if they're meant to be consumed by model loaders.

One of the greatest strengths of the GGML format is its one-file-one-model design; unlike HuggingFace, where you have to clone an entire folder, you can distribute an entire model as a single file as long as you have a compatible executor. We should make the most of this.

One way to solve it, without having to change the format entirely, is to just append the text at the end of the ggml file. If the file is longer than it should be, based on the header, then the rest of the file is either ignored or treated as "meta" data. Just an idea.

For llm, I was considering embedding this extra contextual information as a U8 tensor containing JSON, but I was concerned about a loader trying to load that faux-tensor as an actual tensor. That isn't a problem for any of the primary executors at present (since they look for specific tensors), but I didn't want to risk it.
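The faux-tensor idea could be sketched like this (the reserved name and the record layout here are hypothetical, invented for illustration — this is not how llm or llama.cpp actually store tensors):

```python
import json

RESERVED_NAME = "__metadata__"  # hypothetical reserved tensor name

def metadata_as_u8_tensor(meta: dict):
    # Represent the JSON blob as a 1-D "tensor" of bytes (dtype u8).
    data = json.dumps(meta).encode("utf-8")
    return {"name": RESERVED_NAME, "dims": [len(data)], "dtype": "u8", "data": data}

def metadata_from_tensor(tensor):
    # Reverse step: decode the byte "tensor" back into structured metadata.
    return json.loads(tensor["data"].decode("utf-8"))
```

The risk mentioned above is visible here: a loader that doesn't special-case the reserved name would try to treat this record as real weights.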

philpax avatar May 26 '23 15:05 philpax

@philpax hmm I get your point, but I think it will end up as a https://xkcd.com/927/ situation.

The problem is that such a "free comment field" is by definition arbitrary data. It's a big, unstructured scratch pad, and anyone who wants to add their thing will do so - and then, as an integrator, it becomes even more work, since you can't ensure that the field you want exists, and everyone will end up shoving whatever they want into it. If everyone were to use JSON, that would already be hard, but at least gracefully handling an extra or missing field isn't that difficult.

But data in a packed struct? Already we have one situation like this on the ggml repo where earlier NeoX models do not have the use_parallel_residual field, which ends up being at the ftype offset (because packed struct). I had to resort to some unpleasant hacks for my loader to handle both old and new NeoX formats correctly.

Now imagine if every NeoX author was now adding random fields at their discretion, maybe RedPajama uses the first 100 bytes of this block to add some extra vocab, meanwhile a Pythia enthusiast uses that to store default stochastic sampler values. And then a third party just stores a giant UTF-8 string that contains the huggingface tags for their model cause why not.

Bear in mind that since it's free form data - there's no indicator or no enforced standard, so even the same author might do different things for different versions.

It would be basically unusable.

LostRuins avatar May 27 '23 03:05 LostRuins

Sorry, meant to get back to you earlier.

I completely agree with you about the mess - I brought up the use_parallel_residual break because it was annoying for us, too.

That's why I'm suggesting that it's structured, and that ordering is irrelevant. That is, instead of storing the hyperparameters as

n_vocab: i32,
n_ctx: i32,
n_embd: i32,
n_head: i32,
n_layer: i32,
n_rot: i32,
use_parallel_residual: bool,
file_type: i32,

it's instead stored as an array of

key_length: u32,
key: [u8; key_length],
value_type: ValueType,
value: raw binary little-endian representation of value

so that you might have

[
  {
    key_length: 6,
    key: 'n_embd',
    value_type: ValueType::I32,
    value: 2560
  },
  {
    key_length: 21,
    key: 'use_parallel_residual',
    value_type: ValueType::Bool,
    value: true
  },
  ...
]

The brackets are for notational convenience - in practice, they're flatpacked and would come after each other in the binary. The ValueType enum would be standardized (like ggml_type), and so would the ways to represent each type of value.

This would allow for the addition of more parameters, readers to be more resilient to models coming from other sources, etc, because you'd be looking up values by key and trying to read them by binary.

It wouldn't be freeform - the storage medium would be entirely structured, so that any reader could pick up data from it without having to know about the other fields. As time goes on, I imagine this would look like ID3v2, with commonly-used tags being standardized by the community for whatever metadata they want to attach.
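A minimal Python sketch of this key/value section, assuming little-endian u32 lengths as described above (the numeric type tags are made up for illustration; a real spec would standardize its own enum):

```python
import struct

# Illustrative value-type tags; a standardized enum would define the real ones.
T_I32, T_BOOL, T_STR = 0, 1, 2

def write_kv(pairs: dict) -> bytes:
    out = bytearray()
    for key, (vtype, value) in pairs.items():
        kb = key.encode("utf-8")
        out += struct.pack("<I", len(kb)) + kb + struct.pack("<I", vtype)
        if vtype == T_I32:
            out += struct.pack("<i", value)
        elif vtype == T_BOOL:
            out += struct.pack("<B", int(value))
        elif vtype == T_STR:
            sb = value.encode("utf-8")
            out += struct.pack("<I", len(sb)) + sb
    return bytes(out)

def read_kv(buf: bytes) -> dict:
    pairs, off = {}, 0
    while off < len(buf):
        (klen,) = struct.unpack_from("<I", buf, off); off += 4
        key = buf[off:off + klen].decode("utf-8"); off += klen
        (vtype,) = struct.unpack_from("<I", buf, off); off += 4
        if vtype == T_I32:
            (val,) = struct.unpack_from("<i", buf, off); off += 4
        elif vtype == T_BOOL:
            val = bool(buf[off]); off += 1
        elif vtype == T_STR:
            (slen,) = struct.unpack_from("<I", buf, off); off += 4
            val = buf[off:off + slen].decode("utf-8"); off += slen
        pairs[key] = (vtype, val)
    return pairs
```

Because the reader looks values up by key, ordering is irrelevant, and the type tag alone tells it how many bytes to consume, so unknown keys can be skipped without breaking.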

The main thing I want to achieve is to a) allow the reading of a GGML file knowing nothing else about it, even if you can't do anything with it and b) allow for community model authors to add useful metadata in a way that won't cause breakage for future readers, while still remaining maximally compatible.

philpax avatar May 28 '23 16:05 philpax

@philpax

I agree with the proposed extension - we should implement it

ggerganov avatar May 28 '23 17:05 ggerganov

In the longer run, a "ggzip" package would be cool, containing:

  • config.json (flat structure with primitive types only)
  • license.txt (all licenses applicable to the model)
  • the weights binary itself, similar to now

When generated with "mmap support", the zip compression would be 0; that should allow mapping the binary 1:1 from within the zip. Of course, this hypothetical ggzip format would be generated just like gg files are generated now.
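A minimal sketch of this hypothetical ggzip layout using Python's zipfile (the member filenames are assumptions from the list above; ZIP_STORED means no compression, so the weights remain a contiguous uncompressed byte range that could in principle be mmap'd in place):

```python
import json
import zipfile

def write_ggzip(path, config: dict, license_text: str, weights: bytes):
    # ZIP_STORED = compression level 0: members are stored verbatim.
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_STORED) as z:
        z.writestr("config.json", json.dumps(config))
        z.writestr("license.txt", license_text)
        z.writestr("model.bin", weights)

def read_ggzip(path):
    with zipfile.ZipFile(path) as z:
        return (json.loads(z.read("config.json")),
                z.read("license.txt").decode("utf-8"),
                z.read("model.bin"))
```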

The primary benefit of that approach is that, a bit like PyTorch, a human-readable JSON file could define how the weights are to be used. Especially with superior (and legally unencumbered) open models like Falcon 40B, and completely free models like those of Stability, pushing forward, "llama" is likely to soon be a relic. All those new models need slightly different processing; they need an adaptive or specialized eval loop. This is going to get worse with more and more superior models in the coming months.

cmp-nct avatar May 28 '23 21:05 cmp-nct

I like the flexibility of @philpax's suggestion. A few fields should be enforced as mandatory for a model to be considered compliant - perhaps the currently existing fields. Adding a field to indicate the architecture name would be nice too.

If the file header itself changes, maybe we should change the file magic one last time, e.g. gguf for g-g-universal-format (actually I don't care, it could be anything different, just a random idea that popped up lol).

Some long-reaching considerations would be what to enforce - max length of a key/value? max amount of space reserved for these values? any sort of padding or alignment between elements to reserve space for future uses which we cannot think of right now?

LostRuins avatar May 29 '23 04:05 LostRuins

Awesome! Yeah, the magic might be worth changing if we make a change this comprehensive; I don't have any particularly strong opinions on that (GGUF / GGJTv4 would be handled the same way for us).

Regarding the considerations: good questions, I have no immediate answers but I'm fine with shipping without. Realistically, I can't imagine the metadata being a large part of the model compared to the tensors, and people who abuse the flexibility in their uploaded models will be policed by the community.

philpax avatar May 29 '23 09:05 philpax

I stumbled across this issue looking through the "help wanted" tag, and I am wondering: is this issue still relevant, or has the goal been achieved by the switch to GGUF? As far as I understand, the new format follows the principles described here (while #220 appears to be more ambitious and also includes discussions about a unified conversion tool).

mgroeber9110 avatar Oct 06 '23 19:10 mgroeber9110

Yes, I would say this is more or less technically complete.

philpax avatar Oct 06 '23 19:10 philpax

Since the GGUF file format implements this, this issue is resolved and should be closed.

arnfaldur avatar May 15 '24 00:05 arnfaldur