ggml : unified file format

Open philpax opened this issue 1 year ago • 78 comments

Obsoletes #147, #150, https://github.com/ggerganov/llama.cpp/issues/1575, https://github.com/ggerganov/llama.cpp/issues/1590, https://github.com/rustformers/llm/discussions/143, and probably some other issues across some other repositories.

Please see the spec PR at #302; the following is left as-is so you can see the original proposal.

Current state of affairs

Overview

At present, there are two GGML file formats floating around for LLMs (and potentially other ggml-using projects, I haven't looked too much at the implementation of whisper):

GGML unversioned
GGJTv3 (same as v1 and v2, but with different quantization formats), which is similar to GGML but includes a version and aligns the tensors to allow for memory-mapping

Both of these formats share the same fundamental structure:

a magic number with an optional version number
model-specific hyperparameters that include an ftype that should describe the type of the majority of the tensors, and for GGML files, the quantization version encoded using a modulo in the ftype
an embedded vocabulary, which is a list of strings with length prepended. The GGMF/GGJT formats embed an f32 score next to the strings
finally, a list of tensors with their length-prepended name, type, and (aligned, in the case of GGJT) tensor data
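
As a rough illustration of the first item, here is a minimal Python sketch of how a reader might tell the existing formats apart. The magic constants are the values I understand llama.cpp to use (the ASCII tags 'ggml', 'ggmf' and 'ggjt' read as a little-endian u32); `read_header` is a hypothetical helper name, not an existing API.

```python
import io
import struct

# Magic numbers read as a little-endian u32, mapped to (name, has_version).
# These values are quoted from memory of llama.cpp and should be double-checked.
MAGICS = {
    0x67676D6C: ("GGML", False),  # 'ggml', unversioned
    0x67676D66: ("GGMF", True),   # 'ggmf', followed by a u32 version
    0x67676A74: ("GGJT", True),   # 'ggjt', followed by a u32 version
}

def read_header(stream):
    """Return (format_name, version_or_None) from the start of a file."""
    raw = stream.read(4)
    if len(raw) != 4:
        raise ValueError("file too short to contain a magic number")
    (magic,) = struct.unpack("<I", raw)
    if magic not in MAGICS:
        raise ValueError(f"unknown magic 0x{magic:08x}")
    name, versioned = MAGICS[magic]
    version = None
    if versioned:
        (version,) = struct.unpack("<I", stream.read(4))
    return name, version

# A GGJT v3 header is the magic followed by the version number.
header = read_header(io.BytesIO(struct.pack("<II", 0x67676A74, 3)))
```

Everything after this header (hyperparameters, vocabulary, tensors) depends on the model architecture, which is exactly the problem described below.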

We have more details on the format here: https://github.com/rustformers/llm/tree/main/crates/ggml#format

Drawbacks

Unfortunately, over the last few months, there are a few issues that have become apparent with the existing models:

There's no way to identify which model architecture a given model is for, because that information isn't present
- Similarly, existing programs cannot intelligently fail upon encountering new architectures
Adding or removing any new hyperparameters is a breaking change, which is impossible for a reader to detect without herculean hacks
Each model architecture requires its own conversion script to their architecture's variant of GGML
Maintaining backwards compatibility without breaking the structure of the format requires clever tricks, like packing the quantization version into the ftype, which are not guaranteed to be picked up by readers/writers, and are not consistent between the two formats
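
To make the "packing the quantization version into the ftype" trick concrete, here is a small Python sketch. It assumes a factor of 1000, which is what I believe ggml uses for `GGML_QNT_VERSION_FACTOR`; the function names are made up for illustration.

```python
# Assumed to mirror ggml's GGML_QNT_VERSION_FACTOR; treat the value as an assumption.
QNT_VERSION_FACTOR = 1000

def pack_ftype(ftype, qnt_version):
    """Store the quantization version in the upper decimal digits of ftype."""
    return ftype + qnt_version * QNT_VERSION_FACTOR

def unpack_ftype(packed):
    """Recover (ftype, qnt_version) using integer division and modulo."""
    qnt_version, ftype = divmod(packed, QNT_VERSION_FACTOR)
    return ftype, qnt_version
```

A reader that does not know about the trick sees `2002` as an unknown ftype rather than "ftype 2, quantization version 2", which is why this kind of compatibility hack is fragile.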

GGJTv4/GGUF

Based on this, I'd like to propose a new format that's designed to be universal and addresses these issues. It is largely identical to GGJTv3, but makes one important difference: the hyperparameters are encoded as an array of key-value pairs that can be read in any order, and these hyperparameters are used to encode additional information about the model. A really important property I'd like to keep is single-file deployment: if I give you a GGUF file and you have a compatible executor, it should Just Work™ without any additional conversion or extra files.

"Specification"

To quote from https://github.com/ggerganov/llama.cpp/issues/1575#issuecomment-1566196582:

Instead of storing the hyperparameters as
n_vocab: i32,
n_ctx: i32,
n_embd: i32,
n_head: i32,
n_layer: i32,
n_rot: i32,
use_parallel_residual: bool,
file_type: i32,
it's instead stored as an array of
key_length: u32,
key: [u8; key_length],
value_type: ValueType,
value: raw binary little-endian representation of value
so that you might have
[
  {
    key_length: 6,
    key: 'n_embd',
    value_type: ValueType::I32,
    value: 2560
  },
  {
    key_length: 21,
    key: 'use_parallel_residual',
    value_type: ValueType::Bool,
    value: true
  },
  ...
]
The brackets are for notational convenience - in practice, they're flatpacked and would come after each other in the binary. The ValueType enum would be standardized (like ggml_type), and so would the ways to represent each type of value.

This would allow for the addition of more parameters, readers to be more resilient to models coming from other sources, etc, because you'd be looking up values by key and decoding them from their standardized binary representation.

It wouldn't be freeform - the storage medium would be entirely structured, so that any reader could pick up data from it without having to know about the other fields. As time goes on, I imagine this would look like ID3v2, with commonly-used tags being standardized by the community for whatever metadata they want to attach.

The main thing I want to achieve is to a) allow the reading of a GGML file knowing nothing else about it, even if you can't do anything with it and b) allow for community model authors to add useful metadata in a way that won't cause breakage for future readers, while still remaining maximally compatible.

Filling in some of the missing details:

Keys

Keys are ASCII lower_snake_case with dots for separation. Their length is stored before the key. They have a maximal length of 256 (open for debate, just a number I picked that seems like a reasonable upper bound).

This means that:

vocabulary.hugging_face is a valid key
vocabulary-hugging-face is not
Vocabulary.HuggingFace is not
vocabulary.hugging-face is not

I'd say we're looking at something like TOML keys without quotation.
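
A minimal Python sketch of a validator for these rules follows. The proposal does not say whether digits are allowed in keys; this sketch assumes they are, and `is_valid_key` and `MAX_KEY_LENGTH` are illustrative names only.

```python
import re

MAX_KEY_LENGTH = 256  # the (debatable) upper bound suggested above

# One or more lower_snake_case segments separated by single dots.
# Allowing digits is an assumption; the proposal only says "ASCII lower_snake_case".
_KEY_RE = re.compile(r"[a-z0-9_]+(?:\.[a-z0-9_]+)*")

def is_valid_key(key):
    """Return True if key follows the proposed naming rules."""
    return len(key) <= MAX_KEY_LENGTH and _KEY_RE.fullmatch(key) is not None
```

With this, `is_valid_key("vocabulary.hugging_face")` is true, and the other three examples above are rejected.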

Values

Values are one of the following types:

U32: little-endian unsigned 32-bit integer
I32: little-endian signed 32-bit integer (honestly not sure if this is necessary, I feel like a lot of the existing i32 use has been more just due to the use of int than anything)
F32: IEEE754 32-bit floating point number
String: UTF-8 string data, length prepended
Bytes: Raw binary data with no specific meaning attached, length prepended
Boolean: 1-byte value where 0 is false and 1 is true. Anything else is invalid. I considered making anything other than 0 true, but being strict on this will help detect misbehaving writers.
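
Putting the keys and value types together, here is a hedged Python sketch of how the flat-packed pairs could be written and read back. Several details are my assumptions, since the proposal does not pin them down: the numeric tags in `ValueType`, the tag being stored as a u32, the u32 length prefix for String and Bytes, and the reader knowing the number of pairs in advance.

```python
import struct
from enum import IntEnum

class ValueType(IntEnum):
    # Placeholder tag numbers; the proposal does not assign them.
    U32 = 0
    I32 = 1
    F32 = 2
    STRING = 3
    BYTES = 4
    BOOL = 5

# struct format strings for the fixed-size 4-byte types (all little-endian).
_FIXED = {ValueType.U32: "<I", ValueType.I32: "<i", ValueType.F32: "<f"}

def encode_pair(key, value_type, value):
    """Serialize one pair as key_length, key, value_type, value."""
    key_bytes = key.encode("ascii")
    out = struct.pack("<I", len(key_bytes)) + key_bytes
    out += struct.pack("<I", value_type)
    if value_type in _FIXED:
        out += struct.pack(_FIXED[value_type], value)
    elif value_type is ValueType.BOOL:
        out += b"\x01" if value else b"\x00"
    else:
        payload = value.encode("utf-8") if value_type is ValueType.STRING else bytes(value)
        out += struct.pack("<I", len(payload)) + payload
    return out

def decode_pairs(buf, count):
    """Read `count` consecutive pairs from buf into a dict keyed by name."""
    pairs, pos = {}, 0
    for _ in range(count):
        (key_len,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        key = buf[pos:pos + key_len].decode("ascii")
        pos += key_len
        (tag,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        value_type = ValueType(tag)  # raises ValueError on an unknown tag
        if value_type in _FIXED:
            (value,) = struct.unpack_from(_FIXED[value_type], buf, pos)
            pos += 4
        elif value_type is ValueType.BOOL:
            if buf[pos] not in (0, 1):
                raise ValueError(f"invalid boolean byte for key {key!r}")
            value = buf[pos] == 1
            pos += 1
        else:
            (length,) = struct.unpack_from("<I", buf, pos)
            pos += 4
            raw = bytes(buf[pos:pos + length])
            pos += length
            value = raw.decode("utf-8") if value_type is ValueType.STRING else raw
        pairs[key] = value
    return pairs

blob = (encode_pair("n_embd", ValueType.I32, 2560)
        + encode_pair("use_parallel_residual", ValueType.BOOL, True))
metadata = decode_pairs(blob, 2)
```

Note the strict boolean check: a byte other than 0 or 1 is rejected rather than treated as true, matching the reasoning in the Boolean entry above.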

Standardized key-value pairs

This list is incomplete. Feel free to suggest additions. Where possible, I've tried to use the original names from the models to remove a layer of semantic confusion.

This is just from a quick appraisal of the models that llm supports. There are likely other fields that we can standardise ahead of time by looking at the HuggingFace config.

General

general.architecture: String: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc. (List more if you can think of them, and they're not just variants of existing architectures!)
general.quantization_version: u32: version of quantization scheme
general.file_type: String: type of the majority of the tensors in the file. This shouldn't have any semantic meaning and should be purely informational, hence the use of String.
general.license: String: SPDX license of the model
general.description: String: information about the model, including provenance
general.original_model_url: String: path to the original model that this GGML file was created from

LLM

llm.context_length: u32: size of the maximum supported context
llm.hidden_size: u32: embedding layer size
llm.num_hidden_layers: u32: number of hidden layers
llm.num_rotary: u32: int(hparams["rotary_pct"]*(hparams["hidden_size"]//hparams["num_attention_heads"]))
llm.use_parallel_residual: bool: whether or not the parallel residual logic should be used
llm.max_seq_len: u32: Maximum sequence length
llm.attention.num_heads: u32: number of attention heads
llm.attention.alibi_bias_max: f32: The maximum bias to use for ALiBi
llm.attention.clip_kqv: f32: not sure
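
For llm.num_rotary, the expression above can be spelled out as a small Python function. The hyperparameter values in the example are illustrative only, chosen to resemble a GPT-NeoX-style config.

```python
def num_rotary(hparams):
    """Number of rotary dimensions: a fraction of the per-head dimension."""
    head_dim = hparams["hidden_size"] // hparams["num_attention_heads"]
    return int(hparams["rotary_pct"] * head_dim)

# Illustrative values: 6144 // 64 = 96 dimensions per head, a quarter of them rotary.
example = {"hidden_size": 6144, "num_attention_heads": 64, "rotary_pct": 0.25}
rotary_dims = num_rotary(example)
```

Here `rotary_dims` comes out as 24, which is the value a converter would store under llm.num_rotary.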

Vocabulary

vocabulary.embedded_size: u32: size of the embedded vocabulary. Zero if there is no embedded vocabulary.
vocabulary.huggingface_tokenizer_json: String: the entirety of the HF tokenizer.json for a given model (e.g. https://huggingface.co/mosaicml/mpt-7b-instruct/blob/main/tokenizer.json). Optional, but highly recommended for best tokenization quality with supported executors.

Future

This is not something we should aim for in the MVP, but ggml now has support for exporting the computation graph. A sample computation graph could be embedded to allow an executor to run the model without having direct support for the architecture.

Migration

The existing migrations have been pretty messy for the ecosystem and for the community. We should try to avoid causing significant upset by providing a migration path.

My suggestion is to switch over all model implementations, including llama.cpp, to GGUF, but offer a very straightforward conversion utility that does not require Python and can convert GGML and GGJTv3 to GGUF with all required information.

If interested, we could also include support for GGJT v1 and v2 using https://github.com/ggerganov/llama.cpp/pull/1504 (although the requantisation process is inherently lossy).

Hopefully, this is the last time we have to bite this bullet. Even if we make breaking changes (like quantization version) again, software consuming GGUF can intelligently decide what to do based on the available information in the hyperparameters.
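
As a sketch of what "intelligently decide" could look like in a consumer, assuming the metadata has already been decoded into a dict: the supported set and the maximum quantization version below are hypothetical values for an imaginary executor, and treating a missing quantization version as 0 is also my assumption.

```python
SUPPORTED_ARCHITECTURES = {"llama", "gpt-neox"}  # hypothetical executor
MAX_QUANTIZATION_VERSION = 2                      # hypothetical

class UnsupportedModel(Exception):
    """Raised when a file is well-formed but this executor cannot run it."""

def check_model(metadata):
    """Fail early with a clear message instead of misreading hyperparameters."""
    arch = metadata.get("general.architecture")
    if arch is None:
        raise UnsupportedModel("file does not declare general.architecture")
    if arch not in SUPPORTED_ARCHITECTURES:
        raise UnsupportedModel(f"architecture {arch!r} is not supported")
    version = metadata.get("general.quantization_version", 0)
    if version > MAX_QUANTIZATION_VERSION:
        raise UnsupportedModel(f"quantization version {version} is too new")
    return arch
```

The point is that the failure mode becomes an error message naming the architecture, rather than garbage output or a crash.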

New model architectures can use GGUF without any additional work, so no breaking changes should be necessary there, either.

Conversion of Python models to GGUF

Ideally, all of the existing convert-h5-to-ggml.py and convert.py scripts can be entirely deprecated. Instead, there is one script that takes an arbitrary HuggingFace model and converts it to a compatible GGUF file. This vastly reduces the maintenance burden and makes it simpler to action changes across the ecosystem when necessary.

cc @ggerganov @LostRuins @KerfuffleV2 @LLukas22 @TheBloke @iacore @comex and others who work with GGML models

May 31 '23 11:05 philpax

technically speaking, we also had a GGMFv1, the one before the memory mapped GGJTv1

May 31 '23 11:05 Green-Sky

there is also the new .ggml wip file, which contains the computation graph. https://github.com/ggerganov/ggml/commit/3b697a2264c5dd132abb3268f6b1091536f3f9ff

May 31 '23 11:05 Green-Sky

Wonderful, I thought I would go for safetensors but they are not really willing to extend the spec for quantized dtypes. Obviously, I wanted to avoid GGML because there was no spec. If there is a spec, I am all in for this.

BTW: I would also be super-excited about the graph-saving/loading, I thought I would refactor my logic to be agnostic over Model vs. Graph, because both just need input and have output, so for inference, it shouldn't matter if I have a graph or a real model. (where model can be for example fine-tuned or something, but graph can be only evaluated)

May 31 '23 11:05 cztomsik

This is the first step toward a unified LLM API and interface that would handle any supported architecture.

https://github.com/ggerganov/llama.cpp/issues/1602#issuecomment-1568215353 https://github.com/ggerganov/ggml/issues/185 https://github.com/ggerganov/ggml/pull/145#issuecomment-1544733902

May 31 '23 12:05 klosax

Their length is stored before the key. They have a maximal length of 256

255 makes more sense if you're going to use a byte to store the length. Unless you want to always add +1 to that value.

general.architecture: String: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc.

It might make more sense to make something like general.type which could be ggml and then put GGML-specific stuff under ggml like ggml.architecture: String. That way there would be the possibility of using this container format for non-GGML models.

If you're going to go through a bunch of trouble designing a model container format, it seems like it would make sense to make it something that could just generally be used.

That would also mean tools that manipulate it wouldn't really have to care if it was GGML or some other type of model.

llm.num_hidden_layers: u32: number of hidden layers

and similar - Why not use the architecture as the base key? A different architecture of model isn't necessarily going to have hidden layers, rotary, etc. It might have its own stuff. Just as an example, RWKV models don't even have attention heads.

So instead, you'd have llama.num_rotary.

Did I miss it or does this not really describe how/where the actual tensors get defined? I actually like the SafeTensors approach quite a bit where the metadata just defines position and length. The only thing I'd change from that is adding a requirement that the tensor data has to start aligned.

You'd want to pick an alignment that is pretty future proof and works for most CPU architectures and most types. What is it GGML uses internally, 64 bytes? That wastes a little space but it's not enough to really matter.
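
For what it's worth, the padding arithmetic is small. Here is a Python sketch that rounds each tensor's start up to a power-of-two alignment and records (offset, size) the way SafeTensors-style metadata would; the function names are illustrative, and the alignment is left as a parameter since the right value is the open question here.

```python
def align_offset(offset, alignment):
    """Round offset up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)

def layout(tensor_sizes, start, alignment):
    """Return (offset, size) for each tensor, padding so every start is aligned."""
    entries, pos = [], start
    for size in tensor_sizes:
        pos = align_offset(pos, alignment)
        entries.append((pos, size))
        pos += size
    return entries

# Three tensors of 10, 100 and 7 bytes, with tensor data starting at byte 5.
entries = layout([10, 100, 7], start=5, alignment=32)
```

In this toy case the tensors land at offsets 32, 64 and 192, so the waste is at most `alignment - 1` bytes per tensor.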

May 31 '23 12:05 KerfuffleV2

This is the first step toward a unified LLM API and interface that would handle any supported architecture.

ggerganov/llama.cpp#1602 (comment) #185 #145 (comment)

Yep! We already implement this in llm in the Rust world, but we'd love to see upstream support for this and to begin consolidating the various examples into a cohesive framework so that we can all benefit.

Their length is stored before the key. They have a maximal length of 256

255 makes more sense if you're going to use a byte to store the length. Unless you want to always add +1 to that value.

Sure. I wasn't thinking about using a byte for the length, but that's entirely reasonable.

general.architecture: String: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc.

It might make more sense to make something like general.type which could be ggml and then put GGML-specific stuff under ggml like ggml.architecture: String. That way there would be the possibility of using this container format for non-GGML models.

If you're going to go through a bunch of trouble designing a model container format, it seems like it would make sense to make it something that could just generally be used.

That would also mean tools that manipulate it wouldn't really have to care if it was GGML or some other type of model.

I'm not opposed, but I'd like to see a motivating case first. I think this is most likely to be implemented by all parties if we can agree on a reasonable extension from the original format.

llm.num_hidden_layers: u32: number of hidden layers

and similar - Why not use the architecture as the base key? A different architecture of model isn't necessarily going to have hidden layers, rotary, etc. It might have its own stuff. Just as an example, RWKV models don't even have attention heads.

So instead, you'd have llama.num_rotary.

No particular reason. I saw a commonality and merged them; if people decide using the architecture as base key makes more sense, I'm happy to go with that.

Did I miss it or does this not really describe how/where the actual tensors get defined? I actually like the SafeTensors approach quite a bit where the metadata just defines position and length. The only thing I'd change from that is adding a requirement that the tensor data has to start aligned.

You'd want to pick an alignment that is pretty future proof and works for most CPU architectures and most types. What is it GGML uses internally, 64 bytes? That wastes a little space but it's not enough to really matter.

Yeah, this just uses the current GGJTv3 scheme in the interest of minimising the amount of work required to migrate to the format. No opposition to moving to a ST-like format from me, but I also don't feel particularly strongly about it. What does everyone else think?

May 31 '23 12:05 philpax

general.architecture: String: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc.

It might make more sense to make something like general.type which could be ggml and then put GGML-specific stuff under ggml like ggml.architecture: String. That way there would be the possibility of using this container format for non-GGML models.

Even better (the value sets the key):

general.type = ggml ggml.type = llm llm.architecture = llama llama.num_rotary

May 31 '23 12:05 klosax

@klosax i'd say the ggml magic would take care of that - ideally non-ggml formats shouldn't be using it as a container format. No need to over engineer it (my 2c).

May 31 '23 13:05 LostRuins

Could we also include some optional generation parameters that contain default values for some sampling parameters? Or would that be too specific?

May 31 '23 13:05 LLukas22

I would recommend including stuff that's mainly essential for loading the model - things that are required for proper functioning. Samplers are technically not even dependent on the model - user is free to do with the output logits as they please.

May 31 '23 13:05 LostRuins

Agree with sampling parameters not being essential (especially since you can use whatever sampler you want with whatever model.)

That being said, that reminds me - it might be a good idea to include suggested prompt formats as one of the standardised config parameters. Feel free to 👍 or 👎 this post if you think that's too extra.

May 31 '23 16:05 philpax

Hmm I think that will be fine as an optional parameter, but not as a standard parameter. Standard params should be stuff that are required for loading correctly, like use_parallel_residual and quantization types.

Also prompt formats may not even make sense universally (they're kind of an instruct model thing). I have a model trained on literature, it has no prompt format, it just spews out prose. I also have another model that just generates long sequences of increasing numbers. Likewise... base llama has no prompt format.

May 31 '23 16:05 LostRuins

Is it possible that something like this could be useful https://github.com/khonsulabs/pot? It seems like at a high-level this discussion revolves around the best way to construct a self-describing data format, which is a problem that I think has already been addressed to a certain extent.

More ideas here: https://github.com/yasammez/nachricht#prior-art

May 31 '23 16:05 danforbes

Hmm I think that will be fine as an optional parameter, but not as a standard parameter.

Yes. Sorry, to clarify, when I say "standard" I don't mean they should be included in all models. It's just that if you do add a prompt format, you should call it something we've declared here, so that things that expect it know what to look for.

I'll go through the list of k-v pairs up there to clarify which ones are required and which ones are standardised-in-name but otherwise optional, but I'll wait for feedback on the rest of the proposal first.

Is it possible that something like this could be useful https://github.com/khonsulabs/pot? It seems like at a high-level this discussion revolves around the best way to construct a self-describing data format, which is a problem that I think has already been addressed to a certain extent.

Normally, yeah, I'd just use a self-descriptive standard format. However, GGML/llama.cpp aim to be as dependency-free as possible, so something moderately bespoke but not too complex is more likely to be accepted by the wider community.

May 31 '23 16:05 philpax

GGML/llama.cpp aim to be as dependency-free as possible

Adopting a format or specification doesn't necessarily mean taking on any new dependencies, and it would allow for greater focus to be placed on the "secret sauce", which I think will be the standardized key/value pairs and what they are meant to specify.

May 31 '23 16:05 danforbes

That being said, that reminds me - it might be a good idea to include suggested prompt formats as one of the standardised config parameters. Feel free to +1 or -1 this post if you think that's too extra.

I think this will be needed to run inference in instruction mode on any instruction-tuned model. A key naming which standardized prompt format to use may be enough; if the key is missing, no instruction-mode inference would be available. For example: llm.instruct_format = alpaca

May 31 '23 16:05 klosax

I recommend extending safetensors. Only libggml needs to load the model correctly anyways. See original discussion here: https://github.com/rustformers/llm/discussions/143

What extensions we need

include GGML types like ggml_q4_0
include hyperparams and vocab in metadata (this is already in spec)

May 31 '23 17:05 iacore

I recommend extending safetensors.

Considering that the safetensors project already answers the question "Yet another format?", I think this is unambiguously the right thing to do.

May 31 '23 18:05 danforbes

Considering that the safetensors project already answers the question

You'd have to fork it to do that, they don't seem interested in extending it. Based on existing discussion, it seems like they want to lock it down and reduce its extensibility further by, for example, forbidding gaps between tensors (even though the format currently would allow it since the metadata only says where tensors start and their length).

May 31 '23 18:05 KerfuffleV2

You'd have to fork it to do that, they don't seem interested in extending it.

I disagree. The format itself is very simple. The huggingface parser is not that good, and we need to write the parser in C (for ggml) anyways. The safetensors format is just a format. If we get enough people to use our version, then our version becomes the "official" one.

May 31 '23 18:05 iacore

I agree with all of the above, but that's basically what I'd call "forking" it. Taking that project and basing another one on it that takes a different approach, has different requirements, sets different restrictions, etc.

quick edit: Probably also should add: While I'm not really a fan of the direction they seem to have chosen, I personally wouldn't use the approach of forcefully trying to take control away. If it was me, I'd start with the SafeTensors format but call it something different.

May 31 '23 18:05 KerfuffleV2

I agree with Kerfuffle that that would be a non-ideal turn of events and would likely alienate an ecosystem that we should stay on good terms with.

safetensors as a format comes with certain assumptions that we should not singlehandedly override - it will end up causing a similar problem to what we have now, but with someone else's format, and with a lot more bad blood.

In any case, I'd like to request that we keep discussion about switching formats or fundamentally changing the structure of this format out of this issue. Feel free to open another issue.

I'm looking for a solution that solves the immediate issues the ecosystem is encountering at the least cost possible; we're not trying to find the perfect solution here, but the one that enables the most reusability / functionality at the least cost.

This format's not perfect by any means, but it's simple, easy to work with and understand (i.e. can be parsed from C without too much suffering), and more importantly: it powers an existing ecosystem with inertia.

The more complicated we make this change and the more parties we involve, the harder it will be to actually make the change. Let's keep it on track.

May 31 '23 18:05 philpax

we can name this .safetensors-ggml or something.

May 31 '23 18:05 iacore

vocabulary.huggingface_tokenizer_json: String: the entirety of the HF tokenizer.json for a given model .. Optional, but highly recommended for best tokenization quality with supported executors.

Why would json give a higher quality than the current layout?

Some models don't have a tokenizer.json; Replit uses spiece.model. How should such a vocab be handled?

To support any vocab, maybe a key like vocabulary.encoding (defaulting to utf-8) would be needed?

May 31 '23 19:05 klosax

vocabulary.huggingface_tokenizer_json: String: the entirety of the HF tokenizer.json for a given model .. Optional, but highly recommended for best tokenization quality with supported executors.

This makes the tokenizer config less portable. The tokenizer file is usually loaded by an external library from a file.

May 31 '23 19:05 iacore

vocabulary.huggingface_tokenizer_json: String: the entirety of the HF tokenizer.json for a given model .. Optional, but highly recommended for best tokenization quality with supported executors.

Why would json give a higher quality than the current layout?

Some models don't have a tokenizer.json; Replit uses spiece.model. How should such a vocab be handled?

To support any vocab, maybe a key like vocabulary.encoding (defaulting to utf-8) would be needed?

Good question. For context: llm has support for using tokenizers directly, so we can load a tokenizer.json (which seems common for the models we support). That JSON file has a lot of specifics about tokenization that aren't captured in the (token, score) embedded vocabulary.

I wasn't aware of the existence of other ways to store the tokenization data, and I'd have to look into it. Do you have any further information about it that I could look into?

To support any vocab, maybe a key like vocabulary.encoding (defaulting to utf-8) would be needed?

Is encoding the only thing that can diverge? I'll admit I am not too across the nuances here - my understanding is that the HF models have their complex tokenizers, and then the Python conversion scripts load those in and extract (token, score) tuples that a GGML executor can use to tokenize a string, except it may not account for all of the complexities of the original tokenizer.

This makes the tokenizer config less portable. The tokenizer file is usually loaded by an external library from a file.

Yes, that's why it's optional. The (token, score) scheme can still be used, but I'd like for users to be able to use the original HF tokenizers out of the box if possible.
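
For reference, the (token, score) scheme is simple to read, which is part of its appeal. Here is a Python sketch assuming the layout described in the overview (u32 length, raw token bytes, then an f32 score for GGMF/GGJT); `read_vocab` is a made-up name, and the token count is assumed to come from the hyperparameters.

```python
import struct

def read_vocab(buf, n_vocab, with_scores=True):
    """Read n_vocab length-prefixed tokens, each optionally followed by an f32 score."""
    vocab, pos = [], 0
    for _ in range(n_vocab):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        # Keep tokens as bytes: a single token is not guaranteed to be valid UTF-8.
        token = bytes(buf[pos:pos + length])
        pos += length
        score = None
        if with_scores:
            (score,) = struct.unpack_from("<f", buf, pos)
            pos += 4
        vocab.append((token, score))
    return vocab

sample = (struct.pack("<I", 2) + b"he" + struct.pack("<f", -1.5)
          + struct.pack("<I", 1) + b"y" + struct.pack("<f", 0.0))
tokens = read_vocab(sample, 2)
```

Nothing in this layout carries merges, added tokens, or normalization rules, which is the information a tokenizer.json would add.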

May 31 '23 19:05 philpax

I thoroughly support any effort to produce a new format which will be future-proof and will protect against any more breaking changes.

I know it's probably not on the cards but what I would really love is if this change would eventually lead to llama.cpp being able to load any GGML model, like GPTJ, MPT, etc. If that's not being considered then at least if a standardised format would allow for non-compatible clients to inform the user that this is not a supported model then that would help a lot.

The idea of using safetensors sounds smart, although if it is used I think it'd be ideal to change the name for this fork of safetensors. safeggml perhaps. Otherwise I am envisaging a lot more support requests along the lines of "I downloaded the GPTQ, why won't it work in llama.cpp - they're both safetensors?"

I really like the idea of an embedded prompt template. Users are asking more and more for prompt templates to be communicated. Having that in the format itself sounds like a great idea.

I have a feature request of my own: multi-part files. It'd be really helpful if this change could bring back support for multi-part GGML files. safetensors would support that natively I guess. This would be useful because of the Hugging Face Hub limit of 50GB per file, which prevents uploading 65B q8_0 models unless they're uploaded eg as a multi-part ZIP, which is messy and extra work for uploader and user alike. I could also imagine that in the future we might see some new larger models - perhaps a Falcon 80B for example - which might similarly not be possible to upload in the higher quant sizes. Multi-part GGML would solve that neatly.

Great work, hope this gets implemented!

May 31 '23 20:05 TheBloke

I wasn't aware of the existence of other ways to store the tokenization data, and I'd have to look into it. Do you have any further information about it that I could look into?

Replit is implemented here. Look at the conversion script. It needs a special tokenizer implemented in main.cpp

In the MPT example you can see what had to be done to correctly encode (in convert script) and decode (in main.cpp) the gpt-neox vocab.

Maybe the vocabs that are not json could be converted to it when creating the gguf file?

May 31 '23 20:05 klosax

I thoroughly support any effort to produce a new format which will be future-proof and will protect against any more breaking changes.

Awesome! Yeah, I figured you might have a stake in this 😂

I know it's probably not on the cards but what I would really love is if this change would eventually lead to llama.cpp being able to load any GGML model, like GPTJ, MPT, etc. If that's not being considered then at least if a standardised format would allow for non-compatible clients to inform the user that this is not a supported model then that would help a lot.

Agreed, that would be ideal. I left the possibility of this open in the future section:

This is not something we should aim for in the MVP, but ggml now has support for exporting the computation graph. A sample computation graph could be embedded to allow an executor to run the model without having direct support for the architecture.

but I'm not sure how far along the cgraph export/import functionality is, or how stable it is. I figured we can add that as an extension once that's solidified a bit.

I'd be happy just to have llm and friends gracefully fail when the architecture isn't recognised, instead of plowing through and trying to read invalid hyperparameters 😂

The idea of using safetensors sounds smart, although if it is used I think it'd be ideal to change the name for this fork of safetensors. safeggml perhaps. Otherwise I am envisaging a lot more support requests along the lines of "I downloaded the GPTQ, why won't it work in llama.cpp - they're both safetensors?"

100% agreed - we were bouncing around ST support a couple months ago for llm, but one of my primary concerns is that we'd encourage users to seek out non-GGML-augmented ST models and get confused by those not working. An extension change might work, but we'd still have to set up our own pipelines for doing so and we'd still be creating a format that wouldn't be compatible.

I'm not opposed to the use of safetensors (we're likely to support it in llm at some point, or a variant of it), but it's easier to make GGML fit-for-purpose than to try to repurpose a format that other people are using and doesn't support what we need yet.

I really like the idea of an embedded prompt template. Users are asking more and more for prompt templates to be communicated. Having that in the format itself sounds like a great idea.

Glad to hear it. Do you have any suggestions for what that might look like/what needs to be supported?

I have a feature request of my own: multi-part files. It'd be really helpful if this change could bring back support for multi-part GGML files.

Aaahhh, I did think about this but I'm not sure about it. I feel like that's conflating a distribution concern with a deployment concern; do you think you'd still need this if it weren't for the HF limit? Would it be a significant improvement over uploading multipart ZIPs?

Replit is implemented here. Look at the conversion script. It needs a special tokenizer implemented in main.cpp

Ah... I see... they have a custom sentencepiece tokenizer. Yeah, not sure how to best handle that. @Narsil is that something tokenizers can support and/or be a part of tokenizer.json?

May 31 '23 21:05 philpax

If a major file format change is going to happen again, the tokenizer configs for the models using huggingface tokenizers' BPE/GPT-2-like tokenizers ought to be improved (i.e. all but the SentencePiece ones, which I think are less broken, but I haven't looked into them as much). All the formats that only store the vocab list and not the merges, and have no way of identifying the "additional" tokens, are unfortunately incomplete.

When encoding they should, after doing the "pretokenizing" stage with the regex, merge bigrams in the order they occur in the merges list, which will not necessarily get the same result as just taking the longest matching token. The logic in minGPT's implementation of GPT2's tokenizer is a good reference: https://github.com/karpathy/minGPT/blob/master/mingpt/bpe.py#L95
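
A toy Python sketch of the difference, not taken from minGPT or tokenizers: `bpe_merge` applies merges in the order of the merges list, while `longest_match` greedily takes the longest vocab entry. The two-entry merges list and the tiny vocab are invented for the example.

```python
def bpe_merge(symbols, merges):
    """Repeatedly merge the adjacent pair that appears earliest in the merges list."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    parts = list(symbols)
    while len(parts) > 1:
        candidates = [p for p in zip(parts, parts[1:]) if p in ranks]
        if not candidates:
            break
        best = min(candidates, key=ranks.__getitem__)
        merged, i = [], 0
        while i < len(parts):
            # Merge every occurrence of the best pair, scanning left to right.
            if i + 1 < len(parts) and (parts[i], parts[i + 1]) == best:
                merged.append(parts[i] + parts[i + 1])
                i += 2
            else:
                merged.append(parts[i])
                i += 1
        parts = merged
    return parts

def longest_match(text, vocab):
    """Naive tokenizer: always take the longest vocab entry at the current position."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return out

merges = [("b", "c"), ("a", "b")]      # "bc" was learned before "ab"
vocab = {"a", "b", "c", "bc", "ab"}
by_merges = bpe_merge("abc", merges)
by_longest = longest_match("abc", vocab)
```

Here `by_merges` is `["a", "bc"]` while `by_longest` is `["ab", "c"]`: same vocab, different token sequence, which is why the merges list needs to be stored.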

Tokens added after "training", the ones in the "added_tokens" section of tokenizer.json need to be handled separately - see this comment in tokenizers for an explanation: https://github.com/huggingface/tokenizers/blob/cb819724eff2769aa1211b0f296649ceb502ccc4/tokenizers/src/tokenizer/added_vocabulary.rs#L130-L140

Lastly, to totally match the behavior of tokenizers, unicode normalization is required - I think most models settled on NFC form but tokenizers supports all of them.
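
In Python the normalization step is a single standard-library call; a minimal sketch, with NFC as the default only because that is the form mentioned above (the wrapper name is illustrative):

```python
import unicodedata

def normalize_for_tokenizer(text, form="NFC"):
    """Apply Unicode normalization before pretokenizing; form may be NFC, NFD, NFKC or NFKD."""
    return unicodedata.normalize(form, text)

decomposed = "e\u0301"  # 'e' followed by a combining acute accent
composed = normalize_for_tokenizer(decomposed)
```

Without this step, the composed and decomposed spellings of the same visible character would tokenize differently. A C/C++ executor needs its own tables or a library such as ICU for this, which is the dependency concern raised below.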

I have a C++ implementation of enough of that to correctly encode ChatML prompts as used by MPT-7B-Chat at https://github.com/apage43/bpe.cpp but it depends on ICU for two things, the unicode normalization, which might be possible to live without, and the pretokenizing regex being unicode-aware when splitting on "letter" characters, which is somewhat important for handling non-English text.

May 31 '23 23:05 apage43
