Bug: cannot find tokenizer merges in model file
What happened?
When I use transformers==4.45.1 and convert a model with llama.cpp to the GGUF file used by ollama, the conversion completes without error, but when I load the model with ollama, the error `ollama cannot find tokenizer merges in model file` appears.
Name and Version
All versions
What operating system are you seeing the problem on?
No response
Relevant log output
No response
Same problem here.
In gguf-py/gguf/vocab.py, `add_to_gguf(self, gw: GGUFWriter, quiet: bool = False) -> None` reports: `Adding merges requested but no merges found, output may be non-functional.`
And in the `_try_load_from_tokenizer_json` function:

    def _try_load_from_tokenizer_json(self, path: Path) -> bool:
        tokenizer_file = path / 'tokenizer.json'
        if tokenizer_file.is_file():
            with open(tokenizer_file, encoding='utf-8') as f:
                tokenizer = json.load(f)
            if self.load_merges:
                merges = tokenizer.get('model', {}).get('merges')
                if isinstance(merges, list) and merges and isinstance(merges[0], str):
                    self.merges = merges
            added_tokens = tokenizer.get('added_tokens', {})

The problem is the check `isinstance(merges[0], str)`.
transformers==4.45.1 generates a tokenizer.json in which each merge is a list, not a string, so this check rejects it. Could the converter be made compatible with the new format?
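The rejected-format behavior can be reproduced with a small sketch. The guard below mirrors the condition quoted above from vocab.py; the sample merge values are made up for illustration:

```python
# Minimal reproduction of the guard in gguf-py/gguf/vocab.py that drops
# the new merges format. Sample merge values are made up.

old_style = {"model": {"merges": ["h e", "he llo"]}}            # pre-4.45 tokenizers
new_style = {"model": {"merges": [["h", "e"], ["he", "llo"]]}}  # 4.45+ tokenizers

def merges_accepted(tokenizer: dict) -> bool:
    # Same condition as the converter: merges must be a non-empty
    # list whose first element is a string.
    merges = tokenizer.get("model", {}).get("merges")
    return isinstance(merges, list) and bool(merges) and isinstance(merges[0], str)

print(merges_accepted(old_style))  # True
print(merges_accepted(new_style))  # False: merges silently ignored
```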
Hey hey, I'm VB from the open source team at Hugging Face. I can confirm that this is due to an update we made to tokenizers: we now persist merges as lists instead of strings.
Everything should work on transformers 4.44.0; from 4.45.0 onward it won't, and support for the new format would need to be added.
For reference this is the tokenizers PR that introduced it: https://github.com/huggingface/tokenizers/pull/909
There are some temporary fixes for Unsloth users that downgrade transformers to 4.44.2: https://github.com/unslothai/unsloth/issues/1065 and https://github.com/unslothai/unsloth/issues/1062
Tagging @compilade for any insights how to best resolve this.
A couple of repos for testing:
- This is a Qwen model that was exported from transformers 4.45 and therefore uses the new tokenizer serialization format.
- This one is just a converted Llama 3.2 tokenizer.
The difference is the way merges are serialized in the tokenizer.json file. Each merge pair used to be a single string, with a space separating the two tokens, but now each pair is saved as an array.
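As an illustration (with made-up merge values, not taken from the linked repos), the two serializations look like this:

```python
import json

# Old tokenizers format: each merge pair is one space-separated string.
old_fmt = {"model": {"merges": ["h e", "l l", "he ll"]}}
# New tokenizers format: each merge pair is a two-element array.
new_fmt = {"model": {"merges": [["h", "e"], ["l", "l"], ["he", "ll"]]}}

print(json.dumps(old_fmt["model"]["merges"]))  # ["h e", "l l", "he ll"]
print(json.dumps(new_fmt["model"]["merges"]))  # [["h", "e"], ["l", "l"], ["he", "ll"]]
```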
@pcuenca Thanks, I can confirm that if I update to transformers 4.45, the conversion of these models succeeds using convert_hf_to_gguf.py. Without upgrading, it fails.
I wonder, should we try to find a way to make convert_hf_to_gguf.py work with pre-4.45 or should we just prompt the user to upgrade their transformers? The latter seems the obvious solution to me, but I could be missing something.
In my opinion, upgrading transformers is easier.
Opened a PR to update the transformers version in the short term: https://github.com/ggerganov/llama.cpp/pull/9694 (the CI errors look like warnings; not sure what to do about them)
We tested it with the new format and the old format:
- (new tokenisers format) https://huggingface.co/pcuenq/Qwen2.5-0.5B-Instruct-with-new-merges-serialization-Q8_0-GGUF
- (old tokenisers format) https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-Q8_0-GGUF
Upgrading to transformers 4.45 likely isn't enough; gguf.SpecialVocab(dir_model, load_merges=True) only works with the old format while silently ignoring everything else:
https://github.com/ggerganov/llama.cpp/blob/8277a817f18967581b02b2248989d773e8e99998/gguf-py/gguf/vocab.py#L123-L126
> I wonder, should we try to find a way to make convert_hf_to_gguf.py work with pre-4.45 or should we just prompt the user to upgrade their transformers?
Supporting the new format with older versions of transformers would require avoiding AutoTokenizer.from_pretrained and/or falling back to fully manual parsing of tokenizer.json. But that would not work with the current pre-tokenizer autodetection, which relies on tokenizing strings.
So transformers has to be updated to 4.45, and gguf-py/gguf/vocab.py needs to be adapted to the new serialization, as in #9696.
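A compatibility shim along these lines could accept both serializations. This is only a sketch of the idea, not the actual #9696 patch; the function name `normalize_merges` is hypothetical:

```python
from typing import Any

def normalize_merges(tokenizer: dict[str, Any]) -> list[str]:
    """Return merges as space-joined strings, accepting both the old
    string format ("a b") and the new two-element-list format (["a", "b"])."""
    merges = tokenizer.get("model", {}).get("merges")
    if not isinstance(merges, list) or not merges:
        return []
    if isinstance(merges[0], str):   # old format: already space-joined
        return merges
    if isinstance(merges[0], list):  # new format: join each pair with a space
        return [f"{pair[0]} {pair[1]}" for pair in merges]
    return []

print(normalize_merges({"model": {"merges": [["h", "e"], ["he", "llo"]]}}))
# ['h e', 'he llo']
```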
This should be resolved now. @nd791899, please close the issue if it is resolved for you.