Bug: cannot find tokenizer merges in model file
What happened?
When I use transformers==4.45.1 and convert a model with llama.cpp to the GGUF file used by ollama, the conversion completes without error, but when I load the model with ollama, the error `ollama cannot find tokenizer merges in model file` appears.
Name and Version
All versions
What operating system are you seeing the problem on?
No response
Relevant log output
No response
Same problem here.
In gguf-py/gguf/vocab.py, `add_to_gguf(self, gw: GGUFWriter, quiet: bool = False) -> None` reports: `Adding merges requested but no merges found, output may be non-functional.`
And in the `_try_load_from_tokenizer_json` function:

    def _try_load_from_tokenizer_json(self, path: Path) -> bool:
        tokenizer_file = path / 'tokenizer.json'
        if tokenizer_file.is_file():
            with open(tokenizer_file, encoding='utf-8') as f:
                tokenizer = json.load(f)
            if self.load_merges:
                merges = tokenizer.get('model', {}).get('merges')
                if isinstance(merges, list) and merges and isinstance(merges[0], str):
                    self.merges = merges
            added_tokens = tokenizer.get('added_tokens', {})

The problem is the check `isinstance(merges[0], str)`.
transformers==4.45.1 generates a tokenizer.json in which each merge is a list, not a string, so this check rejects it. Could the converter be made compatible with the new format?
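The rejected-format behavior can be reproduced with a small sketch. The guard below mirrors the condition quoted above from vocab.py; the sample merge values are made up for illustration:

```python
# Minimal reproduction of the guard in gguf-py/gguf/vocab.py that drops
# the new merges format. Sample merge values are made up.

old_style = {"model": {"merges": ["h e", "he llo"]}}            # pre-4.45 tokenizers
new_style = {"model": {"merges": [["h", "e"], ["he", "llo"]]}}  # 4.45+ tokenizers

def merges_accepted(tokenizer: dict) -> bool:
    # Same condition as the converter: merges must be a non-empty
    # list whose first element is a string.
    merges = tokenizer.get("model", {}).get("merges")
    return isinstance(merges, list) and bool(merges) and isinstance(merges[0], str)

print(merges_accepted(old_style))  # True
print(merges_accepted(new_style))  # False: merges silently ignored
```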
Hey hey, I'm VB from the open source team at Hugging Face. I can confirm that this is due to an update we made to tokenizers: we now persist merges as lists instead of strings.
Everything should work on transformers 4.44.0; from 4.45.0 onward it won't, and support for the new format would need to be added.
For reference this is the tokenizers PR that introduced it: https://github.com/huggingface/tokenizers/pull/909
There are some temporary fixes for Unsloth users that downgrade transformers to 4.44.2: https://github.com/unslothai/unsloth/issues/1065 and https://github.com/unslothai/unsloth/issues/1062
Tagging @compilade for any insights how to best resolve this.
A couple of repos for testing:
- This is a Qwen model that was exported from transformers 4.45 and therefore uses the new tokenizer serialization format.
- This one is just a converted Llama 3.2 tokenizer.
The difference is the way merges are serialized in the tokenizer.json file. Each merge pair used to be a single string, with a space separating the two tokens, but now each pair is saved as an array.
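As an illustration (with made-up merge values, not taken from the linked repos), the two serializations look like this:

```python
import json

# Old tokenizers format: each merge pair is one space-separated string.
old_fmt = {"model": {"merges": ["h e", "l l", "he ll"]}}
# New tokenizers format: each merge pair is a two-element array.
new_fmt = {"model": {"merges": [["h", "e"], ["l", "l"], ["he", "ll"]]}}

print(json.dumps(old_fmt["model"]["merges"]))  # ["h e", "l l", "he ll"]
print(json.dumps(new_fmt["model"]["merges"]))  # [["h", "e"], ["l", "l"], ["he", "ll"]]
```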
@pcuenca Thanks, I can confirm that if I update to transformers 4.45, the conversion of these models succeeds using convert_hf_to_gguf.py. Without upgrading, it fails.
I wonder, should we try to find a way to make convert_hf_to_gguf.py work with pre-4.45 or should we just prompt the user to upgrade their transformers? The latter seems the obvious solution to me, but I could be missing something.
In my opinion, upgrading transformers is easier.
Opened a PR to update the transformers version in the short term: https://github.com/ggerganov/llama.cpp/pull/9694 (the CI errors look like warnings; not sure what to do about them)
We tested it with the new format and the old format:
- (new tokenisers format) https://huggingface.co/pcuenq/Qwen2.5-0.5B-Instruct-with-new-merges-serialization-Q8_0-GGUF
- (old tokenisers format) https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-Q8_0-GGUF
Upgrading to transformers 4.45 likely isn't enough; gguf.SpecialVocab(dir_model, load_merges=True) only works with the old format while silently ignoring everything else:
https://github.com/ggerganov/llama.cpp/blob/8277a817f18967581b02b2248989d773e8e99998/gguf-py/gguf/vocab.py#L123-L126
> I wonder, should we try to find a way to make convert_hf_to_gguf.py work with pre-4.45 or should we just prompt the user to upgrade their transformers?
Supporting the new format with older versions of transformers would require avoiding AutoTokenizer.from_pretrained and/or falling back to fully manual parsing of tokenizer.json. But that would not work with the current pre-tokenizer autodetection, which relies on tokenizing strings.
So transformers has to be updated to 4.45, and gguf-py/gguf/vocab.py needs to be adapted to the new serialization, as in #9696.
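A compatibility shim along these lines could accept both serializations. This is only a sketch of the idea, not the actual #9696 patch; the function name `normalize_merges` is hypothetical:

```python
from typing import Any

def normalize_merges(tokenizer: dict[str, Any]) -> list[str]:
    """Return merges as space-joined strings, accepting both the old
    string format ("a b") and the new two-element-list format (["a", "b"])."""
    merges = tokenizer.get("model", {}).get("merges")
    if not isinstance(merges, list) or not merges:
        return []
    if isinstance(merges[0], str):   # old format: already space-joined
        return merges
    if isinstance(merges[0], list):  # new format: join each pair with a space
        return [f"{pair[0]} {pair[1]}" for pair in merges]
    return []

print(normalize_merges({"model": {"merges": [["h", "e"], ["he", "llo"]]}}))
# ['h e', 'he llo']
```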
This should be resolved now. @nd791899, please close the issue if it is resolved for you.