The behavior of the tokenizer loaded from a GGUF file is incorrect.
System Info
- transformers version: 4.42.0.dev0
- Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.27
- Python version: 3.11.9
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.3
- Accelerate version: 0.30.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script: No
- Using GPU in script: No
- GPU type: NVIDIA RTX A6000
Who can help?
@ArthurZucker @younesbelkada
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I installed transformers from https://github.com/huggingface/transformers/pull/30391#issuecomment-2158719891 :
pip install git+https://github.com/younesbelkada/transformers.git@fix-llama-3-gguf-2
because the newest released version, v4.41.2, cannot load the tokenizer from a GGUF file correctly.
Here is my code:
from transformers import AutoTokenizer
model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
# the text is a slice from load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
text = "Traditional Chinese literary criticism emphasized the life of the author when interpreting a work"
encodings_1 = tokenizer.encode(text)
print(encodings_1)
print(tokenizer.decode(encodings_1))
The output is:
[128000, 14860, 223, 85782, 10634, 223, 46023, 10634, 223, 69191, 661, 10634, 223, 38096, 42914, 10634, 223, 336, 51480, 1534, 10634, 223, 1820, 10634, 223, 14789, 10634, 223, 1073, 10634, 223, 1820, 10634, 223, 3170, 10634, 223, 9493, 10634, 223, 75814, 1303, 10634, 223, 64, 10634, 223, 1816]
<|begin_of_text|> āTraditionalāChineseāliteraryācriticismāemphasizedātheālifeāofātheāauthorāwhenāinterpretingāaāwork
The output of decode() should be identical to the text, shouldn't it? I also tried to encode the same text using llama-cpp-python 0.2.79 and the same model:
from llama_cpp import Llama
model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
llm_lcp_from_hf = Llama.from_pretrained(repo_id=model_id, filename=filename)
encodings_lcp = llm_lcp_from_hf.tokenize(text.encode('utf-8'))
print(encodings_lcp)
print(llm_lcp_from_hf.detokenize(encodings_lcp).decode('utf-8'))
The output is correct:
[85782, 8620, 32465, 19347, 46728, 279, 2324, 315, 279, 3229, 994, 66744, 264, 990]
Traditional Chinese literary criticism emphasized the life of the author when interpreting a work
Expected behavior
The result of decode() should be identical to the raw text.
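In other words, a round trip through encode() and decode() should reproduce the input. A minimal check of that expectation (a sketch reusing the model, file, and text from the reproduction above; skip_special_tokens only drops the leading <|begin_of_text|> token):
from transformers import AutoTokenizer

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
text = "Traditional Chinese literary criticism emphasized the life of the author when interpreting a work"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
ids = tokenizer.encode(text)
# With a correctly converted tokenizer, decoding (minus the BOS special token)
# gives back exactly the original text.
assert tokenizer.decode(ids, skip_special_tokens=True) == text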
Hey! I think this was recently fixed, so installing 4.42.xxx should work. I just tested locally. Make sure to install:
pip install -U transformers
Thank you @ArthurZucker, the tokenizer works well now. However, when I try to save and then reload it, another error occurs: RuntimeError: Internal: could not parse ModelProto from ...
Code:
from transformers import AutoTokenizer
model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
save_dir = '../../deq_models/test'
tokenizer.save_pretrained(save_dir)
tokenizer2 = AutoTokenizer.from_pretrained(save_dir)
Package versions: sentencepiece 0.2.0, transformers 4.42.3. Traceback:
{
"name": "RuntimeError",
"message": "Internal: could not parse ModelProto from ../../deq_models/test/tokenizer.model",
"stack": "---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 tokenizer2 = AutoTokenizer.from_pretrained(save_dir)
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:889, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
885 if tokenizer_class is None:
886 raise ValueError(
887 f\"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported.\"
888 )
--> 889 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
891 # Otherwise we have to be creative.
892 # if model is an encoder decoder, the encoder tokenizer class is used by default
893 if isinstance(config, EncoderDecoderConfig):
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2163, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
2160 else:
2161 logger.info(f\"loading file {file_path} from cache at {resolved_vocab_files[file_id]}\")
-> 2163 return cls._from_pretrained(
2164 resolved_vocab_files,
2165 pretrained_model_name_or_path,
2166 init_configuration,
2167 *init_inputs,
2168 token=token,
2169 cache_dir=cache_dir,
2170 local_files_only=local_files_only,
2171 _commit_hash=commit_hash,
2172 _is_local=is_local,
2173 trust_remote_code=trust_remote_code,
2174 **kwargs,
2175 )
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2397, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
2395 # Instantiate the tokenizer.
2396 try:
-> 2397 tokenizer = cls(*init_inputs, **init_kwargs)
2398 except OSError:
2399 raise OSError(
2400 \"Unable to load vocabulary from file. \"
2401 \"Please check that the provided vocabulary is accessible and not corrupted.\"
2402 )
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama_fast.py:157, in LlamaTokenizerFast.__init__(self, vocab_file, tokenizer_file, clean_up_tokenization_spaces, unk_token, bos_token, eos_token, add_bos_token, add_eos_token, use_default_system_prompt, legacy, add_prefix_space, **kwargs)
154 if add_prefix_space is not None:
155 kwargs[\"from_slow\"] = True
--> 157 super().__init__(
158 vocab_file=vocab_file,
159 tokenizer_file=tokenizer_file,
160 clean_up_tokenization_spaces=clean_up_tokenization_spaces,
161 unk_token=unk_token,
162 bos_token=bos_token,
163 eos_token=eos_token,
164 add_bos_token=add_bos_token,
165 add_eos_token=add_eos_token,
166 use_default_system_prompt=use_default_system_prompt,
167 add_prefix_space=add_prefix_space,
168 legacy=legacy,
169 **kwargs,
170 )
171 self._add_bos_token = add_bos_token
172 self._add_eos_token = add_eos_token
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:131, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
127 kwargs.update(additional_kwargs)
129 elif self.slow_tokenizer_class is not None:
130 # We need to create and convert a slow tokenizer to build the backend
--> 131 slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)
132 fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
133 else:
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py:171, in LlamaTokenizer.__init__(self, vocab_file, unk_token, bos_token, eos_token, pad_token, sp_model_kwargs, add_bos_token, add_eos_token, clean_up_tokenization_spaces, use_default_system_prompt, spaces_between_special_tokens, legacy, add_prefix_space, **kwargs)
169 self.add_eos_token = add_eos_token
170 self.use_default_system_prompt = use_default_system_prompt
--> 171 self.sp_model = self.get_spm_processor(kwargs.pop(\"from_slow\", False))
172 self.add_prefix_space = add_prefix_space
174 super().__init__(
175 bos_token=bos_token,
176 eos_token=eos_token,
(...)
187 **kwargs,
188 )
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py:198, in LlamaTokenizer.get_spm_processor(self, from_slow)
196 tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
197 if self.legacy or from_slow: # no dependency on protobuf
--> 198 tokenizer.Load(self.vocab_file)
199 return tokenizer
201 with open(self.vocab_file, \"rb\") as f:
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/sentencepiece/__init__.py:961, in SentencePieceProcessor.Load(self, model_file, model_proto)
959 if model_proto:
960 return self.LoadFromSerializedProto(model_proto)
--> 961 return self.LoadFromFile(model_file)
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/sentencepiece/__init__.py:316, in SentencePieceProcessor.LoadFromFile(self, arg)
315 def LoadFromFile(self, arg):
--> 316 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from ../../deq_models/test/tokenizer.model"
}
Sorry, that was an accidental action on my part.
cc @itazap
I found that this is caused by setting add_prefix_space=False in GGUFLlamaConverter; in turn, from_slow=True is then forced by #28010. I checked loading from "meta-llama/Meta-Llama-3-8B", and I don't believe the add_prefix_space=False from #30391 is necessary: I checked tokenization, and a prefix space is not added when it is set to None. I can push a fix to change it to add_prefix_space=None (and test!) unless @ArthurZucker sees an issue with this?
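For reference, the kind of check described above might look like this (a sketch; it assumes access to the meta-llama/Meta-Llama-3-8B repo):
from transformers import AutoTokenizer

# With the reference (non-GGUF) Llama-3 tokenizer, no prefix space is added, so the
# first token of the sentence should be "Traditional", not "ĠTraditional".
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
ids = tok("Traditional Chinese", add_special_tokens=False)["input_ids"]
print(tok.convert_ids_to_tokens(ids))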
@Lin-xs as a workaround for now, you can pass add_prefix_space=False like below to avoid the error!
AutoTokenizer.from_pretrained(model_id, gguf_file=filename, add_prefix_space=False)
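Applied to the save-and-reload steps from the reproduction above, the workaround would look roughly like this (a sketch; per the comment above, passing add_prefix_space=False at load time avoids the ModelProto error on reload):
from transformers import AutoTokenizer

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"

# Workaround from above: pass add_prefix_space=False when loading from the GGUF file.
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename, add_prefix_space=False)

save_dir = '../../deq_models/test'  # same directory as in the reproduction above
tokenizer.save_pretrained(save_dir)
tokenizer2 = AutoTokenizer.from_pretrained(save_dir)  # expected to load without the RuntimeError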