The behavior of the tokenizer loaded from a GGUF file is incorrect.
System Info
- transformers version: 4.42.0.dev0
- Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.27
- Python version: 3.11.9
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.3
- Accelerate version: 0.30.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script: No
- Using GPU in script: No
- GPU type: NVIDIA RTX A6000
Who can help?
@ArthurZucker @younesbelkada
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I installed transformers from https://github.com/huggingface/transformers/pull/30391#issuecomment-2158719891 :
pip install git+https://github.com/younesbelkada/transformers.git@fix-llama-3-gguf-2
because the newest released version, v4.41.2, cannot load the tokenizer from a GGUF file correctly.
Here is my code:
from transformers import AutoTokenizer
model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
# the text is a slice from load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
text = "Traditional Chinese literary criticism emphasized the life of the author when interpreting a work"
encodings_1 = tokenizer.encode(text)
print(encodings_1)
print(tokenizer.decode(encodings_1))
The output is:
[128000, 14860, 223, 85782, 10634, 223, 46023, 10634, 223, 69191, 661, 10634, 223, 38096, 42914, 10634, 223, 336, 51480, 1534, 10634, 223, 1820, 10634, 223, 14789, 10634, 223, 1073, 10634, 223, 1820, 10634, 223, 3170, 10634, 223, 9493, 10634, 223, 75814, 1303, 10634, 223, 64, 10634, 223, 1816]
<|begin_of_text|> āTraditionalāChineseāliteraryācriticismāemphasizedātheālifeāofātheāauthorāwhenāinterpretingāaāwork
The output of decode() should be identical to the text, shouldn't it? I also tried to encode the same text using llama-cpp-python 0.2.79 and the same model:
from llama_cpp import Llama
model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
llm_lcp_from_hf = Llama.from_pretrained(repo_id=model_id, filename=filename)
encodings_lcp = llm_lcp_from_hf.tokenize(text.encode('utf-8'))
print(encodings_lcp)
print(llm_lcp_from_hf.detokenize(encodings_lcp).decode('utf-8'))
The output is correct:
[85782, 8620, 32465, 19347, 46728, 279, 2324, 315, 279, 3229, 994, 66744, 264, 990]
Traditional Chinese literary criticism emphasized the life of the author when interpreting a work
Expected behavior
The result of decode() should be identical to the raw text.
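In other words, a round trip through encode() and decode() should reproduce the input. A minimal check of that expectation (a sketch reusing the model, file, and text from the reproduction above; skip_special_tokens only drops the leading <|begin_of_text|> token):
from transformers import AutoTokenizer

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
text = "Traditional Chinese literary criticism emphasized the life of the author when interpreting a work"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
ids = tokenizer.encode(text)
# With a correctly converted tokenizer, decoding (minus the BOS special token)
# gives back exactly the original text.
assert tokenizer.decode(ids, skip_special_tokens=True) == text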
Hey! I think this was recently fixed, so installing 4.42.xxx should work. I just tested locally. Make sure to install:
pip install -U transformers
Thank you @ArthurZucker, the tokenizer works well now. However, when I try to save and then reload it, another error occurs: RuntimeError: Internal: could not parse ModelProto from ...
Code:
from transformers import AutoTokenizer
model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
save_dir = '../../deq_models/test'
tokenizer.save_pretrained(save_dir)
tokenizer2 = AutoTokenizer.from_pretrained(save_dir)
Package versions: sentencepiece 0.2.0, transformers 4.42.3. Traceback:
{
"name": "RuntimeError",
"message": "Internal: could not parse ModelProto from ../../deq_models/test/tokenizer.model",
"stack": "---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 tokenizer2 = AutoTokenizer.from_pretrained(save_dir)
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:889, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
885 if tokenizer_class is None:
886 raise ValueError(
887 f\"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported.\"
888 )
--> 889 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
891 # Otherwise we have to be creative.
892 # if model is an encoder decoder, the encoder tokenizer class is used by default
893 if isinstance(config, EncoderDecoderConfig):
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2163, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
2160 else:
2161 logger.info(f\"loading file {file_path} from cache at {resolved_vocab_files[file_id]}\")
-> 2163 return cls._from_pretrained(
2164 resolved_vocab_files,
2165 pretrained_model_name_or_path,
2166 init_configuration,
2167 *init_inputs,
2168 token=token,
2169 cache_dir=cache_dir,
2170 local_files_only=local_files_only,
2171 _commit_hash=commit_hash,
2172 _is_local=is_local,
2173 trust_remote_code=trust_remote_code,
2174 **kwargs,
2175 )
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2397, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
2395 # Instantiate the tokenizer.
2396 try:
-> 2397 tokenizer = cls(*init_inputs, **init_kwargs)
2398 except OSError:
2399 raise OSError(
2400 \"Unable to load vocabulary from file. \"
2401 \"Please check that the provided vocabulary is accessible and not corrupted.\"
2402 )
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama_fast.py:157, in LlamaTokenizerFast.__init__(self, vocab_file, tokenizer_file, clean_up_tokenization_spaces, unk_token, bos_token, eos_token, add_bos_token, add_eos_token, use_default_system_prompt, legacy, add_prefix_space, **kwargs)
154 if add_prefix_space is not None:
155 kwargs[\"from_slow\"] = True
--> 157 super().__init__(
158 vocab_file=vocab_file,
159 tokenizer_file=tokenizer_file,
160 clean_up_tokenization_spaces=clean_up_tokenization_spaces,
161 unk_token=unk_token,
162 bos_token=bos_token,
163 eos_token=eos_token,
164 add_bos_token=add_bos_token,
165 add_eos_token=add_eos_token,
166 use_default_system_prompt=use_default_system_prompt,
167 add_prefix_space=add_prefix_space,
168 legacy=legacy,
169 **kwargs,
170 )
171 self._add_bos_token = add_bos_token
172 self._add_eos_token = add_eos_token
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:131, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
127 kwargs.update(additional_kwargs)
129 elif self.slow_tokenizer_class is not None:
130 # We need to create and convert a slow tokenizer to build the backend
--> 131 slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)
132 fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
133 else:
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py:171, in LlamaTokenizer.__init__(self, vocab_file, unk_token, bos_token, eos_token, pad_token, sp_model_kwargs, add_bos_token, add_eos_token, clean_up_tokenization_spaces, use_default_system_prompt, spaces_between_special_tokens, legacy, add_prefix_space, **kwargs)
169 self.add_eos_token = add_eos_token
170 self.use_default_system_prompt = use_default_system_prompt
--> 171 self.sp_model = self.get_spm_processor(kwargs.pop(\"from_slow\", False))
172 self.add_prefix_space = add_prefix_space
174 super().__init__(
175 bos_token=bos_token,
176 eos_token=eos_token,
(...)
187 **kwargs,
188 )
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py:198, in LlamaTokenizer.get_spm_processor(self, from_slow)
196 tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
197 if self.legacy or from_slow: # no dependency on protobuf
--> 198 tokenizer.Load(self.vocab_file)
199 return tokenizer
201 with open(self.vocab_file, \"rb\") as f:
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/sentencepiece/__init__.py:961, in SentencePieceProcessor.Load(self, model_file, model_proto)
959 if model_proto:
960 return self.LoadFromSerializedProto(model_proto)
--> 961 return self.LoadFromFile(model_file)
File ~/miniconda3/envs/swq/lib/python3.11/site-packages/sentencepiece/__init__.py:316, in SentencePieceProcessor.LoadFromFile(self, arg)
315 def LoadFromFile(self, arg):
--> 316 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from ../../deq_models/test/tokenizer.model"
}
Sorry, that was an accidental action on my part.
cc @itazap
I found that this is caused by setting add_prefix_space=False in GGUFLlamaConverter; in turn, from_slow=True is then forced by #28010. I checked loading from "meta-llama/Meta-Llama-3-8B", and I don't believe the add_prefix_space=False from #30391 is necessary: I checked tokenization, and a prefix space is not added when it is set to None. I can push a fix to change it to add_prefix_space=None (and test!) unless @ArthurZucker sees an issue with this?
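For reference, the kind of check described above might look like this (a sketch; it assumes access to the meta-llama/Meta-Llama-3-8B repo):
from transformers import AutoTokenizer

# With the reference (non-GGUF) Llama-3 tokenizer, no prefix space is added, so the
# first token of the sentence should be "Traditional", not "ĠTraditional".
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
ids = tok("Traditional Chinese", add_special_tokens=False)["input_ids"]
print(tok.convert_ids_to_tokens(ids))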
@Lin-xs as a workaround for now, you can pass add_prefix_space=False like below to avoid the error!
AutoTokenizer.from_pretrained(model_id, gguf_file=filename, add_prefix_space=False)
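Applied to the save-and-reload steps from the reproduction above, the workaround would look roughly like this (a sketch; per the comment above, passing add_prefix_space=False at load time avoids the ModelProto error on reload):
from transformers import AutoTokenizer

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"

# Workaround from above: pass add_prefix_space=False when loading from the GGUF file.
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename, add_prefix_space=False)

save_dir = '../../deq_models/test'  # same directory as in the reproduction above
tokenizer.save_pretrained(save_dir)
tokenizer2 = AutoTokenizer.from_pretrained(save_dir)  # expected to load without the RuntimeError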