LlamaCpp model crashes with multi-token characters
The bug
A string containing certain Unicode characters causes an exception. This is likely because 歪 is a multi-token character for this tokenizer:
llama3.engine.tokenizer('歪'.encode('utf8')) -> [15722, 103]
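For context, '歪' is three bytes in UTF-8, so splitting it across two tokens necessarily leaves each token holding an incomplete character. A quick standalone illustration (plain Python, no model needed):
char_bytes = '歪'.encode('utf8')
print(char_bytes)  # b'\xe6\xad\xaa' -- one character, three bytes
try:
    char_bytes[:2].decode('utf8')  # a token boundary falls inside the character
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data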
I also tested a Transformers model, which seems to work fine.
To Reproduce
from guidance import models, select
llama3 = models.LlamaCpp('./Meta-Llama-3-8B-Instruct.Q4_0.gguf')
llama3 + '歪' + select(['打正着','门邪道'])
terminate called after throwing an instance of 'std::invalid_argument'
what(): invalid character
Aborted
System info: Ubuntu 22.04, Python 3.10.12, guidance==0.1.15, llama_cpp_python==0.2.79
Hi @knilink, thanks for reporting this! Do you know if this happens if you try to generate with llama-cpp-python directly? Getting the full stack trace here would be very helpful!
@paulbkoch might have thoughts here too
Hi @Harsha-Nori, I did a bit more investigation and can confirm the error is caused by sending incomplete Unicode bytes to the llama_cpp tokenizer:
$ printf '\xe6\xad' | ./llama-tokenize -m ./Meta-Llama-3-8B-Instruct.Q8_0.gguf --stdin
terminate called after throwing an instance of 'std::invalid_argument'
what(): invalid character
Aborted
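The same abort should be reachable from Python, since guidance ends up passing these raw bytes to llama-cpp-python's tokenize(). A minimal sketch (vocab_only=True is an assumption here, just to skip loading the weights):
from llama_cpp import Llama

llm = Llama('./Meta-Llama-3-8B-Instruct.Q8_0.gguf', vocab_only=True)
# llama.cpp throws std::invalid_argument across the C boundary,
# so this terminates the whole process instead of raising in Python:
llm.tokenize(b'\xe6\xad', add_bos=False, special=True)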
After adding a byte_string.decode('utf8') call before
https://github.com/guidance-ai/guidance/blob/337738322f7d09f36613a4c40f86137c3a0a1553/guidance/models/llama_cpp/_llama_cpp.py#L78
I got the following stack trace:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Cell In[21], line 4
2 # guidance.models._model.ipython_is_imported = False
3 llama3 = LlamaCpp('/home/jovyan/cache/Meta-Llama-3-8B-Instruct.Q8_0.gguf', file_name='',chat_template=chat.Llama3ChatTemplate, n_gpu_layers=-1,)
----> 4 llama3 + '歪' + select(['打正着','门邪道']) + gen(stop='。')
File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:1159, in Model.__add__(self, value)
1157 # run stateless functions (grammar nodes)
1158 elif isinstance(value, GrammarFunction):
-> 1159 out = lm._run_stateless(value)
1161 # run stateful functions
1162 else:
1163 out = value(lm)
File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:1364, in Model._run_stateless(self, stateless_function, temperature, top_p, n)
1362 delayed_bytes = b""
1363 # last_is_generated = False
-> 1364 for chunk in gen_obj:
1365
1366 # we make everything full probability if we are not computing uncertainty
1367 # if not self.engine.compute_log_probs:
1368 # chunk.new_bytes_prob = 1.0
1369
1370 # convert the bytes to a string (delaying if we don't yet have a valid unicode string)
1371 lm.token_count += chunk.new_token_count
1372 chunk.new_bytes = delayed_bytes + chunk.new_bytes
File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:732, in Engine.__call__(self, parser, grammar, ensure_bos_token)
717 def __call__(self, parser, grammar, ensure_bos_token=True):
718 """Returns a new updated parser state executed through the grammar.
719
720 Parameters
(...)
729 This is the grammar we are extending the parser with.
730 """
--> 732 self.start(parser, grammar, ensure_bos_token)
734 logits = None
735 while True:
File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:264, in Engine.start(self, parser, grammar, ensure_bos_token)
262 # run a simple tokenizer (that does not use a grammar) on the prefix for better performance
263 self._token_ids, self._token_byte_positions = self._tokenize_prefix(prompt)
--> 264 self._token_ids, self._token_byte_positions = self._cleanup_tokens(
265 self._token_ids, self._token_byte_positions
266 )
267 if len(self._token_byte_positions) > 0:
268 self._pre_parser_bytes = self._token_byte_positions[-1]
File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:808, in Engine._cleanup_tokens(self, token_ids, token_byte_positions)
805 def _cleanup_tokens(self, token_ids, token_byte_positions):
806
807 # compute a joint tokenization
--> 808 joint_token_ids = self._joint_tokenize(token_ids)
810 # see if we need to redo the tokenization
811 redo = False
Cell In[20], line 151, in LlamaCppEngine._joint_tokenize(self, token_ids)
149 """What a full joint tokenizer would give for a given byte string"""
150 byte_string = b"".join([self.tokenizer.tokens[t] for t in token_ids])
--> 151 return self.tokenizer(byte_string)
Cell In[20], line 81, in LlamaCppTokenizer.__call__(self, byte_string)
79 print('[LlamaCppTokenizer] begin', flush=True)
80 print(byte_string, flush=True)
---> 81 print(byte_string.decode('utf8'), flush=True)
82 res = self._model_obj.tokenize(byte_string, add_bos=False, special=True)
83 print('[LlamaCppTokenizer] end', flush=True)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 17-18: unexpected end of data
The Transformers model didn't have the issue because its _joint_tokenize doesn't use the tokenizer directly.
I didn't do much testing, but copying TransformersEngine._joint_tokenize over to LlamaCppEngine seems to fix the issue.
@knilink, thank you for bringing this up. I've drafted a (very) tentative fix in #962, which works by chopping bytes off the input to the encode() method until what remains is a valid UTF-8 string. However, I'm really concerned that this is going to cause trouble for us elsewhere.
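For anyone following along, a minimal sketch of that chopping idea in isolation (a hypothetical helper, not the actual #962 code):
def tokenize_utf8_prefix(byte_string, tokenize):
    # Strip trailing bytes until the remainder decodes as UTF-8; when the
    # only problem is a truncated final character, at most 3 bytes get
    # dropped (a UTF-8 code point is at most 4 bytes long).
    dropped = b''
    while byte_string:
        try:
            byte_string.decode('utf8')
            break
        except UnicodeDecodeError:
            dropped = byte_string[-1:] + dropped
            byte_string = byte_string[:-1]
    return tokenize(byte_string), dropped
The concern above still applies: the dropped bytes have to be reconciled somewhere downstream, or the token/byte positions will drift.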
Have you filed your repro (printf '\xe6\xad' | ./llama-tokenize -m ./Meta-Llama-3-8B-Instruct.Q8_0.gguf --stdin) as a bug with llama.cpp?
I have been doing some more prodding based on @knilink's examples, and I've opened a bug on the HF repo from which I grabbed the model (although this does look like something going wrong at the LlamaCpp layer): https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/discussions/9
Also filed the bug on llama.cpp: https://github.com/ggerganov/llama.cpp/issues/8691