LlamaCpp model crashes with multi-token characters
The bug
A string containing certain Unicode characters causes an exception. This is likely because 歪 is a multi-token character for this tokenizer:
llama3.engine.tokenizer('歪'.encode('utf8')) -> [15722, 103]
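For context, '歪' is three bytes in UTF-8, so splitting it across two tokens necessarily leaves each token holding an incomplete character. A quick standalone illustration (plain Python, no model needed):
char_bytes = '歪'.encode('utf8')
print(char_bytes)  # b'\xe6\xad\xaa' -- one character, three bytes
try:
    char_bytes[:2].decode('utf8')  # a token boundary falls inside the character
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data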
I also tested a Transformers model, which seems to work fine.
To Reproduce
from guidance import models, select
llama3 = models.LlamaCpp('./Meta-Llama-3-8B-Instruct.Q4_0.gguf')
llama3 + '歪' + select(['打正着','门邪道'])
terminate called after throwing an instance of 'std::invalid_argument'
what(): invalid character
Aborted
System info: Ubuntu 22.04, Python 3.10.12, guidance==0.1.15, llama_cpp_python==0.2.79
Hi @knilink, thanks for reporting this! Do you know if this happens if you try to generate with llama-cpp-python directly? Getting the full stack trace here would be very helpful!
@paulbkoch might have thoughts here too
Hi @Harsha-Nori, I did a bit more investigation and can confirm the error is caused by sending incomplete Unicode bytes to the llama_cpp tokenizer:
$ printf '\xe6\xad' | ./llama-tokenize -m ./Meta-Llama-3-8B-Instruct.Q8_0.gguf --stdin
terminate called after throwing an instance of 'std::invalid_argument'
what(): invalid character
Aborted
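The same abort should be reachable from Python, since guidance ends up passing these raw bytes to llama-cpp-python's tokenize(). A minimal sketch (vocab_only=True is an assumption here, just to skip loading the weights):
from llama_cpp import Llama

llm = Llama('./Meta-Llama-3-8B-Instruct.Q8_0.gguf', vocab_only=True)
# llama.cpp throws std::invalid_argument across the C boundary,
# so this terminates the whole process instead of raising in Python:
llm.tokenize(b'\xe6\xad', add_bos=False, special=True)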
After adding a byte_string.decode('utf8') call before
https://github.com/guidance-ai/guidance/blob/337738322f7d09f36613a4c40f86137c3a0a1553/guidance/models/llama_cpp/_llama_cpp.py#L78
I got the following stack trace:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Cell In[21], line 4
2 # guidance.models._model.ipython_is_imported = False
3 llama3 = LlamaCpp('/home/jovyan/cache/Meta-Llama-3-8B-Instruct.Q8_0.gguf', file_name='',chat_template=chat.Llama3ChatTemplate, n_gpu_layers=-1,)
----> 4 llama3 + '歪' + select(['打正着','门邪道']) + gen(stop='。')
File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:1159, in Model.__add__(self, value)
1157 # run stateless functions (grammar nodes)
1158 elif isinstance(value, GrammarFunction):
-> 1159 out = lm._run_stateless(value)
1161 # run stateful functions
1162 else:
1163 out = value(lm)
File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:1364, in Model._run_stateless(self, stateless_function, temperature, top_p, n)
1362 delayed_bytes = b""
1363 # last_is_generated = False
-> 1364 for chunk in gen_obj:
1365
1366 # we make everything full probability if we are not computing uncertainty
1367 # if not self.engine.compute_log_probs:
1368 # chunk.new_bytes_prob = 1.0
1369
1370 # convert the bytes to a string (delaying if we don't yet have a valid unicode string)
1371 lm.token_count += chunk.new_token_count
1372 chunk.new_bytes = delayed_bytes + chunk.new_bytes
File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:732, in Engine.__call__(self, parser, grammar, ensure_bos_token)
717 def __call__(self, parser, grammar, ensure_bos_token=True):
718 """Returns a new updated parser state executed through the grammar.
719
720 Parameters
(...)
729 This is the grammar we are extending the parser with.
730 """
--> 732 self.start(parser, grammar, ensure_bos_token)
734 logits = None
735 while True:
File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:264, in Engine.start(self, parser, grammar, ensure_bos_token)
262 # run a simple tokenizer (that does not use a grammar) on the prefix for better performance
263 self._token_ids, self._token_byte_positions = self._tokenize_prefix(prompt)
--> 264 self._token_ids, self._token_byte_positions = self._cleanup_tokens(
265 self._token_ids, self._token_byte_positions
266 )
267 if len(self._token_byte_positions) > 0:
268 self._pre_parser_bytes = self._token_byte_positions[-1]
File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:808, in Engine._cleanup_tokens(self, token_ids, token_byte_positions)
805 def _cleanup_tokens(self, token_ids, token_byte_positions):
806
807 # compute a joint tokenization
--> 808 joint_token_ids = self._joint_tokenize(token_ids)
810 # see if we need to redo the tokenization
811 redo = False
Cell In[20], line 151, in LlamaCppEngine._joint_tokenize(self, token_ids)
149 """What a full joint tokenizer would give for a given byte string"""
150 byte_string = b"".join([self.tokenizer.tokens[t] for t in token_ids])
--> 151 return self.tokenizer(byte_string)
Cell In[20], line 81, in LlamaCppTokenizer.__call__(self, byte_string)
79 print('[LlamaCppTokenizer] begin', flush=True)
80 print(byte_string, flush=True)
---> 81 print(byte_string.decode('utf8'), flush=True)
82 res = self._model_obj.tokenize(byte_string, add_bos=False, special=True)
83 print('[LlamaCppTokenizer] end', flush=True)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 17-18: unexpected end of data
The Transformers model didn't have the issue because its _joint_tokenize doesn't use the tokenizer directly.
I didn't do much testing, but copying TransformersEngine._joint_tokenize over to LlamaCppEngine seems to fix the issue.
@knilink, thank you for bringing this up. I've drafted a (very) tentative fix in #962, which works by chopping bytes off the input to the encode() method until what remains is a valid UTF-8 string. However, I'm really concerned that this is going to cause trouble for us elsewhere.
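For anyone following along, a minimal sketch of that chopping idea in isolation (a hypothetical helper, not the actual #962 code):
def tokenize_utf8_prefix(byte_string, tokenize):
    # Strip trailing bytes until the remainder decodes as UTF-8; when the
    # only problem is a truncated final character, at most 3 bytes get
    # dropped (a UTF-8 code point is at most 4 bytes long).
    dropped = b''
    while byte_string:
        try:
            byte_string.decode('utf8')
            break
        except UnicodeDecodeError:
            dropped = byte_string[-1:] + dropped
            byte_string = byte_string[:-1]
    return tokenize(byte_string), dropped
The concern above still applies: the dropped bytes have to be reconciled somewhere downstream, or the token/byte positions will drift.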
Have you filed your repro (printf '\xe6\xad' | ./llama-tokenize -m ./Meta-Llama-3-8B-Instruct.Q8_0.gguf --stdin) as a bug with llama.cpp?
I have been doing some more prodding based on @knilink's examples, and I've opened a bug on the HF repo from which I grabbed the model (although this does look like something going wrong at the LlamaCpp layer): https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/discussions/9
Also filed the bug on llama.cpp: https://github.com/ggerganov/llama.cpp/issues/8691