Prompt interrupted before continuation for Unicode UTF-8 emojis
I have found that when a prompt contains a UTF-8 emoji character such as Unicode Character "👍" (U+1F44D), the prompt is cut short.
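For context, UTF-8 encodes U+1F44D as a four-byte sequence, so a tokenizer that only matches vocabulary pieces against raw bytes has nothing to emit for it. A minimal standalone C++ snippet (illustration only, not llama.cpp code) showing the encoding:

```cpp
#include <cstdio>
#include <string>

int main() {
    // U+1F44D (THUMBS UP SIGN) is a single code point, but UTF-8
    // encodes it as four bytes: 0xF0 0x9F 0x91 0x8D.
    const std::string thumbs_up = "\xF0\x9F\x91\x8D"; // 👍
    for (unsigned char c : thumbs_up) {
        printf("0x%02X ", c);
    }
    printf("\n"); // prints: 0xF0 0x9F 0x91 0x8D
    return 0;
}
```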
I'm reading a sample prompt from a text file:
```
cat prompt
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:
```
Looking at the logs, I can see that the tokenizer does in fact break at the U+1F44D code point:
```
(base)$ p=$(cat prompt); ./main -m ./models/13B/ggml-model-q4_0.bin -p $p -t 4 -n 512
main: seed = 1678656464
llama_model_load: loading model from './models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size = 800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from './models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size = 3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from './models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size = 3880.49 MB / num tensors = 363
main: prompt: 'Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:'
main: number of tokens in prompt = 36
1 -> ''
27418 -> 'Tw'
3905 -> 'ee'
29873 -> 't'
29901 -> ':'
376 -> ' "'
29902 -> 'I'
26277 -> ' hate'
372 -> ' it'
746 -> ' when'
590 -> ' my'
9008 -> ' phone'
16988 -> ' battery'
2977 -> ' dies'
1213 -> '."'
13 -> '
'
2008 -> 'Se'
593 -> 'nt'
2073 -> 'iment'
29901 -> ':'
12610 -> ' Neg'
1230 -> 'ative'
13 -> '
'
2277 -> '##'
29937 -> '#'
13 -> '
'
27418 -> 'Tw'
3905 -> 'ee'
29873 -> 't'
29901 -> ':'
376 -> ' "'
3421 -> 'My'
2462 -> ' day'
756 -> ' has'
1063 -> ' been'
29871 -> ' '
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 10 times better than yesterday. Now I have to sleep again..."
Sentiment: Neutral
###
Twitter is not about talking; Twitter is a social network for listening and responding instantly, as the tweets of Steve Jobs demonstrate well in Figure A-2 (page ). Just be sure you can interpret the information accurately. If the sentiment isn't clearly positive or negative—as^C
```
As the token dump shows, tokenization stops at the ' ' token (29871) just before the emoji, so the model only sees the prompt up to "My day has been " and continues from there, resulting in a broken input prompt.
The tokenizer is unable to support emojis without a Unicode support fix. I have a branch with a fix, or we can wait until more work is done here.
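For illustration, here is a rough sketch of the byte-fallback idea (hypothetical code, not the actual patch from the PR/branch below): when greedy longest-match tokenization finds no vocabulary piece at the current position, emit a SentencePiece-style single-byte token such as <0xF0> instead of dropping the rest of the input.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical vocabulary: maps text pieces (and <0xNN> byte pieces)
// to token ids. The real llama.cpp vocabulary is loaded from the model.
using vocab_t = std::map<std::string, int>;

static int lookup_token(const vocab_t & vocab, const std::string & piece) {
    auto it = vocab.find(piece);
    return it == vocab.end() ? -1 : it->second;
}

// Greedy longest-match tokenizer with byte fallback: instead of stopping
// at the first unmatched byte (the behavior seen in the log above), fall
// back to a single-byte <0xNN> token and keep going. Assumes the
// vocabulary contains all 256 <0xNN> pieces.
static std::vector<int> tokenize(const vocab_t & vocab, const std::string & text) {
    std::vector<int> tokens;
    size_t i = 0;
    while (i < text.size()) {
        size_t best_len = 0;
        int    best_tok = -1;
        for (size_t len = 1; len <= text.size() - i; ++len) {
            int tok = lookup_token(vocab, text.substr(i, len));
            if (tok != -1) {
                best_len = len;
                best_tok = tok;
            }
        }
        if (best_tok == -1) {
            char piece[8];
            snprintf(piece, sizeof(piece), "<0x%02X>", (unsigned char) text[i]);
            best_tok = lookup_token(vocab, piece); // byte-fallback token
            best_len = 1;
        }
        tokens.push_back(best_tok);
        i += best_len;
    }
    return tokens;
}
```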
PR: https://github.com/ggerganov/llama.cpp/pull/66
Branch: https://github.com/beiller/llama.cpp/tree/feature/tokenization
I believe this was fixed by #79.