Prompt interrupted before continuation for Unicode UTF-8 emojis
I have found that when a prompt contains a UTF-8 emoji character such as Unicode Character "👍" (U+1F44D), the prompt is cut short.
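For context, UTF-8 encodes U+1F44D as a four-byte sequence, so a tokenizer that only matches vocabulary pieces against raw bytes has nothing to emit for it. A minimal standalone C++ snippet (illustration only, not llama.cpp code) showing the encoding:

```cpp
#include <cstdio>
#include <string>

int main() {
    // U+1F44D (THUMBS UP SIGN) is a single code point, but UTF-8
    // encodes it as four bytes: 0xF0 0x9F 0x91 0x8D.
    const std::string thumbs_up = "\xF0\x9F\x91\x8D"; // 👍
    for (unsigned char c : thumbs_up) {
        printf("0x%02X ", c);
    }
    printf("\n"); // prints: 0xF0 0x9F 0x91 0x8D
    return 0;
}
```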
I'm reading a sample prompt from a text file:
```
cat prompt
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:
```
Looking at the logs, I can see that the tokenizer does in fact break at the U+1F44D code point:
```
(base)$ p=$(cat prompt); ./main -m ./models/13B/ggml-model-q4_0.bin -p $p -t 4 -n 512
main: seed = 1678656464
llama_model_load: loading model from './models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size = 800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from './models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size = 3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from './models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size = 3880.49 MB / num tensors = 363
main: prompt: 'Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:'
main: number of tokens in prompt = 36
1 -> ''
27418 -> 'Tw'
3905 -> 'ee'
29873 -> 't'
29901 -> ':'
376 -> ' "'
29902 -> 'I'
26277 -> ' hate'
372 -> ' it'
746 -> ' when'
590 -> ' my'
9008 -> ' phone'
16988 -> ' battery'
2977 -> ' dies'
1213 -> '."'
13 -> '
'
2008 -> 'Se'
593 -> 'nt'
2073 -> 'iment'
29901 -> ':'
12610 -> ' Neg'
1230 -> 'ative'
13 -> '
'
2277 -> '##'
29937 -> '#'
13 -> '
'
27418 -> 'Tw'
3905 -> 'ee'
29873 -> 't'
29901 -> ':'
376 -> ' "'
3421 -> 'My'
2462 -> ' day'
756 -> ' has'
1063 -> ' been'
29871 -> ' '
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 10 times better than yesterday. Now I have to sleep again..."
Sentiment: Neutral
###
Twitter is not about talking; Twitter is a social network for listening and responding instantly, as the tweets of Steve Jobs demonstrate well in Figure A-2 (page ). Just be sure you can interpret the information accurately. If the sentiment isn't clearly positive or negative—as^C
```
As the token dump shows, tokenization stops at the ' ' token (29871) just before the emoji, so the model only sees the prompt up to "My day has been " and continues from there, resulting in a broken input prompt.
The tokenizer is unable to support emojis without a Unicode support fix. I have a branch with a fix, or we can wait until more work is done here.
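For illustration, here is a rough sketch of the byte-fallback idea (hypothetical code, not the actual patch from the PR/branch below): when greedy longest-match tokenization finds no vocabulary piece at the current position, emit a SentencePiece-style single-byte token such as <0xF0> instead of dropping the rest of the input.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical vocabulary: maps text pieces (and <0xNN> byte pieces)
// to token ids. The real llama.cpp vocabulary is loaded from the model.
using vocab_t = std::map<std::string, int>;

static int lookup_token(const vocab_t & vocab, const std::string & piece) {
    auto it = vocab.find(piece);
    return it == vocab.end() ? -1 : it->second;
}

// Greedy longest-match tokenizer with byte fallback: instead of stopping
// at the first unmatched byte (the behavior seen in the log above), fall
// back to a single-byte <0xNN> token and keep going. Assumes the
// vocabulary contains all 256 <0xNN> pieces.
static std::vector<int> tokenize(const vocab_t & vocab, const std::string & text) {
    std::vector<int> tokens;
    size_t i = 0;
    while (i < text.size()) {
        size_t best_len = 0;
        int    best_tok = -1;
        for (size_t len = 1; len <= text.size() - i; ++len) {
            int tok = lookup_token(vocab, text.substr(i, len));
            if (tok != -1) {
                best_len = len;
                best_tok = tok;
            }
        }
        if (best_tok == -1) {
            char piece[8];
            snprintf(piece, sizeof(piece), "<0x%02X>", (unsigned char) text[i]);
            best_tok = lookup_token(vocab, piece); // byte-fallback token
            best_len = 1;
        }
        tokens.push_back(best_tok);
        i += best_len;
    }
    return tokens;
}
```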
PR: https://github.com/ggerganov/llama.cpp/pull/66
Branch: https://github.com/beiller/llama.cpp/tree/feature/tokenization
I believe this was fixed by #79.