llama.cpp
Unicode support
Thank you for creating such a great inference engine with a 10x speedup. Please add Unicode support to display other languages properly.

I tried to figure out how to implement Unicode support but am not getting far. From what I can see it mostly works, but the output contains random characters.
Here is a prompt in text format for easier copy/paste:
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 -n 512 -p $'人生の意味は'
1 -> ''
30313 -> '人'
30486 -> '生'
30199 -> 'の'
31474 -> '意'
This seems correct, since it matches the vocab entries I dumped from the token-parsing code:
llama_model_load: vocab[30313] = '人'
llama_model_load: vocab[30486] = '生'
llama_model_load: vocab[30199] = 'の'
llama_model_load: vocab[31474] = '意'
And the output I get is
人生の意���、フロントカードに���いてる。 2019年3月 © All Rights Reserved. [end of text]
So it is outputting some characters correctly, but others come out as �.
llama_model_load: vocab[30140] = '�'
I found a list of unprintable tokens from ID 131 to 258. If I remove those from the vocab, a prompt seems to generate Japanese, but I don't know Japanese!
llama.cpp % ./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 --repeat_last_n 64 --repeat_penalty 1.0 -n 512 -p $'人生の意味は'
Response
人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう
Google translate
The meaning of life is that one person is one person. Since Abe was standing there, it was only possible to be one person after leaving, but that's right.
Is it possible?
Response
人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう
The Japanese text you quote here is fairly ungrammatical, in a way that suggests (on top of some other issues that I figure are simply due to LLaMA not having learned the language very well) that some words are simply missing. Where were the unprintable tokens that you removed from this?
I removed "", "�", and "��" from the vocabulary, not from a sentence; that's not how it works. There is a large chunk of the "token dictionary" in the model that points to the unprintable character �. I removed those tokens from the dictionary of tokens the program uses. I suspect the model learned some corrupted text during training, so when it sees Japanese characters it confuses them with garbled text it has come across, making unprintable characters a likely candidate for the next word. Just my hypothesis.
Here is the pull request, the code change I made to make this work.
https://github.com/ggerganov/llama.cpp/pull/26/files
For anyone interested, here is the chunk in the 13B model file. Not sure if all models contain the same token vocabulary:
� vocab[131] (EFBFBD)
� vocab[132] (EFBFBD)
� vocab[133] (EFBFBD)
� vocab[134] (EFBFBD)
� vocab[135] (EFBFBD)
� vocab[136] (EFBFBD)
� vocab[137] (EFBFBD)
� vocab[138] (EFBFBD)
� vocab[139] (EFBFBD)
� vocab[140] (EFBFBD)
� vocab[141] (EFBFBD)
� vocab[142] (EFBFBD)
� vocab[143] (EFBFBD)
� vocab[144] (EFBFBD)
� vocab[145] (EFBFBD)
� vocab[146] (EFBFBD)
� vocab[147] (EFBFBD)
� vocab[148] (EFBFBD)
� vocab[149] (EFBFBD)
� vocab[150] (EFBFBD)
� vocab[151] (EFBFBD)
� vocab[152] (EFBFBD)
� vocab[153] (EFBFBD)
� vocab[154] (EFBFBD)
� vocab[155] (EFBFBD)
� vocab[156] (EFBFBD)
� vocab[157] (EFBFBD)
� vocab[158] (EFBFBD)
� vocab[159] (EFBFBD)
� vocab[160] (EFBFBD)
� vocab[161] (EFBFBD)
� vocab[162] (EFBFBD)
� vocab[163] (EFBFBD)
� vocab[164] (EFBFBD)
� vocab[165] (EFBFBD)
� vocab[166] (EFBFBD)
� vocab[167] (EFBFBD)
� vocab[168] (EFBFBD)
� vocab[169] (EFBFBD)
� vocab[170] (EFBFBD)
� vocab[171] (EFBFBD)
� vocab[172] (EFBFBD)
� vocab[173] (EFBFBD)
� vocab[174] (EFBFBD)
� vocab[175] (EFBFBD)
� vocab[176] (EFBFBD)
� vocab[177] (EFBFBD)
� vocab[178] (EFBFBD)
� vocab[179] (EFBFBD)
� vocab[180] (EFBFBD)
� vocab[181] (EFBFBD)
� vocab[182] (EFBFBD)
� vocab[183] (EFBFBD)
� vocab[184] (EFBFBD)
� vocab[185] (EFBFBD)
� vocab[186] (EFBFBD)
� vocab[187] (EFBFBD)
� vocab[188] (EFBFBD)
� vocab[189] (EFBFBD)
� vocab[190] (EFBFBD)
� vocab[191] (EFBFBD)
� vocab[192] (EFBFBD)
� vocab[193] (EFBFBD)
� vocab[194] (EFBFBD)
� vocab[195] (EFBFBD)
� vocab[196] (EFBFBD)
� vocab[197] (EFBFBD)
� vocab[198] (EFBFBD)
� vocab[199] (EFBFBD)
� vocab[200] (EFBFBD)
� vocab[201] (EFBFBD)
� vocab[202] (EFBFBD)
� vocab[203] (EFBFBD)
� vocab[204] (EFBFBD)
� vocab[205] (EFBFBD)
� vocab[206] (EFBFBD)
� vocab[207] (EFBFBD)
� vocab[208] (EFBFBD)
� vocab[209] (EFBFBD)
� vocab[210] (EFBFBD)
� vocab[211] (EFBFBD)
� vocab[212] (EFBFBD)
� vocab[213] (EFBFBD)
� vocab[214] (EFBFBD)
� vocab[215] (EFBFBD)
� vocab[216] (EFBFBD)
� vocab[217] (EFBFBD)
� vocab[218] (EFBFBD)
� vocab[219] (EFBFBD)
� vocab[220] (EFBFBD)
� vocab[221] (EFBFBD)
� vocab[222] (EFBFBD)
� vocab[223] (EFBFBD)
� vocab[224] (EFBFBD)
� vocab[225] (EFBFBD)
� vocab[226] (EFBFBD)
� vocab[227] (EFBFBD)
� vocab[228] (EFBFBD)
� vocab[229] (EFBFBD)
� vocab[230] (EFBFBD)
� vocab[231] (EFBFBD)
� vocab[232] (EFBFBD)
� vocab[233] (EFBFBD)
� vocab[234] (EFBFBD)
� vocab[235] (EFBFBD)
� vocab[236] (EFBFBD)
� vocab[237] (EFBFBD)
� vocab[238] (EFBFBD)
� vocab[239] (EFBFBD)
� vocab[240] (EFBFBD)
� vocab[241] (EFBFBD)
� vocab[242] (EFBFBD)
� vocab[243] (EFBFBD)
� vocab[244] (EFBFBD)
� vocab[245] (EFBFBD)
� vocab[246] (EFBFBD)
� vocab[247] (EFBFBD)
� vocab[248] (EFBFBD)
� vocab[249] (EFBFBD)
� vocab[250] (EFBFBD)
� vocab[251] (EFBFBD)
� vocab[252] (EFBFBD)
� vocab[253] (EFBFBD)
� vocab[254] (EFBFBD)
� vocab[255] (EFBFBD)
� vocab[256] (EFBFBD)
� vocab[257] (EFBFBD)
� vocab[258] (EFBFBD)
�� vocab[26308] (EFBFBDEFBFBD)
 vocab[31634] (EFBFBC)
Many token IDs point to 0xEFBFBD, which is the UTF-8 encoding of U+FFFD, the Unicode replacement character.
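A minimal sketch of how one could scan a dumped vocab for such entries (here `vocab` is a hypothetical plain Python list of the token strings, not an actual llama.cpp API):

# Sketch: find vocab entries that consist only of the U+FFFD replacement
# character (UTF-8 bytes EF BF BD). `vocab` is a hypothetical list of token
# strings dumped from the model file.
REPLACEMENT = "\ufffd"

def find_replacement_tokens(vocab):
    return [i for i, tok in enumerate(vocab) if tok and set(tok) == {REPLACEMENT}]

# On the 13B dump above this would report IDs 131-258 plus 26308.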
Nice find!
Due to the constantly changing encoding history of CJK (Chinese, Japanese, Korean), there is a big chance that the training data had the wrong encoding for non-ASCII languages. Simply removing them is good.
Some more testing shows that we can't simply remove the unprintable tokens. There should be some way to find the right encoding for them; otherwise the generated text becomes unreadable.

I'm not sure you have applied the code change. I cannot try your prompt since it's an image; would you mind pasting it as text? I think you have to check out my repo, because my code is not currently merged here yet.
https://github.com/beiller/llama.cpp
Try cloning and building in a different folder. It also includes the repeat penalty change. Again, my approach is not about removing the characters; it's a code change that will output something very different (and more accurate).
Here are some more examples:
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -p $'人生の意味は'
Outputs:
...
main: prompt: '人生の意味は'
main: number of tokens in prompt = 5
1 -> ''
30313 -> '人'
30486 -> '生'
30199 -> 'の'
31474 -> '意'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
人生の意はあまり
物学に世界をする発見 : 空気中でニトルを放つシステムが全球ワー... [end of text]
人生の意から
しようごいます。
I’ve been getting these error messages on my blog for a while now, but I didn't pay too much attention to them until recently when there were quite few. Maybe it was because last weekend (in Japan), the phone lines at my company went down and that screwed up all of our accounts that require login through Internet Explorer 7+ (which is set as default).
Unfortunately, I couldn't afford much time to fix them since they were so many. So now there are even more errors for you guys
人生の意とか、やりめているんだ [ 1 ]
部事情はもうそこまでのキャバラ
く立みなしに上下。自分がよききたかったと、子あればえるだけ作らない人は多数だ [ 2 ]
【キャバラ】 (ビジネスの了)
く
人生の意は知らない。 我が人生は事の意や子をう きた人の自分である人生に存するから よく、意のい期に実力から人の意を心したり、知らない人生の意は子をうしていることが、
Thank you. I built your repo and tested again; the unprintable characters are gone, but the meaning of the generated text is gone as well, as below.
There is another bug: the prompt is truncated if it is Chinese, as in https://github.com/ggerganov/llama.cpp/issues/11#issuecomment-1465083826
I just tried Chinese as well, and yes, it's truncated. It's possible that it doesn't understand other languages. It seems to be missing some Chinese character tokens, such as 篇, entirely!
Further up the code chain, in the model conversion code, I see the following. Before I write more: @ggerganov, thank you so much for putting this all together. I wonder if some tokens are getting lost. But maybe not, since there are 32000 tokens (and that appears to be how Google's tokenizer works). I will try to research and see if some tokens are "lost in translation"!
# Is this correct??
for i in range(32000):
    # TODO: this is probably wrong - not sure how this tokenizer works
    text = tokenizer.decode([29889, i]).encode('utf-8')
    # remove the first byte (it's always '.')
    text = text[1:]
    fout.write(struct.pack("i", len(text)))
    fout.write(text)
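My reading of why those EFBFBD entries appear (a guess based on the dump above, not verified against the sentencepiece internals): decode() of a lone byte-fallback piece like <0xE7> cannot produce a valid character from a single byte of a multi-byte sequence, so it emits U+FFFD instead, and that is exactly what the conversion script writes out. A small sketch to check this, assuming the usual models/tokenizer.model path:

from sentencepiece import SentencePieceProcessor

tokenizer = SentencePieceProcessor("models/tokenizer.model")

# ID 234 shows up later in this thread as the byte piece <0xE7>, one raw byte
# of '篇'. Decoding it on its own cannot yield a valid character, so the
# converted vocab ends up storing EF BF BD for it instead of the byte 0xE7.
print(tokenizer.id_to_piece(234))                      # '<0xE7>'
print(tokenizer.decode([29889, 234]).encode('utf-8'))  # '.' followed by EF BF BD, per the dump above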
@ggerganov you are too hard on yourself. How can you be wrong when so many tokens are present :P
I found the problem via some scripting. The tokenizer works differently than how we are using it. Also, token 29889 is '.', which is why @ggerganov has to remove the '.' character he adds when decoding; that trick is not affecting anything, so that is good!
from sentencepiece import SentencePieceProcessor
fname_tokenizer = "models/tokenizer.model"
tokenizer = SentencePieceProcessor(fname_tokenizer)
print(tokenizer.decode([29889]))
>>>.
result1 = tokenizer.encode("篇")
print(f'token: {result1}')
>>>[29871, 234, 178, 138]
decode1 = tokenizer.decode(result1)
print(f'decoded: {decode1}')
>>>decoded: 篇
So it seems we need to leverage this tokenizer in the C++ code; the current method of tokenizing is not correct. I can attempt it, but it will require adding sentencepiece.
The crux of the issue, if I can try to explain, is that the C++ code tries to find the best matching token (a single token) for the input text. But as we see here, this single character actually needs to be encoded as multiple tokens! Strange. I think the tokens can be removed from the model files in the conversion script and we should just use the sentencepiece C++ code. Thoughts??
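To make the idea concrete, here is a rough Python sketch of the byte-fallback part only (the real SentencePiece algorithm also does subword merging, so treat this as an illustration of "one character may need several byte tokens", not a replacement tokenizer; `piece_to_id` is a hypothetical dict from piece string to token ID):

def tokenize_with_byte_fallback(text, piece_to_id, unk_id=0):
    ids = []
    for ch in text:
        if ch in piece_to_id:
            # The character exists as its own piece (e.g. '已' -> 31290).
            ids.append(piece_to_id[ch])
        else:
            # No piece for this character: fall back to one <0xXX> byte token
            # per UTF-8 byte, e.g. '篇' -> 0xE7 0xAF 0x87 -> three byte tokens.
            for b in ch.encode('utf-8'):
                ids.append(piece_to_id.get(f"<0x{b:02X}>", unk_id))
    return ids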
Dump the tokenizer.model file to text with:

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')
vocab_list = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]
with open('vocab.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab_list))
Some characters like '篇', '雨', '许' were not found.
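A way to confirm what happened to those characters, reusing the `sp` object from the snippet above (a sketch; the exact unk ID is an assumption): they simply have no piece of their own, while the byte-fallback pieces covering their UTF-8 encodings do exist.

# '篇' has no dedicated piece, so it never appears in vocab.txt, but the
# byte-fallback pieces covering its UTF-8 bytes (E7 AF 87) are there.
print(sp.piece_to_id('篇'))        # falls back to the unk ID
print(sp.piece_to_id('已'))        # 31290 -- a real single-character piece
print(sp.piece_to_id('<0xE7>'))    # 234 -- byte-fallback piece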
@wizd see my comment; it seems it's more complex, and such characters will "translate" to multiple tokens, but I believe it will actually support Chinese with some big refactors :P I may have time to make it work, we will see!
Edit
I think we just found out what that big chunk of unprintable characters is for :)
[29871, 234, 178, 138] translates to:
0x20, 0xEFBFBD, 0xEFBFBD, 0xEFBFBD
AKA ���
But in actuality it should be:
篇
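The arithmetic seems to line up, for what it's worth: the first three IDs are special tokens, so the byte piece <0xXX> appears to sit at ID 3 + 0xXX in this vocab, and those three bytes are exactly the UTF-8 encoding of 篇. A quick check (the ID offset is an observation from this model's vocab, not a documented guarantee):

ids = [234, 178, 138]            # the byte tokens after the '▁' (29871)
raw = bytes(i - 3 for i in ids)  # assumes <0xXX> lives at ID 3 + 0xXX
print(raw)                       # b'\xe7\xaf\x87'
print(raw.decode('utf-8'))       # 篇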
trying to understand it... https://unicode.scarfboy.com/?s=%E7%AF%87
It seems we should use this library to tokenize: https://github.com/google/sentencepiece
@wizd yes, that is correct. And the code also assumes a 1 "word" to 1 token mapping, which isn't the case. Also, a "word" is not really a word; it's more like a word piece.
Yep, and we need to make the code UTF-8 aware: https://github.com/facebookresearch/llama/blob/main/FAQ.md#4-other-languages
The code has no problem with UTF-8 so far. I am working on a very hacky solution right now :)
I actually got it working in a very hacky way. Example:
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 16 --repeat_last_n 16 -p $'J\'aime le chocolat = I like chocolate\n祝你一天过得愉快 ='
Output:
J'aime le chocolat = I like chocolate
祝你一天过得愉快 = Have a happy holiday
我喜欢 来自中国
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 --repeat_last_n 64 -p $'什么是生命的意义?生命的意义在于'
Output (admittedly cherry-picked; sometimes it contains half English):
什么是生命的意义?生命的意义在于存活,这就上个理论说了所有。人类对现实世界做了出不可思当的窗口式巡回行为,而又没能找到真正天生份目前且无需死去才是最大缺少之处。
wow, you are so cool! @beiller
Another interesting outcome: it can actually output emojis now!
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 --repeat_last_n 64 -p $'Have you heard this funny joke? '
Have you heard this funny joke? 😆
Are There Any Real Vegans In This House??…If Not, Now’s Your Chance To Change That.
This Is About What Would Happen If You Drank A Gallon of Milk Every Day For One Year
Sadly the joke was not funny or even a joke.
I actually got it working in a very hacky way. Example:
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 16 --repeat_last_n 16 -p $'J\'aime le chocolat = I like chocolate\n祝你一天过得愉快 ='
Output:
J'aime le chocolat = I like chocolate
祝你一天过得愉快 = Have a happy holiday
我喜欢 来自中国
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 --repeat_last_n 64 -p $'什么是生命的意义?生命的意义在于'
Output (admittedly cherry-picked; sometimes it contains half English):
什么是生命的意义?生命的意义在于存活,这就上个理论说了所有。人类对现实世界做了出不可思当的窗口式巡回行为,而又没能找到真正天生份目前且无需死去才是最大缺少之处。
This output contains misunderstandings; maybe there is still an encoding issue?
Maybe we can use a fact check to verify the output, e.g.:
关于爱因斯坦的生平。他出生于 About the life of Einstein. He was born in
If the output is wrong, we can catch it easily.
Response
关于爱因斯坦的生平。他出生于1856年,但已经成为了一个名人在1902年之后,是世界上最知名的数学家和科学家
English response (died in the future lol)
About the life of Einstein. He was born in 1879, he died on April 10th ,2054; his parents were Hermann and Pauline Winteler-Einstein . His first job as a clerk at Bern patent office where
I think the strange responses are due to my using the smaller 13B model. Maybe a bigger model is more accurate. I think the Unicode issue is resolved (very hacky though).
@ggerganov I can fix it, but I suck at C++ and won't know how to include the sentencepiece C++ library and have it compile and link. I need help with that part. I had the C++ call Python functions for my solution here, since it was already installed, but that is a travesty, I'm sure.
@beiller Adding sentencepiece to the project would be a last resort - I don't want to bloat the repo too much. Hopefully, we can figure out a concise C++ implementation.
@beiller the key thing to be aware of: the tokenizer works on bytes, not on characters. So:
- one character could take multiple tokens (up to 4 without emoji support, 10+ with emojis)
- tokens can and will start and/or end in the middle of multibyte characters
- tokenization is not unique (there could be multiple ways to encode given text)
- decoding some token sequences will sometimes give you invalid UTF-8; this is to be expected (see the sketch after this list for one way to handle it when printing)
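A minimal sketch of the output-side handling this implies: buffer the raw bytes of each token and only print characters once they are complete, so a byte token that ends mid-character just waits for the next one. (Python here for brevity; the actual fix would live in the C++ printing loop.)

import codecs

# Incremental UTF-8 decoder: keeps incomplete byte sequences buffered and
# substitutes U+FFFD only for bytes that can never form a valid character.
decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')

def emit(token_bytes):
    text = decoder.decode(token_bytes)
    if text:
        print(text, end='', flush=True)

# '篇' arriving as three separate byte tokens:
emit(b'\xe7')
emit(b'\xaf')
emit(b'\x87')   # the character is printed only once all three bytes are in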
Some research: I used sentencepiece to tokenize an input and dump it. I got this:
piece: ▁
piece: <0xE7>
piece: <0xAF>
piece: <0x87>
piece: <0xE5>
piece: <0xB9>
piece: <0x85>
piece: 已
piece: 经
1 31290 31412
main: prompt: '篇幅已经'
main: number of tokens in prompt = 3
1 -> ''
31290 -> '已'
31412 -> '经'
"篇幅" is not found because in the vocab table it is not stored as those characters, but as <0xE7>, <0xAF>, etc.
With sentencepiece, which is full of magic numbers, I can get the result right:
main: prompt: '篇幅已经'
main: number of tokens in prompt = 10
1 -> ''
29871 -> '▁'
234 -> ''
178 -> ''
138 -> ''
232 -> ''
188 -> ''
136 -> ''
31290 -> '已'
31412 -> '经'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
▁已经达了50万,我们再期望完成到全部五十平民的目标最多能安放自去生活。如果你有人看段情景不好▁就可以关注在线客(部分)▁这里为我们一起同行,因此地方▁全是在他实现的目标。参与将从未来开始开展! [end of text]
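For anyone who wants to reproduce the piece dump above, something like this should do it with the sentencepiece Python package (the exact script used above is not shown, so this is a reconstruction):

from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor("models/tokenizer.model")

for piece in sp.encode_as_pieces("篇幅已经"):
    print("piece:", piece)
# piece: ▁
# piece: <0xE7>
# piece: <0xAF>
# piece: <0x87>
# piece: <0xE5>
# piece: <0xB9>
# piece: <0x85>
# piece: 已
# piece: 经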