
Unicode support

Open wizd opened this issue 1 year ago • 30 comments

Thank you for creating such a great inference engine, with a 10x speedup. Please add Unicode support to display other languages properly.

Screenshot 2023-03-11 at 7 12 50 PM

wizd avatar Mar 11 '23 11:03 wizd

I tried to determine how to implement Unicode support and I am not getting far. It seems to work from all I am seeing, but the output has random characters.

Here is a prompt in text format for easier copy/paste

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 -n 512 -p $'人生の意味は'

     1 -> ''
 30313 -> '人'
 30486 -> '生'
 30199 -> 'の'
 31474 -> '意'

This seems correct, based on the token parsing code I dumped out:

llama_model_load: vocab[30313] = '人'
llama_model_load: vocab[30486] = '生'
llama_model_load: vocab[30199] = 'の'
llama_model_load: vocab[31474] = '意'

And the output I get is

人生の意���、フロントカードに���いてる。 2019年3月 © All Rights Reserved. [end of text]

So it is outputting some characters correctly, but others come out as �:

llama_model_load: vocab[30140] = '�'

beiller avatar Mar 11 '23 20:03 beiller

I found a list of unprintable tokens from ID 131 to 258. If I remove those from the vocab, a prompt can generate in Japanese, it seems, but I don't know Japanese!

llama.cpp % ./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 --repeat_last_n 64 --repeat_penalty 1.0 -n 512 -p $'人生の意味は'

Response

人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう

Google translate

The meaning of life is that one person is one person. Since Abe was standing there, it was only possible to be one person after leaving, but that's right.

Is it plausible?

beiller avatar Mar 11 '23 21:03 beiller

Response

人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう

The Japanese text you quote here is fairly ungrammatical, in a way that suggests (on top of some other issues that I figure are simply due to LLaMA not having learned the language very well) that some words are simply missing. Where were the unprintable tokens that you removed from this?

blackhole89 avatar Mar 12 '23 02:03 blackhole89

I removed "", "�", "��" from the vocabulary, not from a sentence; that's not how it works. There is a large chunk of the "token dictionary" in the model that points to the unprintable character �. I removed those tokens from the dictionary of tokens the program is using. I suspect the model learned some corrupted text during training, so when it sees Japanese characters it confuses them with garbled text it has come across, making unprintable characters a likely candidate for the next word. Just my hypothesis.

Here is the pull request with the code change I made to make this work:

https://github.com/ggerganov/llama.cpp/pull/26/files
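In spirit, the change amounts to something like this (a Python sketch only; the real change is in C++ in the PR, and vocab here is a hypothetical id-to-text mapping):

# drop vocab entries whose text contains the unprintable replacement character
vocab = {i: text for i, text in vocab.items() if '\ufffd' not in text}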

beiller avatar Mar 12 '23 02:03 beiller

For anyone interested, here is the chunk in the 13B model file. Not sure if all models contain the same token grammars:

� vocab[131] (EFBFBD)
� vocab[132] (EFBFBD)
� vocab[133] (EFBFBD)
... (and so on; every ID from 131 through 258 maps to the same bytes) ...
� vocab[257] (EFBFBD)
� vocab[258] (EFBFBD)
�� vocab[26308] (EFBFBDEFBFBD)
 vocab[31634] (EFBFBC)

Many token IDs point to 0xEFBFBD, which is the UTF-8 encoding of U+FFFD, the unprintable Unicode replacement character
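A one-line check in Python (the byte string is copied from the dump above):

# 0xEFBFBD is the UTF-8 encoding of U+FFFD, the replacement character
print(b'\xEF\xBF\xBD'.decode('utf-8'))
>>> �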

beiller avatar Mar 12 '23 02:03 beiller

Nice find!

Due to the constantly changing encoding history of CJK (Chinese, Japanese, Korean) text, there is a big chance that the trained model picked up wrongly encoded non-ASCII text. Simply removing it is fine.

wizd avatar Mar 12 '23 03:03 wizd

Some more testing shows that we can't simply remove the unprintable tokens. There should be some way to find the right encoding for them; otherwise the generated text becomes unreadable.

Screenshot 2023-03-12 at 11 20 50 AM

wizd avatar Mar 12 '23 03:03 wizd

I'm not sure you have applied the code change. I cannot try your prompt since it's an image; mind pasting it? I think you have to check out from my repo, because my code is not merged here yet.

https://github.com/beiller/llama.cpp

Try cloning and building in a different folder. It also includes the repeat penalty change. Again, my approach is not about removing the characters; it's a code change that outputs something very different (and more accurate).

beiller avatar Mar 12 '23 03:03 beiller

Here are some more examples:

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -p $'人生の意味は'

Outputs:

...
main: prompt: '人生の意味は'
main: number of tokens in prompt = 5
     1 -> ''
 30313 -> '人'
 30486 -> '生'
 30199 -> 'の'
 31474 -> '意'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


人生の意はあまり
物学に世界をする発見 : 空気中でニトルを放つシステムが全球ワー... [end of text]
人生の意から
しようごいます。
I’ve been getting these error messages on my blog for a while now, but I didn't pay too much attention to them until recently when there were quite few. Maybe it was because last weekend (in Japan), the phone lines at my company went down and that screwed up all of our accounts that require login through Internet Explorer 7+ (which is set as default).
Unfortunately, I couldn't afford much time to fix them since they were so many. So now there are even more errors for you guys
人生の意とか、やりめているんだ [ 1 ]
部事情はもうそこまでのキャバラ
く立みなしに上下。自分がよききたかったと、子あればえるだけ作らない人は多数だ [ 2 ]
【キャバラ】 (ビジネスの了)
く
人生の意は知らない。 我が人生は事の意や子をう きた人の自分である人生に存するから よく、意のい期に実力から人の意を心したり、知らない人生の意は子をうしていることが、

beiller avatar Mar 12 '23 03:03 beiller

Thank you. I built your repo and tested again. The unprintable characters are gone, but the meaning of the generated text is gone as well, like below. Screenshot 2023-03-12 at 11 37 21 AM

wizd avatar Mar 12 '23 03:03 wizd

There is another bug: truncation of the prompt if it is Chinese, as in https://github.com/ggerganov/llama.cpp/issues/11#issuecomment-1465083826

wizd avatar Mar 12 '23 03:03 wizd

I just tried Chinese as well and yes, it's truncated. It's possible that it doesn't understand other languages. It seems to be missing some Chinese character tokens entirely!

Further up the code chain, in the model conversion code, I see the following. Before I write more: @ggerganov, thank you so much for putting this all together. I wonder if some tokens are getting lost. But maybe not, since there are 32000 tokens (and that appears to be how Google's tokenizer works). I will try to research and see if some tokens are "lost in translation"!

    # Is this correct??
    for i in range(32000):
        # TODO: this is probably wrong - not sure how this tokenizer works
        text = tokenizer.decode([29889, i]).encode('utf-8')
        # remove the first byte (it's always '.')
        text = text[1:]
        fout.write(struct.pack("i", len(text)))
        fout.write(text)
        

@ggerganov you are too hard on yourself. How can you be wrong when so many tokens are present :P
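For what it's worth, here is one way the vocab could be dumped using sentencepiece itself, resolving the byte-fallback pieces explicitly (a sketch only, not the repo's actual conversion code; the output filename vocab.bin is made up, and the <0xXX> and '▁' piece conventions are the ones visible in the dumps further down this thread):

import re
import struct

from sentencepiece import SentencePieceProcessor

tokenizer = SentencePieceProcessor("models/tokenizer.model")

with open("vocab.bin", "wb") as fout:
    for i in range(tokenizer.get_piece_size()):
        piece = tokenizer.id_to_piece(i)
        m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", piece)
        if m:
            # byte-fallback piece: one raw byte, not necessarily valid UTF-8 alone
            text = bytes([int(m.group(1), 16)])
        else:
            # '▁' (U+2581) marks a word boundary; store it as a plain space
            text = piece.replace("\u2581", " ").encode("utf-8")
        fout.write(struct.pack("i", len(text)))
        fout.write(text)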

beiller avatar Mar 12 '23 03:03 beiller

I found the problem via some scripting. The tokenizer works differently than the way we are using it. Also, token 29889 is '.', which is why @ggerganov has to remove the '.' character after decoding. It is not affecting anything, so that is good!

from sentencepiece import SentencePieceProcessor

fname_tokenizer = "models/tokenizer.model"

tokenizer = SentencePieceProcessor(fname_tokenizer)

print(tokenizer.decode([29889]))
>>>.

result1 = tokenizer.encode("篇")
print(f'token: {result1}')
>>>[29871, 234, 178, 138]

decode1 = tokenizer.decode(result1)
print(f'decoded: {decode1}')
>>>decoded: 篇

So it seems we need to leverage this tokenizer in the C++ code; the current method of tokenizing is not correct. I can attempt it; it will require adding sentencepiece.

The crux of the issue, if I can try to explain, is that the C++ code tries to find the best matching single token for a span of the input text. But as we see here, this one character actually needs to map to multiple tokens! Strange. I think the tokens could be removed from the model files in the conversion script, and we should just use the sentencepiece C++ code. Thoughts?
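To see the mismatch concretely, here is what sentencepiece reports for the same character at both the piece level and the ID level (a sketch; the expected outputs match the encode/decode results above):

from sentencepiece import SentencePieceProcessor

tokenizer = SentencePieceProcessor("models/tokenizer.model")

# one character becomes four tokens: a word-boundary marker plus three raw bytes
print(tokenizer.encode("篇", out_type=str))
>>> ['▁', '<0xE7>', '<0xAF>', '<0x87>']

print(tokenizer.encode("篇"))
>>> [29871, 234, 178, 138]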

beiller avatar Mar 12 '23 04:03 beiller

Dump the tokenizer.model file to text with:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')

vocab_list = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

with open('vocab.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab_list))

I did not find some characters like '篇', '雨', '许'.

wizd avatar Mar 12 '23 04:03 wizd

@wizd see my comment; it's more complex, it seems, and characters will "translate" to multiple tokens. But it will actually support Chinese, I believe, with some big refactors :P I may have time to make it work; we will see!

Edit

I think we just found out what that big chunk of unprintable characters is for :)

[29871, 234, 178, 138] currently decodes to: 0x20, 0xEFBFBD, 0xEFBFBD, 0xEFBFBD, AKA ���

But in actuality it should be the raw bytes 0x20, 0xE7, 0xAF, 0x87, i.e. ' 篇'.
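A quick demonstration of the mapping (a sketch; the offset of 3 between token IDs and byte values follows from the dumps in this thread, where <0xE7> is ID 234 and the unprintable byte pieces span IDs 131-258, i.e. bytes 0x80-0xFF):

# IDs 234, 178, 138 are the byte pieces <0xE7>, <0xAF>, <0x87>;
# subtracting 3 (the special tokens <unk>, <s>, </s> at IDs 0-2) recovers the bytes
print(bytes([234 - 3, 178 - 3, 138 - 3]).decode('utf-8'))
>>> 篇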

beiller avatar Mar 12 '23 04:03 beiller

Trying to understand it... https://unicode.scarfboy.com/?s=%E7%AF%87

wizd avatar Mar 12 '23 04:03 wizd

It seems we should use this library to tokenize: https://github.com/google/sentencepiece

wizd avatar Mar 12 '23 04:03 wizd

@wizd yes, that is correct. And the code also assumes a one-"word"-to-one-token mapping, which isn't the case. Also, a "word" is not really a word; it's more like a word piece.

beiller avatar Mar 12 '23 04:03 beiller

Yep, and we need to make the code UTF-8 aware: https://github.com/facebookresearch/llama/blob/main/FAQ.md#4-other-languages

wizd avatar Mar 12 '23 05:03 wizd

The code has no problem with UTF-8 so far. I am working on a very hacky solution right now :)

beiller avatar Mar 12 '23 05:03 beiller

I actually got it working in a very hacky way. Example:

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 16 --repeat_last_n 16 -p $'J\'aime le chocolat = I like chocolate\n祝你一天过得愉快 ='

Output:

J'aime le chocolat = I like chocolate
祝你一天过得愉快 = Have a happy holiday
我喜欢 来自中国

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 --repeat_last_n 64 -p $'什么是生命的意义?生命的意义在于'

Output (admittedly cherry-picked; sometimes it contains half English):

什么是生命的意义?生命的意义在于存活,这就上个理论说了所有。人类对现实世界做了出不可思当的窗口式巡回行为,而又没能找到真正天生份目前且无需死去才是最大缺少之处。

beiller avatar Mar 12 '23 06:03 beiller

wow, you are so cool! @beiller

wizd avatar Mar 12 '23 06:03 wizd

Another interesting outcome: it can actually output emojis now!

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 --repeat_last_n 64 -p $'Have you heard this funny joke? '

Have you heard this funny joke? 😆
Are There Any Real Vegans In This House??…If Not, Now’s Your Chance To Change That.
This Is About What Would Happen If You Drank A Gallon of Milk Every Day For One Year

Sadly the joke was not funny or even a joke.

beiller avatar Mar 12 '23 06:03 beiller

The 什么是生命的意义 (meaning of life) output above contains some misunderstandings; maybe there's still an encoding issue?

wizd avatar Mar 12 '23 06:03 wizd

Maybe we can use some fact-checking to verify the output, e.g.

关于爱因斯坦的生平。他出生于 About the life of Einstein. He was born in

If the output is wrong, we can catch it easily.

wizd avatar Mar 12 '23 06:03 wizd

Response

关于爱因斯坦的生平。他出生于1856年,但已经成为了一个名人在1902年之后,是世界上最知名的数学家和科学家

(Roughly: "About the life of Einstein. He was born in 1856, but became a famous person after 1902; he is the world's best-known mathematician and scientist.")

English response (died in the future lol)

About the life of Einstein. He was born in 1879, he died on April 10th ,2054; his parents were Hermann and Pauline Winteler-Einstein . His first job as a clerk at Bern patent office where

I think the strange responses are due to my using the smaller 13B model. Maybe a bigger model is more accurate. I think the Unicode issue is resolved (very hacky, though).

@ggerganov I can fix it, but I suck with C++ and won't know how to include the sentencepiece C++ library and have it compile and link. I need help with that part. I had the C++ call Python functions for my solution here, since Python was already installed, but that is a travesty, I'm sure.

beiller avatar Mar 12 '23 06:03 beiller

@beiller Adding sentencepiece to the project will be a last resort - don't want to bloat the repo too much. Hopefully, we can figure out a concise C++ implementation.

ggerganov avatar Mar 12 '23 06:03 ggerganov

@beiller key thing to be aware of: the tokenizer works on bytes, not on characters. So (see the sketch after this list):

  • one character could take multiple tokens (up to 4 without emoji support, 10+ with emojis)
  • tokens can and will start and/or end in the middle of multibyte characters
  • tokenization is not unique (there can be multiple ways to encode a given text)
  • when decoding some token sequences you'll sometimes get invalid UTF-8; this is to be expected
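A minimal sketch of how a printer could cope with the last two points, buffering bytes until they form valid UTF-8 (generated_ids and token_bytes() are hypothetical stand-ins for the model's output stream and a per-token raw-byte lookup):

buf = b''
for token_id in generated_ids:      # hypothetical stream of sampled token IDs
    buf += token_bytes(token_id)    # hypothetical per-token raw-byte lookup
    try:
        print(buf.decode('utf-8'), end='', flush=True)
        buf = b''
    except UnicodeDecodeError:
        # incomplete (or mid-character) multibyte sequence; wait for more tokens
        pass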

wizzard0 avatar Mar 12 '23 09:03 wizzard0

Some research: I used sentencepiece to tokenize an input and dumped it. I got this:

piece: ▁
piece: <0xE7>
piece: <0xAF>
piece: <0x87>
piece: <0xE5>
piece: <0xB9>
piece: <0x85>
piece: 已
piece: 经

Meanwhile llama.cpp finds only 3 tokens (1, 31290, 31412):

main: prompt: '篇幅已经'
main: number of tokens in prompt = 3
     1 -> ''
 31290 -> '已'
 31412 -> '经'

"篇幅" is not found because in vocab table it is not what it is, but <0xE7>, <0xAF> ... etc.

wizd avatar Mar 12 '23 13:03 wizd

With sentencepiece (which is full of magic numbers) I can get the result right:

main: prompt: '篇幅已经'
main: number of tokens in prompt = 10
     1 -> ''
 29871 -> '▁'
   234 -> ''
   178 -> ''
   138 -> ''
   232 -> ''
   188 -> ''
   136 -> ''
 31290 -> '已'
 31412 -> '经'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


▁已经达了50万,我们再期望完成到全部五十平民的目标最多能安放自去生活。如果你有人看段情景不好▁就可以关注在线客(部分)▁这里为我们一起同行,因此地方▁全是在他实现的目标。参与将从未来开始开展! [end of text] 
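Note the '▁' glyphs leaking into the generated text above: the U+2581 word-boundary marker is being printed verbatim, so the piece text presumably still needs mapping back to spaces on output, along the lines of:

print(piece.replace('\u2581', ' '), end='')  # hypothetical: piece is a decoded token string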

wizd avatar Mar 12 '23 14:03 wizd