There is another bug: the prompt gets truncated when it is Chinese, as in https://github.com/ggerganov/llama.cpp/issues/11#issuecomment-1465083826
Dump the tokenizer.model file to text by

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')
vocab_list = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]
with open('vocab.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab_list))
```

did...
> trying to understand it... https://unicode.scarfboy.com/?s=%E7%AF%87
seems we should use this library to tokenize: https://github.com/google/sentencepiece
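For reference, a minimal sketch of running a Chinese prompt through sentencepiece with LLaMA's model (the `tokenizer.model` path and the example text are just assumptions for illustration):

```python
import sentencepiece as spm

# Load LLaMA's sentencepiece model (path assumed for illustration)
sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')

text = '关于爱因斯坦的生平'          # any Chinese prompt
pieces = sp.encode_as_pieces(text)  # subword pieces as strings
ids = sp.encode_as_ids(text)        # token ids to feed the model

print(pieces)
print(ids)
```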
Yep. And we need to make the code UTF-8 aware: https://github.com/facebookresearch/llama/blob/main/FAQ.md#4-other-languages
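To see why UTF-8 awareness matters, a small illustration (plain Python, not code from the repo): CJK characters are multi-byte in UTF-8, so cutting the prompt at an arbitrary byte offset can split a character in half:

```python
text = '篇幅已经'            # 4 Chinese characters
data = text.encode('utf-8')

print(len(text))  # 4 characters
print(len(data))  # 12 bytes - each CJK character here is 3 bytes in UTF-8

# A cut that lands inside a character produces a broken sequence:
print(data[:7].decode('utf-8', errors='replace'))  # '篇幅' plus a replacement char
```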
wow, you are so cool! @beiller
> I actually got it working in a very hacky way. Example:
>
> `./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 16 --repeat_last_n 16 -p $'J\'aime le chocolat = I like...
Maybe we can use some fact checking to verify the output, e.g. the prompt 关于爱因斯坦的生平。他出生于 (About the life of Einstein. He was born in). If the output is wrong, we can catch...
Some research: I used sentencepiece to tokenize an input and dumped it. I got this:

```
piece: ▁
piece:
piece:
piece:
piece:
piece:
piece:
piece: 已
piece: 经
1 31290 31412...
```
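That dump can be produced with something along these lines (a sketch, not the exact script used; note that `encode_as_ids` does not prepend the BOS token id 1 that appears at the start of the id list, and the input text is assumed to be the same prompt as below):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')  # path assumed

text = '篇幅已经'  # assumed to match the prompt in the next comment
for piece in sp.encode_as_pieces(text):
    print('piece:', piece)
print(' '.join(str(i) for i in sp.encode_as_ids(text)))
```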
With sentencepiece (which is full of magic numbers) I can get the result right:

```
main: prompt: '篇幅已经'
main: number of tokens in prompt = 10
     1 -> '<s>'
 29871 -> ...
```