There is another bug: the prompt gets truncated when it is Chinese, as in https://github.com/ggerganov/llama.cpp/issues/11#issuecomment-1465083826
Dump the tokenizer.model file to text by

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')
vocab_list = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]
with open('vocab.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab_list))
```

did...
> trying to understand it... https://unicode.scarfboy.com/?s=%E7%AF%87
seems we should use this library to tokenize: https://github.com/google/sentencepiece
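For reference, a minimal sketch of running a Chinese prompt through sentencepiece with LLaMA's model (the `tokenizer.model` path and the example text are just assumptions for illustration):

```python
import sentencepiece as spm

# Load LLaMA's sentencepiece model (path assumed for illustration)
sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')

text = '关于爱因斯坦的生平'          # any Chinese prompt
pieces = sp.encode_as_pieces(text)  # subword pieces as strings
ids = sp.encode_as_ids(text)        # token ids to feed the model

print(pieces)
print(ids)
```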
Yep. And we need to make the code UTF-8 aware: https://github.com/facebookresearch/llama/blob/main/FAQ.md#4-other-languages
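To see why UTF-8 awareness matters, a small illustration (plain Python, not code from the repo): CJK characters are multi-byte in UTF-8, so cutting the prompt at an arbitrary byte offset can split a character in half:

```python
text = '篇幅已经'            # 4 Chinese characters
data = text.encode('utf-8')

print(len(text))  # 4 characters
print(len(data))  # 12 bytes - each CJK character here is 3 bytes in UTF-8

# A cut that lands inside a character produces a broken sequence:
print(data[:7].decode('utf-8', errors='replace'))  # '篇幅' plus a replacement char
```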
wow, you are so cool! @beiller
> I actually got it working in a very hacky way. Example:
>
> `./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 16 --repeat_last_n 16 -p $'J\'aime le chocolat = I like...
Maybe we can use some fact checking to verify the output, e.g. the prompt 关于爱因斯坦的生平。他出生于 (About the life of Einstein. He was born in). If the output is wrong, we can catch...
Some research: I used sentencepiece to tokenize an input and dumped it. I got this:

```
piece: ▁
piece:
piece:
piece:
piece:
piece:
piece:
piece: 已
piece: 经
1 31290 31412...
```
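That dump can be produced with something along these lines (a sketch, not the exact script used; note that `encode_as_ids` does not prepend the BOS token id 1 that appears at the start of the id list, and the input text is assumed to be the same prompt as below):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')  # path assumed

text = '篇幅已经'  # assumed to match the prompt in the next comment
for piece in sp.encode_as_pieces(text):
    print('piece:', piece)
print(' '.join(str(i) for i in sp.encode_as_ids(text)))
```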
With sentencepiece (which is full of magic numbers) I can get the result right:

```
main: prompt: '篇幅已经'
main: number of tokens in prompt = 10
     1 -> '<s>'
 29871 -> ...
```