Fix gpt_tokenize printing an "unknown token" message for each byte of a multi-byte character

Open katsu560 opened this issue 10 months ago • 0 comments

In the current implementation, when gpt_tokenize() encounters an unknown token that is a multi-byte character, it prints a separate "unknown token" message for each byte of that character, as shown below:

test_gpt_tokenizer : 0 tests failed out of 0 tests.
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token ' '
main: number of tokens in prompt = 6
main: token[0] =   5619, 日本で
main: token[1] =   3300, 一番
main: token[2] =   1737, 高い
main: token[3] =  14218, 山は
main: token[4] =  37814, 何で
main: token[5] =  24250, すか

I fixed this so that the message is printed once for the whole character instead of once per byte, as shown below:

test_gpt_tokenizer : 0 tests failed out of 0 tests.
gpt_tokenize: unknown token '?'
main: number of tokens in prompt = 6
main: token[0] =   5619, 日本で
main: token[1] =   3300, 一番
main: token[2] =   1737, 高い
main: token[3] =  14218, 山は
main: token[4] =  37814, 何で
main: token[5] =  24250, すか

Please review and confirm this change.
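For illustration only, here is a minimal sketch of one way to report an unknown character once rather than once per byte. It is not the actual patch: `utf8_len` and `unknown_chars` are hypothetical helpers, and the `is_known` callback stands in for the vocabulary lookup. The idea is to read the UTF-8 lead byte, take the whole sequence as one unit, and advance past all of its bytes together. The three messages in the original output are consistent with a single three-byte UTF-8 character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence that starts with `lead`.
// Returns 1 for ASCII and for invalid lead bytes so the caller always advances.
static std::size_t utf8_len(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                            // continuation or invalid byte
}

// Hypothetical helper: returns one entry per unknown *character*, not per byte.
// `is_known` is an assumed stand-in for the tokenizer's vocabulary lookup.
template <typename IsKnown>
std::vector<std::string> unknown_chars(const std::string & text, IsKnown is_known) {
    std::vector<std::string> unknown;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t n = utf8_len(static_cast<unsigned char>(text[i]));
        if (n > text.size() - i) n = text.size() - i; // truncated sequence at end of input
        assert(n > 0);                                // guarantees forward progress
        const std::string ch = text.substr(i, n);
        if (!is_known(ch)) unknown.push_back(ch);     // one report for the whole character
        i += n;                                       // skip all bytes of this character
    }
    return unknown;
}
```

A caller would print one "unknown token" line per returned entry, so a three-byte character yields one message instead of three.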

-- detail --

original:
$ ./240407up/gpt-neox.org --repeat-last-n 256 --repeat-penalty 1.2 -m models/cyberagent/ggml-calm-1b-q4_0.bin -s 7654321 -p "日本で一番高い山は何ですか?"
main: seed = 7654321
gpt_neox_model_load: loading model from 'models/cyberagent/ggml-calm-1b-q4_0.bin' - please wait ...
gpt_neox_model_load: n_vocab = 52096
gpt_neox_model_load: n_ctx = 2048
gpt_neox_model_load: n_embd = 2048
gpt_neox_model_load: n_head = 16
gpt_neox_model_load: n_layer = 24
gpt_neox_model_load: n_rot = 128
gpt_neox_model_load: par_res = 0
gpt_neox_model_load: ftype = 2002
gpt_neox_model_load: qntvr = 2
gpt_neox_model_load: ggml ctx size = 1917.12 MB
gpt_neox_model_load: memory_size = 384.00 MB, n_mem = 49152
gpt_neox_model_load: .................................... done
gpt_neox_model_load: model size = 764.92 MB / num tensors = 292
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token ' '
main: number of tokens in prompt = 6
main: token[0] = 5619, 日本で
main: token[1] = 3300, 一番
main: token[2] = 1737, 高い
main: token[3] = 14218, 山は
main: token[4] = 37814, 何で
main: token[5] = 24250, すか

日本で一番高い山は何ですか?」。そんな質問を何度か受けてきましたが、 ...

fixed:
$ ./240407up/gpt-neox.mod --repeat-last-n 256 --repeat-penalty 1.2 -m models/cyberagent/ggml-calm-1b-q4_0.bin -s 7654321 -p "日本で一番高い山は何ですか?"
main: seed = 7654321
gpt_neox_model_load: loading model from 'models/cyberagent/ggml-calm-1b-q4_0.bin' - please wait ...
gpt_neox_model_load: n_vocab = 52096
gpt_neox_model_load: n_ctx = 2048
gpt_neox_model_load: n_embd = 2048
gpt_neox_model_load: n_head = 16
gpt_neox_model_load: n_layer = 24
gpt_neox_model_load: n_rot = 128
gpt_neox_model_load: par_res = 0
gpt_neox_model_load: ftype = 2002
gpt_neox_model_load: qntvr = 2
gpt_neox_model_load: ggml ctx size = 1917.12 MB
gpt_neox_model_load: memory_size = 384.00 MB, n_mem = 49152
gpt_neox_model_load: .................................... done
gpt_neox_model_load: model size = 764.92 MB / num tensors = 292
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
gpt_tokenize: unknown token '?'
main: number of tokens in prompt = 6
main: token[0] = 5619, 日本で
main: token[1] = 3300, 一番
main: token[2] = 1737, 高い
main: token[3] = 14218, 山は
main: token[4] = 37814, 何で
main: token[5] = 24250, すか

日本で一番高い山は何ですか?」。そんな質問を何度か受けてきましたが、 ...

katsu560, Apr 20 '24 10:04