
Always "failed to tokenize string!"

Open · w1103693423 opened this issue Mar 19 '23 • 6 comments

failed to tokenize string!

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | failed to tokenize string!

main: prompt: ' china'
main: number of tokens in prompt = 1
     1 -> ''

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

曲ー! /S部ュース / KSHErsLAheLUE - THE NEW CH`,MEgeERSION IS HERE@ÿThis entry was вер in news on JuneSASSSASS8 by adminS [end of text]

w1103693423, Mar 19 '23 11:03

Can you provide the command line and a checksum of the model file?

sw, Mar 19 '23 11:03

Same problem with ggml-model-q4_0.bin; its md5sum is 919e4f8aee6ce4f3fbabb6cbcd7756db.

Shimadaaaaa, Mar 20 '23 08:03

Can you provide the command line and a checksum of the model file?

./main -m ./models/7B/ggml-model-q4_0.bin -p "china" -n 512

checksum (md5sum):

919e4f8aee6ce4f3fbabb6cbcd7756db  ggml-model-q4_0.bin
6efc8dab194ab59e49cd24be5574d85e  consolidated.00.pth

w1103693423, Mar 20 '23 10:03

The files look good, though they are in the "old" format; you'll have to regenerate them if you update to the latest master.
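For reference, a sketch of the regeneration steps (following the README of that era; it assumes the original 7B weights sit under models/7B/, and script names or arguments may differ on newer masters):

# re-convert the 7B model to ggml FP16 format (the trailing 1 selects f16 output)
python3 convert-pth-to-ggml.py models/7B/ 1

# re-quantize to 4 bits (q4_0)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2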

There should be three tokens recognized with the old tokenizer:

main: prompt: ' china'
main: number of tokens in prompt = 3
     1 -> ''
 18558 -> ' chi'
  1056 -> 'na'

The new tokenizer gives different tokens:

main: prompt: ' china'
main: number of tokens in prompt = 3
     1 -> ''
   521 -> ' ch'
  1099 -> 'ina'

I really can't explain this, unless you have some strange terminal encoding set?
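A quick, independent way to check the tokenizer model itself (a hypothetical one-liner; assumes the sentencepiece Python package is installed and that models/tokenizer.model is the tokenizer shipped with the LLaMA weights):

python3 -c 'import sentencepiece as spm; sp = spm.SentencePieceProcessor(model_file="models/tokenizer.model"); print(sp.encode(" china", out_type=str))'

If that prints sub-word pieces for ' china', the tokenizer file itself is intact and the problem is on the llama.cpp side.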

sw, Mar 20 '23 19:03

My terminal encoding is LANG=en_US.UTF-8.
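To double-check what the shell actually reports (standard glibc tooling, nothing llama.cpp-specific):

locale          # prints LANG, LC_ALL, LC_CTYPE, ...
echo "$LANG"    # en_US.UTF-8 here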

w1103693423, Mar 22 '23 06:03

Thank you very much. It works now after I upgraded Python to 3.9, pulled the latest master code, and redeployed.

w1103693423, Mar 22 '23 09:03

Possibly a duplicate of #113.

sw, Apr 07 '23 16:04