
Always "failed to tokenize string!"

Open · w1103693423 opened this issue Mar 19 '23 • 6 comments

failed to tokenize string!

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | failed to tokenize string!

main: prompt: ' china'
main: number of tokens in prompt = 1
     1 -> ''

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

曲ー! /S部ュース / KSHErsLAheLUE - THE NEW CH`,MEgeERSION IS HERE@ÿThis entry was вер in news on JuneSASSSASS8 by adminS [end of text]

w1103693423, Mar 19 '23 11:03

Can you provide the command line and a checksum of the model file?

sw, Mar 19 '23 11:03

Same problem with ggml-model-q4_0.bin; its md5sum is 919e4f8aee6ce4f3fbabb6cbcd7756db.

Shimadaaaaa, Mar 20 '23 08:03

Can you provide the command line and a checksum of the model file?

./main -m ./models/7B/ggml-model-q4_0.bin -p "china" -n 512

checksum (md5sum):

919e4f8aee6ce4f3fbabb6cbcd7756db  ggml-model-q4_0.bin
6efc8dab194ab59e49cd24be5574d85e  consolidated.00.pth

w1103693423, Mar 20 '23 10:03

The files look good, though they are in the "old" format; you'll have to regenerate them if you update to the latest master.
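For reference, a sketch of the regeneration steps (following the README of that era; it assumes the original 7B weights sit under models/7B/, and script names or arguments may differ on newer masters):

# re-convert the 7B model to ggml FP16 format (the trailing 1 selects f16 output)
python3 convert-pth-to-ggml.py models/7B/ 1

# re-quantize to 4 bits (q4_0)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2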

There should be three tokens recognized with the old tokenizer:

main: prompt: ' china'
main: number of tokens in prompt = 3
     1 -> ''
 18558 -> ' chi'
  1056 -> 'na'

The new tokenizer gives different tokens:

main: prompt: ' china'
main: number of tokens in prompt = 3
     1 -> ''
   521 -> ' ch'
  1099 -> 'ina'

I really can't explain this, unless you have some strange terminal encoding set?
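A quick, independent way to check the tokenizer model itself (a hypothetical one-liner; assumes the sentencepiece Python package is installed and that models/tokenizer.model is the tokenizer shipped with the LLaMA weights):

python3 -c 'import sentencepiece as spm; sp = spm.SentencePieceProcessor(model_file="models/tokenizer.model"); print(sp.encode(" china", out_type=str))'

If that prints sub-word pieces for ' china', the tokenizer file itself is intact and the problem is on the llama.cpp side.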

sw, Mar 20 '23 19:03

My terminal encoding is LANG=en_US.UTF-8.
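To double-check what the shell actually reports (standard glibc tooling, nothing llama.cpp-specific):

locale          # prints LANG, LC_ALL, LC_CTYPE, ...
echo "$LANG"    # en_US.UTF-8 here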

w1103693423, Mar 22 '23 06:03

Thank you very much. It works now after I upgraded Python to 3.9, pulled the latest master code, and redeployed.

w1103693423, Mar 22 '23 09:03

Possibly a duplicate of #113.

sw, Apr 07 '23 16:04