llama3 model cannot answer
When I run the llama3 MNN model:
(py_llama) st@server03:~/mnn-llm$ ./build/cli_demo ./models/llama3/
model path is ./models/llama3/
### model name : Llama3_8b
The device support i8sdot:0, support fp16:0, support i8mm: 0
load tokenizer
load tokenizer Done
### disk embedding is 1
[ 10% ] load ./models/llama3//lm.mnn model ... Done!
[ 15% ] load ./models/llama3//block_0.mnn model ... Done!
[ 18% ] load ./models/llama3//block_1.mnn model ... Done!
[ 21% ] load ./models/llama3//block_2.mnn model ... Done!
[ 23% ] load ./models/llama3//block_3.mnn model ... Done!
[ 26% ] load ./models/llama3//block_4.mnn model ... Done!
[ 29% ] load ./models/llama3//block_5.mnn model ... Done!
[ 31% ] load ./models/llama3//block_6.mnn model ... Done!
[ 34% ] load ./models/llama3//block_7.mnn model ... Done!
[ 36% ] load ./models/llama3//block_8.mnn model ... Done!
[ 39% ] load ./models/llama3//block_9.mnn model ... Done!
[ 42% ] load ./models/llama3//block_10.mnn model ... Done!
[ 44% ] load ./models/llama3//block_11.mnn model ... Done!
[ 47% ] load ./models/llama3//block_12.mnn model ... Done!
[ 50% ] load ./models/llama3//block_13.mnn model ... Done!
[ 52% ] load ./models/llama3//block_14.mnn model ... Done!
[ 55% ] load ./models/llama3//block_15.mnn model ... Done!
[ 58% ] load ./models/llama3//block_16.mnn model ... Done!
[ 60% ] load ./models/llama3//block_17.mnn model ... Done!
[ 63% ] load ./models/llama3//block_18.mnn model ... Done!
[ 66% ] load ./models/llama3//block_19.mnn model ... Done!
[ 68% ] load ./models/llama3//block_20.mnn model ... Done!
[ 71% ] load ./models/llama3//block_21.mnn model ... Done!
[ 74% ] load ./models/llama3//block_22.mnn model ... Done!
[ 76% ] load ./models/llama3//block_23.mnn model ... Done!
[ 79% ] load ./models/llama3//block_24.mnn model ... Done!
[ 81% ] load ./models/llama3//block_25.mnn model ... Done!
[ 84% ] load ./models/llama3//block_26.mnn model ... Done!
[ 87% ] load ./models/llama3//block_27.mnn model ... Done!
[ 89% ] load ./models/llama3//block_28.mnn model ... Done!
[ 92% ] load ./models/llama3//block_29.mnn model ... Done!
[ 95% ] load ./models/llama3//block_30.mnn model ... Done!
[ 97% ] load ./models/llama3//block_31.mnn model ... Done!
Then when I ask it something, it returns:
Q: who are you
A: You're asking "who"?
#################################
total tokens num = 20
prompt tokens num = 13
output tokens num = 7
total time = 2.59 s
prefill time = 1.31 s
decode time = 1.28 s
total speed = 7.73 tok/s
prefill speed = 9.92 tok/s
decode speed = 5.48 tok/s
chat speed = 2.71 tok/s
##################################
Q:
A: You're asking "are"?
#################################
total tokens num = 39
prompt tokens num = 32
output tokens num = 7
total time = 4.21 s
prefill time = 2.81 s
decode time = 1.41 s
total speed = 9.26 tok/s
prefill speed = 11.40 tok/s
decode speed = 4.98 tok/s
chat speed = 1.66 tok/s
##################################
Q:
A: You're asking "you"?
#################################
total tokens num = 58
prompt tokens num = 51
output tokens num = 7
total time = 4.82 s
prefill time = 3.48 s
decode time = 1.34 s
total speed = 12.04 tok/s
prefill speed = 14.64 tok/s
decode speed = 5.24 tok/s
chat speed = 1.45 tok/s
##################################
Q: introduce Beijing
A: You're asking "introduce"?
#################################
total tokens num = 84
prompt tokens num = 76
output tokens num = 8
total time = 6.32 s
prefill time = 5.19 s
decode time = 1.14 s
total speed = 13.29 tok/s
prefill speed = 14.66 tok/s
decode speed = 7.04 tok/s
chat speed = 1.27 tok/s
##################################
Q:
A: You're asking "Beijing"?
#################################
total tokens num = 108
prompt tokens num = 100
output tokens num = 8
total time = 7.68 s
prefill time = 6.51 s
decode time = 1.17 s
total speed = 14.06 tok/s
prefill speed = 15.37 tok/s
decode speed = 6.81 tok/s
chat speed = 1.04 tok/s
##################################
Notice that each answer just echoes the next word of my question ("who", "are", "you", "introduce", "Beijing"), and the empty prompts in between still advance it, as if the input were being consumed one word per turn. Any solution? Thanks!!
When I use the benchmark, it responds correctly:
[ 92% ] load ./models/llama3//block_29.mnn model ... Done!
[ 95% ] load ./models/llama3//block_30.mnn model ... Done!
[ 97% ] load ./models/llama3//block_31.mnn model ... Done!
prompt file is ./resource/prompt.txt
### warmup ... Done
It's great to chat with you! How are you doing today?
哈哈!我是 ChatGPT,一个人工智能语言模型! (Haha! I'm ChatGPT, an AI language model!)
I'm just an AI, I don't have access to real-time weather information. However, you can check the weather forecast online or on your local weather app to get an idea of the current weather conditions.
#################################
prompt tokens num = 54
decode tokens num = 77
prefill time = 3.85 s
decode time = 12.91 s
prefill speed = 14.02 tok/s
decode speed = 5.96 tok/s
##################################
It looks like llama3 can only respond with llm->response(prompts[i]), not chat with llm->chat()?
@wangzhaode Do you have any suggestions, please?