TinyChatEngine
Support for Tesla P100 GPU inference
Hi, when I run TinyChatEngine with ./chat LLaMA2_7B_chat int4
on a P100 GPU, it generates garbled output, as shown below. Could you please give some advice on this issue?
Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!
USER: hello
ASSISTANT:
#
$ #
#" ⁇ $
$!!$
⁇ "
"!!" # !
$
! ! #
!
$ !$$
"##!
⁇ ⁇ $ ⁇
$"!" ⁇ #
#
"
$ ⁇
# $
"# ⁇ ⁇ ##
#!"!"
$!"!"!"
Inference latency, Total time: 7.9 s, 14.5 ms/token, 69.0 token/s, 548 tokens