TinyChatEngine Support to Tesla P100 GPU inference

Open songkq opened this issue 9 months ago • 5 comments

Hi, when I run TinyChatEngine with ./chat LLaMA2_7B_chat int4 on a Tesla P100 GPU, it generates garbled output (see the log below). Could you please give some advice on this issue?

Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!
USER: hello
ASSISTANT:

 #
$  #

#" ⁇ $
  $!!$
       ⁇ "

"!!" #         !
$
         ! !    #


!
$	!$$
"##!
 ⁇  ⁇ 	$ ⁇

        $"!" ⁇ 	#

        #
"



        $ ⁇

#	 $
 "# ⁇  ⁇ ##
#!"!"
$!"!"!"

Inference latency, Total time: 7.9 s, 14.5 ms/token, 69.0 token/s, 548 tokens
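For what it's worth, the figures in the latency line above agree with each other, so the timing report itself looks sane even though the generated text is not. A minimal sketch of that arithmetic in Python (the variable names are just illustrative, not anything from TinyChatEngine):

```python
# Figures copied from the "Inference latency" line in the log above.
total_time_s = 7.9
ms_per_token = 14.5
tokens_per_s = 69.0
num_tokens = 548

# Recompute the derived figures from the per-token latency.
derived_total_s = num_tokens * ms_per_token / 1000.0
derived_tokens_per_s = 1000.0 / ms_per_token

print(f"total time: {derived_total_s:.1f} s (reported {total_time_s})")
print(f"throughput: {derived_tokens_per_s:.1f} token/s (reported {tokens_per_s})")
```

Both recomputed values round to the reported ones. This only checks the timing numbers; it says nothing about why the output tokens are wrong.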

Sep 14 '23 05:09 songkq
