
Assistant spitting out non-readable characters on RTX 4060

zhefciad opened this issue on Oct 26 '23

(TinyChatEngine) zhef@zhef:~/TinyChatEngine/llm$ make chat -j
CUDA is available!
src/Generate.cc src/LLaMATokenizer.cc src/OPTGenerate.cc src/OPTTokenizer.cc src/utils.cc src/nn_modules/Fp32OPTAttention.cc src/nn_modules/Fp32OPTDecoder.cc src/nn_modules/Fp32OPTDecoderLayer.cc src/nn_modules/Fp32OPTForCausalLM.cc src/nn_modules/Fp32llamaAttention.cc src/nn_modules/Fp32llamaDecoder.cc src/nn_modules/Fp32llamaDecoderLayer.cc src/nn_modules/Fp32llamaForCausalLM.cc src/nn_modules/Int4OPTAttention.cc src/nn_modules/Int4OPTDecoder.cc src/nn_modules/Int4OPTDecoderLayer.cc src/nn_modules/Int4OPTForCausalLM.cc src/nn_modules/Int8OPTAttention.cc src/nn_modules/Int8OPTDecoder.cc src/nn_modules/Int8OPTDecoderLayer.cc src/nn_modules/OPTForCausalLM.cc src/ops/BMM_F32T.cc src/ops/BMM_S8T_S8N_F32T.cc src/ops/BMM_S8T_S8N_S8T.cc src/ops/LayerNorm.cc src/ops/LayerNormQ.cc src/ops/LlamaRMSNorm.cc src/ops/RotaryPosEmb.cc src/ops/W8A8B8O8Linear.cc src/ops/W8A8B8O8LinearReLU.cc src/ops/W8A8BFP32OFP32Linear.cc src/ops/arg_max.cc src/ops/batch_add.cc src/ops/embedding.cc src/ops/linear.cc src/ops/softmax.cc ../kernels/matmul_imp.cc ../kernels/matmul_int4.cc ../kernels/matmul_int8.cc
../kernels/cuda/matmul_ref_fp32.cc ../kernels/cuda/matmul_ref_int8.cc
../kernels/cuda/gemv_cuda.cu ../kernels/cuda/matmul_int4.cu  src/nn_modules/cuda/Int4llamaAttention.cu src/nn_modules/cuda/Int4llamaDecoder.cu src/nn_modules/cuda/Int4llamaDecoderLayer.cu src/nn_modules/cuda/Int4llamaForCausalLM.cu src/nn_modules/cuda/LLaMAGenerate.cu src/nn_modules/cuda/utils.cu src/ops/cuda/BMM_F16T.cu src/ops/cuda/LlamaRMSNorm.cu src/ops/cuda/RotaryPosEmb.cu src/ops/cuda/batch_add.cu src/ops/cuda/embedding.cu src/ops/cuda/linear.cu src/ops/cuda/softmax.cu
make: 'chat' is up to date.
(TinyChatEngine) zhef@zhef:~/TinyChatEngine/llm$ ./chat
TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!
USER: Hi, I'm Jeff!
ASSISTANT:

 #
$  ⸮#

#" ⁇ $
   $!!$
        ⁇ "

"!!" #         !
$
         ! !    #


!⸮
$       !$$
"##!
 ⁇ ⸮ ⁇  $ ⁇

        $"!" ⁇  #

        ⸮#
"


⸮
        $ ⁇

#        $
 "# ⁇  ⁇ ##
⸮#!"!"
$!"!" !"

Inference latency, Total time: 40.5 s, 73.9 ms/token, 13.5 token/s, 548 tokens
USER:

I have an RTX 4060 Windows laptop and ran this under WSL Ubuntu. I modified the Makefile to match my compute capability (89). Did I do something wrong, or is this still not supported?
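For reference, a minimal stand-alone CUDA program (a sketch, not part of TinyChatEngine; the file name check_cc.cu is made up here, and it assumes nvcc is available inside WSL) can confirm the compute capability that the Makefile's nvcc arch flags, e.g. -gencode arch=compute_89,code=sm_89, need to match:

// check_cc.cu -- stand-alone sanity check, not part of TinyChatEngine
// build and run: nvcc check_cc.cu -o check_cc && ./check_cc
#include <cstdio>
#include <cuda_runtime.h>
int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        std::printf("No CUDA device visible: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // An RTX 4060 should report 8.9; the Makefile's -gencode flags
        // should target this value.
        std::printf("Device %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}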

zhefciad · Oct 26 '23 00:10

GTX 1070 here, same issue.

[Screenshot from 2023-11-10 17-43-26]

dt1729 · Nov 11 '23 00:11

@zhefciad I am also facing the same issue on an RTX 4090. Is there any workaround?

twk10 · Jul 16 '24 08:07

I am hitting the same problem. Have you solved it?

LebinLiang · Oct 07 '24 05:10

No, I just use Ollama instead.

zhefciad · Oct 08 '24 14:10