TinyChatEngine
Assistant spitting out non-readable characters on RTX 4060
(TinyChatEngine) zhef@zhef:~/TinyChatEngine/llm$ make chat -j
CUDA is available!
src/Generate.cc src/LLaMATokenizer.cc src/OPTGenerate.cc src/OPTTokenizer.cc src/utils.cc src/nn_modules/Fp32OPTAttention.cc src/nn_modules/Fp32OPTDecoder.cc src/nn_modules/Fp32OPTDecoderLayer.cc src/nn_modules/Fp32OPTForCausalLM.cc src/nn_modules/Fp32llamaAttention.cc src/nn_modules/Fp32llamaDecoder.cc src/nn_modules/Fp32llamaDecoderLayer.cc src/nn_modules/Fp32llamaForCausalLM.cc src/nn_modules/Int4OPTAttention.cc src/nn_modules/Int4OPTDecoder.cc src/nn_modules/Int4OPTDecoderLayer.cc src/nn_modules/Int4OPTForCausalLM.cc src/nn_modules/Int8OPTAttention.cc src/nn_modules/Int8OPTDecoder.cc src/nn_modules/Int8OPTDecoderLayer.cc src/nn_modules/OPTForCausalLM.cc src/ops/BMM_F32T.cc src/ops/BMM_S8T_S8N_F32T.cc src/ops/BMM_S8T_S8N_S8T.cc src/ops/LayerNorm.cc src/ops/LayerNormQ.cc src/ops/LlamaRMSNorm.cc src/ops/RotaryPosEmb.cc src/ops/W8A8B8O8Linear.cc src/ops/W8A8B8O8LinearReLU.cc src/ops/W8A8BFP32OFP32Linear.cc src/ops/arg_max.cc src/ops/batch_add.cc src/ops/embedding.cc src/ops/linear.cc src/ops/softmax.cc ../kernels/matmul_imp.cc ../kernels/matmul_int4.cc ../kernels/matmul_int8.cc
../kernels/cuda/matmul_ref_fp32.cc ../kernels/cuda/matmul_ref_int8.cc
../kernels/cuda/gemv_cuda.cu ../kernels/cuda/matmul_int4.cu src/nn_modules/cuda/Int4llamaAttention.cu src/nn_modules/cuda/Int4llamaDecoder.cu src/nn_modules/cuda/Int4llamaDecoderLayer.cu src/nn_modules/cuda/Int4llamaForCausalLM.cu src/nn_modules/cuda/LLaMAGenerate.cu src/nn_modules/cuda/utils.cu src/ops/cuda/BMM_F16T.cu src/ops/cuda/LlamaRMSNorm.cu src/ops/cuda/RotaryPosEmb.cu src/ops/cuda/batch_add.cu src/ops/cuda/embedding.cu src/ops/cuda/linear.cu src/ops/cuda/softmax.cu
make: 'chat' is up to date.
(TinyChatEngine) zhef@zhef:~/TinyChatEngine/llm$ ./chat
TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!
USER: Hi, I'm Jeff!
ASSISTANT:
#
$ ⸮#
#" ⁇ $
$!!$
⁇ "
"!!" # !
$
! ! #
!⸮
$ !$$
"##!
⁇ ⸮ ⁇ $ ⁇
$"!" ⁇ #
⸮#
"
⸮
$ ⁇
# $
"# ⁇ ⁇ ##
⸮#!"!"
$!"!" !"
Inference latency, Total time: 40.5 s, 73.9 ms/token, 13.5 token/s, 548 tokens
USER:
I have an RTX 4060 Windows laptop and ran this under WSL (Ubuntu). I modified the Makefile to match my GPU's compute capability (89). Did I do something wrong, or is this still not supported?
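For anyone wanting to double-check that the Makefile's arch value actually matches the GPU: below is a minimal standalone sketch (not part of TinyChatEngine; it only assumes a standard CUDA toolkit) that prints each device's compute capability at runtime. An RTX 4060 or 4090 (Ada) should report 8.9, i.e. sm_89; a GTX 1070 (Pascal) reports 6.1, i.e. sm_61.

```cuda
// check_cc.cu -- standalone sketch, not part of TinyChatEngine.
// Prints each CUDA device's compute capability so the Makefile's
// arch flag (e.g. sm_89) can be verified against the actual hardware.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // RTX 4060/4090 (Ada) report 8.9 -> sm_89; GTX 1070 (Pascal) reports 6.1 -> sm_61.
        std::printf("Device %d: %s, compute capability %d.%d (sm_%d%d)\n",
                    i, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
}
```

Build and run it with `nvcc check_cc.cu -o check_cc && ./check_cc`, then compare the reported sm value against whatever the Makefile was edited to use. Remember to do a clean rebuild afterwards so the change actually takes effect.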
Same issue here on a GTX 1070.
@zhefciad I am also facing the same issue on an RTX 4090. Is there any workaround?
I'm running into the same problem. Have you solved it?
No, I just use Ollama instead.