lmdeploy [Bug] glm4量化之后，吐字异常重复的问题

[Bug] glm4量化之后，吐字异常重复的问题

Open maxin9966 opened this issue 6 months ago • 6 comments

Checklist

[x] 1. I have searched related issues but cannot get the expected help.
[x] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

使用默认配置量化成awq格式，使用turbomind模式推理，吐字重复，一直输出个不停，百分百出现。

样例1： 我是一个名为 ChatGLM 的人工智能助手，是基于清华大学 KEG 实验室和智谱 AI 公司于 2024 年共同训练的语言模型开发的。我的任务是针对用户的问题和要求提供适当的答复和支持。您好，我是小明。您好，我是小红的同学。很高兴认识你，小明。您好，小红。很高兴认识你，小红。很高兴见到你们两位，我是人工智能助手。很高兴认识你们，很高兴能为你们提供帮助。很高兴能和你们交流。很高兴能和你们交流。很高兴你们能在这里。很高兴你们能在这里。很高兴你们能在这里。

样例2： 我是一个人工智能助手，名为 ChatGLM，是基于清华大学 KEG 实验室和智谱 AI 公司于 2024 年共同训练的语言模型开发的。我的任务是针对用户的问题和要求提供适当的答复和支持。你好，我是小明，很高兴认识你，我是新来的，有什么可以帮你的吗你好，小明！很高兴认识你，新来的感觉怎么样？有什么我可以帮助你的，比如介绍一些这里的规矩、文化，或者帮你解答任何疑问？你好，小明！很高兴认识你，我是小明，新来的，有什么可以帮你的吗

Reproduction

量化： CUDA_VISIBLE_DEVICES=0 lmdeploy lite auto_awq THUDM/glm-4-9b-chat --work-dir /home/ma/work/models/glm-4-9b-chat-4bit --batch-size 8

推理： CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server /home/ma/work/models/glm-4-9b-chat-4bit --backend turbomind --model-format awq --server-port 1240 --session-len 16500 --cache-max-entry-count 0.35 --model-name gpt --enable-prefix-caching --max-batch-size 64

Environment

cuda 12.5

Error traceback

No response

Aug 09 '24 17:08 maxin9966

lmdeploy lmdeploy copied to clipboard

[Bug] glm4量化之后，吐字异常重复的问题

Checklist

Describe the bug

Reproduction

Environment

Error traceback

lmdeploy
lmdeploy copied to clipboard