[Bug]: Deploying qwen2.5-32b-instruct-gptq-int4 with lmdeploy on a machine with four 16GB V100 GPUs, the peak output speed is only 80 tokens/s. Is this speed normal?
### Model Series
Qwen2.5
### What are the models used?
qwen2.5-32b-instruct-gptq-int4, qwen2.5-32b-instruct-gptq-int8
### What is the scenario where the problem happened?
deployment with lmdeploy and vllm
### Is this a known issue?
- [X] I have followed the GitHub README.
- [X] I have checked the Qwen documentation and cannot find an answer there.
- [X] I have checked the documentation of the related framework and cannot find useful information.
- [X] I have searched the issues and there is not a similar one.
### Information about environment
lmdeploy environment:

```
Package Version
------------------------- -----------
accelerate 1.0.1
addict 2.4.0
aiofiles 23.2.1
aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
airportsdata 20241001
annotated-types 0.7.0
anyio 4.6.2.post1
async-timeout 4.0.3
attrs 24.2.0
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
cloudpickle 3.1.0
datasets 3.0.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
einops 0.8.0
exceptiongroup 1.2.2
fastapi 0.115.2
ffmpy 0.4.0
filelock 3.16.1
fire 0.7.0
frozenlist 1.4.1
fsspec 2024.6.1
gradio 5.1.0
gradio_client 1.4.0
grpcio 1.66.2
h11 0.14.0
httpcore 1.0.6
httpx 0.27.2
huggingface-hub 0.25.2
idna 3.10
importlib_metadata 8.5.0
interegular 0.3.3
Jinja2 3.1.4
jiter 0.6.1
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
lark 1.2.2
lmdeploy 0.6.1
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mmengine-lite 0.10.5
mpmath 1.3.0
multidict 6.1.0
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.4.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.77
nvidia-nvtx-cu12 12.1.105
openai 1.51.2
orjson 3.10.7
outlines 0.1.0
outlines_core 0.1.0
packaging 24.1
pandas 2.2.3
peft 0.11.1
pillow 10.4.0
pip 24.2
platformdirs 4.3.6
propcache 0.2.0
protobuf 4.25.5
psutil 6.0.0
pyarrow 17.0.0
pycountry 24.6.1
pydantic 2.9.2
pydantic_core 2.23.4
pydub 0.25.1
Pygments 2.18.0
pynvml 11.5.3
python-dateutil 2.9.0.post0
python-multipart 0.0.12
python-rapidjson 1.20
pytz 2024.2
PyYAML 6.0.2
referencing 0.35.1
regex 2024.9.11
requests 2.32.3
rich 13.9.2
rpds-py 0.20.0
ruff 0.6.9
safetensors 0.4.5
semantic-version 2.10.0
sentencepiece 0.2.0
setuptools 75.1.0
shellingham 1.5.4
shortuuid 1.0.13
six 1.16.0
sniffio 1.3.1
starlette 0.40.0
sympy 1.13.3
termcolor 2.5.0
tiktoken 0.8.0
tokenizers 0.20.1
tomli 2.0.2
tomlkit 0.12.0
torch 2.3.1
torchvision 0.18.1
tqdm 4.66.5
transformers 4.45.2
triton 2.3.1
tritonclient 2.50.0
typer 0.12.5
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.31.1
websockets 12.0
wheel 0.44.0
xxhash 3.5.0
yapf 0.40.2
yarl 1.15.2
zipp 3.20.2
```
vllm environment:

```
Package Version
--------------------------------- -------------
aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.6.2.post1
async-timeout 4.0.3
attrs 24.2.0
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
cloudpickle 3.1.0
datasets 3.0.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
einops 0.8.0
exceptiongroup 1.2.2
fastapi 0.115.2
filelock 3.16.1
frozenlist 1.4.1
fsspec 2024.6.1
gguf 0.10.0
h11 0.14.0
httpcore 1.0.6
httptools 0.6.2
httpx 0.27.2
huggingface-hub 0.25.2
idna 3.10
importlib_metadata 8.5.0
interegular 0.3.3
Jinja2 3.1.4
jiter 0.6.1
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
lark 1.2.2
llvmlite 0.43.0
lm-format-enforcer 0.10.6
MarkupSafe 3.0.1
mistral_common 1.4.4
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.18.6
multidict 6.1.0
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.4.1
numba 0.60.0
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.77
nvidia-nvtx-cu12 12.1.105
openai 1.51.2
opencv-python-headless 4.10.0.84
outlines 0.0.46
packaging 24.1
pandas 2.2.3
partial-json-parser 0.2.1.1.post4
pillow 10.4.0
pip 24.2
prometheus_client 0.21.0
prometheus-fastapi-instrumentator 7.0.0
propcache 0.2.0
protobuf 5.28.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 17.0.0
pycountry 24.6.1
pydantic 2.9.2
pydantic_core 2.23.4
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytz 2024.2
PyYAML 6.0.2
pyzmq 26.2.0
ray 2.37.0
referencing 0.35.1
regex 2024.9.11
requests 2.32.3
rpds-py 0.20.0
safetensors 0.4.5
sentencepiece 0.2.0
setuptools 75.1.0
six 1.16.0
sniffio 1.3.1
starlette 0.40.0
sympy 1.13.3
tiktoken 0.7.0
tokenizers 0.20.1
torch 2.4.0
torchvision 0.19.0
tqdm 4.66.5
transformers 4.45.2
triton 3.0.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.31.1
uvloop 0.21.0
vllm 0.6.3
watchfiles 0.24.0
websockets 13.1
wheel 0.44.0
xformers 0.0.27.post2
xxhash 3.5.0
yarl 1.15.2
zipp 3.20.2
```
### Log output
Running qwen2.5-32b-instruct-gptq-int8 with vllm:

```
INFO 10-18 03:33:40 engine.py:292] Added request chat-16c0f2740e8044d986b16ae0b68a6c7e.
INFO 10-18 03:33:41 metrics.py:345] Avg prompt throughput: 7.3 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 10-18 03:33:46 metrics.py:345] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO: 10.126.126.1:63229 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 10-18 03:34:00 metrics.py:345] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.6 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
```
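To separate raw engine throughput from HTTP-serving overhead, the same model can also be benchmarked with vllm's offline Python API. Below is a minimal sketch, assuming the model sits at ./Qwen2.5-32B-Instruct-GPTQ-Int8; the prompt and sampling settings are placeholders, not the exact measurement procedure used above.

```python
# Minimal sketch: measure raw generation throughput with vllm's offline API.
# Model path, prompt, and sampling settings are assumptions for illustration.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="./Qwen2.5-32B-Instruct-GPTQ-Int8",  # assumed local path
    tensor_parallel_size=4,                    # four V100 GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(["Briefly explain what GPTQ quantization does."], params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests and report tokens/s.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tokens/s")
```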
### Description
Deploying qwen2.5-32b-instruct-gptq-int4 with lmdeploy on a machine with four 16GB V100 GPUs, the peak output speed is only 80 tokens/s. The command used is:

```
lmdeploy serve api_server ./Qwen2.5-32B-Instruct-GPTQ-Int4 --model-format gptq --tp 4 --quant-policy 8
```

In addition, running Qwen2.5-32B-Instruct-GPTQ-Int8 with the latest vllm on the same machine reaches a peak output speed of only 50 tokens/s.
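For reference, the tokens/s figures can also be checked from the client side, since both lmdeploy's api_server and vllm expose an OpenAI-compatible endpoint. A minimal sketch follows, assuming the server listens on http://localhost:8000/v1 and the served model name matches the folder name; the prompt and generation length are placeholders:

```python
# Minimal sketch: client-side throughput check against an OpenAI-compatible
# endpoint. URL, model name, and prompt are assumptions for illustration.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="Qwen2.5-32B-Instruct-GPTQ-Int4",  # assumed served model name
    messages=[{"role": "user", "content": "Write a short poem about GPUs."}],
    max_tokens=256,
    temperature=0.7,
)
elapsed = time.perf_counter() - start

# Note: this measures end-to-end latency, so it includes prefill time,
# not just decode throughput.
completion_tokens = resp.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"-> {completion_tokens / elapsed:.1f} tokens/s")
```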