[Bug]: Deploying qwen2.5-32b-instruct-gptq-int4 with lmdeploy on a machine with four 16GB V100 GPUs, the peak output speed is only 80 tokens/s. Is this speed normal?
### Model Series
Qwen2.5
### What are the models used?
qwen2.5-32b-instruct-gptq-int4, qwen2.5-32b-instruct-gptq-int8
### What is the scenario where the problem happened?
deployment with lmdeploy and vllm
### Is this a known issue?
- [X] I have followed the GitHub README.
- [X] I have checked the Qwen documentation and cannot find an answer there.
- [X] I have checked the documentation of the related framework and cannot find useful information.
- [X] I have searched the issues and there is not a similar one.
### Information about environment
lmdeploy environment:

```
Package Version
------------------------- -----------
accelerate 1.0.1
addict 2.4.0
aiofiles 23.2.1
aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
airportsdata 20241001
annotated-types 0.7.0
anyio 4.6.2.post1
async-timeout 4.0.3
attrs 24.2.0
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
cloudpickle 3.1.0
datasets 3.0.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
einops 0.8.0
exceptiongroup 1.2.2
fastapi 0.115.2
ffmpy 0.4.0
filelock 3.16.1
fire 0.7.0
frozenlist 1.4.1
fsspec 2024.6.1
gradio 5.1.0
gradio_client 1.4.0
grpcio 1.66.2
h11 0.14.0
httpcore 1.0.6
httpx 0.27.2
huggingface-hub 0.25.2
idna 3.10
importlib_metadata 8.5.0
interegular 0.3.3
Jinja2 3.1.4
jiter 0.6.1
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
lark 1.2.2
lmdeploy 0.6.1
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mmengine-lite 0.10.5
mpmath 1.3.0
multidict 6.1.0
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.4.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.77
nvidia-nvtx-cu12 12.1.105
openai 1.51.2
orjson 3.10.7
outlines 0.1.0
outlines_core 0.1.0
packaging 24.1
pandas 2.2.3
peft 0.11.1
pillow 10.4.0
pip 24.2
platformdirs 4.3.6
propcache 0.2.0
protobuf 4.25.5
psutil 6.0.0
pyarrow 17.0.0
pycountry 24.6.1
pydantic 2.9.2
pydantic_core 2.23.4
pydub 0.25.1
Pygments 2.18.0
pynvml 11.5.3
python-dateutil 2.9.0.post0
python-multipart 0.0.12
python-rapidjson 1.20
pytz 2024.2
PyYAML 6.0.2
referencing 0.35.1
regex 2024.9.11
requests 2.32.3
rich 13.9.2
rpds-py 0.20.0
ruff 0.6.9
safetensors 0.4.5
semantic-version 2.10.0
sentencepiece 0.2.0
setuptools 75.1.0
shellingham 1.5.4
shortuuid 1.0.13
six 1.16.0
sniffio 1.3.1
starlette 0.40.0
sympy 1.13.3
termcolor 2.5.0
tiktoken 0.8.0
tokenizers 0.20.1
tomli 2.0.2
tomlkit 0.12.0
torch 2.3.1
torchvision 0.18.1
tqdm 4.66.5
transformers 4.45.2
triton 2.3.1
tritonclient 2.50.0
typer 0.12.5
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.31.1
websockets 12.0
wheel 0.44.0
xxhash 3.5.0
yapf 0.40.2
yarl 1.15.2
zipp 3.20.2
```
vllm environment:

```
Package Version
--------------------------------- -------------
aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.6.2.post1
async-timeout 4.0.3
attrs 24.2.0
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
cloudpickle 3.1.0
datasets 3.0.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
einops 0.8.0
exceptiongroup 1.2.2
fastapi 0.115.2
filelock 3.16.1
frozenlist 1.4.1
fsspec 2024.6.1
gguf 0.10.0
h11 0.14.0
httpcore 1.0.6
httptools 0.6.2
httpx 0.27.2
huggingface-hub 0.25.2
idna 3.10
importlib_metadata 8.5.0
interegular 0.3.3
Jinja2 3.1.4
jiter 0.6.1
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
lark 1.2.2
llvmlite 0.43.0
lm-format-enforcer 0.10.6
MarkupSafe 3.0.1
mistral_common 1.4.4
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.18.6
multidict 6.1.0
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.4.1
numba 0.60.0
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.77
nvidia-nvtx-cu12 12.1.105
openai 1.51.2
opencv-python-headless 4.10.0.84
outlines 0.0.46
packaging 24.1
pandas 2.2.3
partial-json-parser 0.2.1.1.post4
pillow 10.4.0
pip 24.2
prometheus_client 0.21.0
prometheus-fastapi-instrumentator 7.0.0
propcache 0.2.0
protobuf 5.28.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 17.0.0
pycountry 24.6.1
pydantic 2.9.2
pydantic_core 2.23.4
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytz 2024.2
PyYAML 6.0.2
pyzmq 26.2.0
ray 2.37.0
referencing 0.35.1
regex 2024.9.11
requests 2.32.3
rpds-py 0.20.0
safetensors 0.4.5
sentencepiece 0.2.0
setuptools 75.1.0
six 1.16.0
sniffio 1.3.1
starlette 0.40.0
sympy 1.13.3
tiktoken 0.7.0
tokenizers 0.20.1
torch 2.4.0
torchvision 0.19.0
tqdm 4.66.5
transformers 4.45.2
triton 3.0.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.31.1
uvloop 0.21.0
vllm 0.6.3
watchfiles 0.24.0
websockets 13.1
wheel 0.44.0
xformers 0.0.27.post2
xxhash 3.5.0
yarl 1.15.2
zipp 3.20.2
```
### Log output
Running qwen2.5-32b-instruct-gptq-int8 with vllm:

```
INFO 10-18 03:33:40 engine.py:292] Added request chat-16c0f2740e8044d986b16ae0b68a6c7e.
INFO 10-18 03:33:41 metrics.py:345] Avg prompt throughput: 7.3 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 10-18 03:33:46 metrics.py:345] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO: 10.126.126.1:63229 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 10-18 03:34:00 metrics.py:345] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.6 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
```
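To separate raw engine throughput from HTTP-serving overhead, the same model can also be benchmarked with vllm's offline Python API. Below is a minimal sketch, assuming the model sits at ./Qwen2.5-32B-Instruct-GPTQ-Int8; the prompt and sampling settings are placeholders, not the exact measurement procedure used above.

```python
# Minimal sketch: measure raw generation throughput with vllm's offline API.
# Model path, prompt, and sampling settings are assumptions for illustration.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="./Qwen2.5-32B-Instruct-GPTQ-Int8",  # assumed local path
    tensor_parallel_size=4,                    # four V100 GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(["Briefly explain what GPTQ quantization does."], params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests and report tokens/s.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tokens/s")
```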
### Description
Deploying qwen2.5-32b-instruct-gptq-int4 with lmdeploy on a machine with four 16GB V100 GPUs, the peak output speed is only 80 tokens/s. The command used is:

```
lmdeploy serve api_server ./Qwen2.5-32B-Instruct-GPTQ-Int4 --model-format gptq --tp 4 --quant-policy 8
```

In addition, running Qwen2.5-32B-Instruct-GPTQ-Int8 with the latest vllm on the same machine reaches a peak output speed of only 50 tokens/s.
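For reference, the tokens/s figures can also be checked from the client side, since both lmdeploy's api_server and vllm expose an OpenAI-compatible endpoint. A minimal sketch follows, assuming the server listens on http://localhost:8000/v1 and the served model name matches the folder name; the prompt and generation length are placeholders:

```python
# Minimal sketch: client-side throughput check against an OpenAI-compatible
# endpoint. URL, model name, and prompt are assumptions for illustration.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="Qwen2.5-32B-Instruct-GPTQ-Int4",  # assumed served model name
    messages=[{"role": "user", "content": "Write a short poem about GPUs."}],
    max_tokens=256,
    temperature=0.7,
)
elapsed = time.perf_counter() - start

# Note: this measures end-to-end latency, so it includes prefill time,
# not just decode throughput.
completion_tokens = resp.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"-> {completion_tokens / elapsed:.1f} tokens/s")
```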