vllm [Performance]: Qwen 7b chat model, under 128 concurrency, the CPU utilization rate is 100%, and the GPU SM utilization rate is only about 60%-75%. Is it a CPU bottleneck?

Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance

I am using vLLM to serve the Qwen 7B chat model. Under very high concurrency, such as 128 concurrent requests, CPU utilization reaches 100% while GPU utilization stays below 60%.

My question: because much of vLLM's scheduling and computation logic is implemented with Python coroutines, it can use only a single CPU core. With 128 concurrent requests, is the CPU becoming the bottleneck and preventing the GPU from reaching higher utilization?
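One way to probe the single-core hypothesis is to look at per-core utilization rather than the aggregate: a workload limited by one Python thread tends to show one logical CPU near 100% while the others stay mostly idle. The sketch below only illustrates that check. The function name `looks_single_core_bound`, its thresholds, and the sample readings are hypothetical and not part of vLLM; real per-core numbers could come from a tool such as `mpstat -P ALL`.

```python
from statistics import mean


def looks_single_core_bound(per_core_pct, busy=90.0, idle=30.0):
    """Return True when one core is saturated while the rest are mostly idle.

    per_core_pct: utilization of each logical CPU in percent (0-100).
    busy / idle:  illustrative thresholds, not tuned values.
    """
    if not per_core_pct:
        raise ValueError("need at least one per-core sample")
    ranked = sorted(per_core_pct, reverse=True)
    hottest, rest = ranked[0], ranked[1:]
    # A single saturated core with quiet siblings suggests one busy thread.
    return hottest >= busy and (not rest or mean(rest) <= idle)


# Hypothetical readings for a 14-CPU host like the one in this report.
sample = [100.0, 6.0, 4.0, 3.0, 5.0, 2.0, 4.0, 3.0, 2.0, 5.0, 3.0, 4.0, 2.0, 3.0]
print(looks_single_core_bound(sample))
print(f"machine-wide average: {mean(sample):.1f}%")
```

If the check comes back true on real measurements, that is consistent with a CPU-side bottleneck, but it does not prove one; a profiler run would be needed to confirm where the time goes.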

Model download address: https://huggingface.co/Qwen/Qwen-7B-Chat/tree/main

For the server scenario
For the offline batch inference scenario

import random
import json
from vllm import LLM, SamplingParams

conc = 128
jsonl_path = "xxx.jsonl"

# Read the prompts for the concurrent requests from the JSONL file
all_prompts = []
with open(jsonl_path, "r") as f:
    for line in f:
        line_obj = json.loads(line)
        print("line_obj as: ", line_obj)
        try:
            prompt = line_obj[-1]["content"]
        except KeyError:
            # Some records capitalize the key.
            prompt = line_obj[-1]["Content"]

        all_prompts.append(prompt)

# Sample prompts.
if len(all_prompts) > conc:
    prompts = all_prompts[:conc]
else:
    prompts = random.choices(all_prompts, k=conc)

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=500)

# Create an LLM.
#llm = LLM(model="facebook/opt-125m")
# Qwen 7B chat, loaded from a local path
llm = LLM(model="/models/models--Qwen--Qwen-7B-Chat-new", trust_remote_code=True)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
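To compare runs at different concurrency levels, it can help to turn the result of `llm.generate` into a single tokens-per-second figure. The following is a minimal sketch under stated assumptions: `generation_throughput` is a hypothetical helper, and the stand-in objects only mimic the shape of the `RequestOutput` list (an `outputs` list whose items expose `token_ids`) so the example runs without a GPU.

```python
from types import SimpleNamespace


def generation_throughput(outputs, elapsed_s):
    """Generated tokens per second across all requests.

    Expects objects shaped like the RequestOutput list returned by
    llm.generate: each has an `outputs` list whose items expose `token_ids`.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    total = sum(len(c.token_ids) for o in outputs for c in o.outputs)
    return total / elapsed_s


# Stand-in data (hypothetical): 128 requests, 500 generated tokens each.
fake_outputs = [
    SimpleNamespace(outputs=[SimpleNamespace(token_ids=list(range(500)))])
    for _ in range(128)
]
print(generation_throughput(fake_outputs, elapsed_s=20.0))
```

With the script above, the elapsed time could be taken by wrapping the `llm.generate(prompts, sampling_params)` call in two `time.perf_counter()` readings.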

Your current environment (if you think it is necessary)

Collecting environment information...
PyTorch version: 2.2.2+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.27

Python version: 3.9.16 (main, May 15 2023, 23:46:34)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.6.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              14
On-line CPU(s) list: 0-13
Thread(s) per core:  2
Core(s) per socket:  7
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
Stepping:            6
CPU MHz:             2593.904
BogoMIPS:            5187.80
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            49152K
NUMA node0 CPU(s):   0-13
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq spec_ctrl

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu11==2.19.3
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.2+cu118
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu11==2.18.1.0.4.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu11          2.19.3                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.19.3                   pypi_0    pypi
[conda] torch                     2.2.2+cu118              pypi_0    pypi
[conda] triton                    2.2.0                    pypi_0    pypi
[conda] vllm-nccl-cu11            2.18.1.0.4.0             pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	0-13	0		N/A
NIC0	SYS	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

May 14 '24 07:05 markluofd

You can start the vLLM API server, which logs throughput along with GPU and CPU KV cache usage, for example:

Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%

The KV cache occupies GPU memory first, then CPU memory. You can use an FP8 E4M3 KV cache to reduce KV cache usage.
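A stats line like the one quoted in this comment can be turned into numbers for plotting or alerting. The sketch below is an illustration only: `parse_stats` is a hypothetical helper, and it assumes the input is just the comma-separated `name: value` portion shown in the comment, with any logger prefix already stripped.

```python
import re

NUMBER_RE = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def parse_stats(line):
    """Parse 'name: value unit' fields separated by commas into a dict."""
    stats = {}
    for field in line.split(","):
        name, sep, value = field.partition(":")
        match = NUMBER_RE.match(value.strip())
        # Skip fragments that have no colon or no leading number.
        if sep and match:
            stats[name.strip()] = float(match.group())
    return stats


line = (
    "Avg prompt throughput: 0.0 tokens/s, "
    "Avg generation throughput: 0.0 tokens/s, "
    "Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, "
    "GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%"
)
stats = parse_stats(line)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Watching `Running`, `Pending`, and `GPU KV cache usage` over time gives a rough picture of whether requests are queuing for cache space or the engine is simply not being fed fast enough.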

May 15 '24 10:05 blacker521

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

Oct 27 '24 02:10 github-actions[bot]
