[Bug]: Performance: slow inference for FP8 on L20 with 0.5.1 (v0.5.0.post1 was fine)
Your current environment
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-42-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L20
GPU 1: NVIDIA L20
GPU 2: NVIDIA L20
GPU 3: NVIDIA L20
GPU 4: NVIDIA L20
GPU 5: NVIDIA L20
GPU 6: NVIDIA L20
GPU 7: NVIDIA L20
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PIX PIX PIX SYS SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU1 PIX X PIX PIX SYS SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU2 PIX PIX X PIX SYS SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU3 PIX PIX PIX X SYS SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU4 SYS SYS SYS SYS X PIX PIX PIX SYS 24-47,72-95 1 N/A
GPU5 SYS SYS SYS SYS PIX X PIX PIX SYS 24-47,72-95 1 N/A
GPU6 SYS SYS SYS SYS PIX PIX X PIX SYS 24-47,72-95 1 N/A
GPU7 SYS SYS SYS SYS PIX PIX PIX X SYS 24-47,72-95 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
🐛 Describe the bug
With vLLM 0.5.1, FP8-quantized inference on L20 is even slower than FP16, whereas on 0.5.0 (v0.5.0.post1) FP8 was clearly faster than FP16.
Qwen2-7B on 1× L20:
vLLM version | Quantization | Throughput
---|---|---
0.5.0 | FP8 | 59.41 token/s
0.5.0 | FP16 | 44.05 token/s
0.5.1 | FP8 | 26.50 token/s
0.5.1 | FP16 | 43.60 token/s
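
A minimal sketch of how this kind of measurement can be reproduced with the offline `LLM` API is shown below. The issue does not include the exact launch command, so the model ID, prompt, and sampling settings are assumptions; the only variable of interest is the `quantization="fp8"` flag.

```python
import time

from vllm import LLM, SamplingParams

# Minimal sketch of the FP8 setup; model ID, prompt, and sampling settings
# are assumed, since the exact repro command is not shown in the report.
llm = LLM(model="Qwen/Qwen2-7B-Instruct", quantization="fp8")  # single L20

params = SamplingParams(temperature=0.0, max_tokens=512)
prompt = "Explain the difference between FP8 and FP16 inference."

start = time.perf_counter()
outputs = llm.generate([prompt], params)
elapsed = time.perf_counter() - start

# Count only the generated tokens to estimate decode throughput.
n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"decode throughput: {n_tokens / elapsed:.2f} token/s")
```

Running the same script without `quantization="fp8"` gives the FP16 baseline; comparing both settings on 0.5.0 and 0.5.1 should show whether the regression in the table above reproduces.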