
[Bug]: Performance: slow inference for FP8 on L20 with 0.5.1 (v0.5.0.post1 was fine)

Open · garycaokai opened this issue 7 months ago · 6 comments

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-42-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L20
GPU 1: NVIDIA L20
GPU 2: NVIDIA L20
GPU 3: NVIDIA L20
GPU 4: NVIDIA L20
GPU 5: NVIDIA L20
GPU 6: NVIDIA L20
GPU 7: NVIDIA L20

Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PIX	PIX	PIX	SYS	SYS	SYS	SYS	SYS	0-23,48-71	0		N/A
GPU1	PIX	 X 	PIX	PIX	SYS	SYS	SYS	SYS	SYS	0-23,48-71	0		N/A
GPU2	PIX	PIX	 X 	PIX	SYS	SYS	SYS	SYS	SYS	0-23,48-71	0		N/A
GPU3	PIX	PIX	PIX	 X 	SYS	SYS	SYS	SYS	SYS	0-23,48-71	0		N/A
GPU4	SYS	SYS	SYS	SYS	 X 	PIX	PIX	PIX	SYS	24-47,72-95	1		N/A
GPU5	SYS	SYS	SYS	SYS	PIX	 X 	PIX	PIX	SYS	24-47,72-95	1		N/A
GPU6	SYS	SYS	SYS	SYS	PIX	PIX	 X 	PIX	SYS	24-47,72-95	1		N/A
GPU7	SYS	SYS	SYS	SYS	PIX	PIX	PIX	 X 	SYS	24-47,72-95	1		N/A
NIC0	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0

🐛 Describe the bug

With vLLM 0.5.1, FP8-quantized inference on the L20 is even slower than FP16, whereas with 0.5.0 FP8 was clearly faster than FP16.

Qwen2-7B on 1× L20:

| vLLM version | Quantization | Speed (token/s) |
|--------------|--------------|-----------------|
| 0.5.0        | fp8          | 59.41           |
| 0.5.0        | fp16         | 44.05           |
| 0.5.1        | fp8          | 26.50           |
| 0.5.1        | fp16         | 43.60           |
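
A comparison along these lines can be reproduced with vLLM's offline `LLM` API. The sketch below is not the exact benchmark the numbers above came from; the model name, prompt, and token count are placeholders:

```python
import time

from vllm import LLM, SamplingParams

# Placeholder model; swap in the actual checkpoint being benchmarked.
# Pass quantization="fp8" for the FP8 run; omit it for the FP16 baseline.
llm = LLM(model="Qwen/Qwen2-7B-Instruct", quantization="fp8")
params = SamplingParams(temperature=0.0, max_tokens=512)

prompt = "Explain the difference between FP8 and FP16 inference."
start = time.perf_counter()
outputs = llm.generate([prompt], params)
elapsed = time.perf_counter() - start

# Decode throughput: generated tokens per wall-clock second.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.2f} token/s")
```

Running the same script on 0.5.0 and 0.5.1, with and without `quantization="fp8"`, should surface the regression shown in the table.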

garycaokai · Jul 09 '24 03:07