Qwen 2.5 AWQ quantization is slower than FP16 with vLLM
Similar to #645, I am getting worse performance and throughput with the quantized model. I used the out-of-the-box quantization example together with a basic vLLM script, and the same holds for both the 7B and the 14B.
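For reference, the quantization step was essentially the stock AutoAWQ example; the model name and output path below are placeholders rather than my exact ones:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder paths; the actual runs used the Qwen 2.5 7B and 14B checkpoints.
model_path = "Qwen/Qwen2.5-7B-Instruct"
quant_path = "qwen-7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration/quantization and save the 4-bit checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)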
With vLLM I see roughly 1.8x lower throughput for the AWQ model than for FP16. When I run the benchmarking script directly, however, AWQ actually decodes faster than FP16 (though prefill is slower). Output from the benchmarking script for the AWQ model:
-- Loading model...
-- Warming up...
-- Generating 32 tokens, 32 in context...
** Speed (Prefill): 228.75 tokens/second
** Speed (Decode): 86.49 tokens/second
** Max Memory (device: 0): 5.40 GB (5.80%)
-- Loading model...
-- Warming up...
-- Generating 64 tokens, 64 in context...
** Speed (Prefill): 3486.72 tokens/second
** Speed (Decode): 86.38 tokens/second
** Max Memory (device: 0): 5.40 GB (5.80%)
-- Loading model...
-- Warming up...
-- Generating 128 tokens, 128 in context...
** Speed (Prefill): 4590.52 tokens/second
** Speed (Decode): 85.33 tokens/second
** Max Memory (device: 0): 5.41 GB (5.81%)
-- Loading model...
-- Warming up...
-- Generating 256 tokens, 256 in context...
** Speed (Prefill): 5008.78 tokens/second
** Speed (Decode): 85.19 tokens/second
** Max Memory (device: 0): 5.43 GB (5.83%)
-- Loading model...
-- Warming up...
-- Generating 512 tokens, 512 in context...
** Speed (Prefill): 5496.49 tokens/second
** Speed (Decode): 84.98 tokens/second
** Max Memory (device: 0): 5.54 GB (5.95%)
-- Loading model...
-- Warming up...
-- Generating 1024 tokens, 1024 in context...
** Speed (Prefill): 15427.16 tokens/second
** Speed (Decode): 84.86 tokens/second
** Max Memory (device: 0): 5.71 GB (6.13%)
-- Loading model...
-- Warming up...
-- Generating 2048 tokens, 2048 in context...
** Speed (Prefill): 18722.37 tokens/second
** Speed (Decode): 84.74 tokens/second
** Max Memory (device: 0): 6.14 GB (6.60%)
-- Loading model...
-- Warming up...
-- Generating 4096 tokens, 4096 in context...
** Speed (Prefill): 20145.56 tokens/second
** Speed (Decode): 84.65 tokens/second
** Max Memory (device: 0): 7.01 GB (7.52%)
Device: cuda:0
GPU: NVIDIA H100 NVL
Model: /home/oweller2/my_scratch/AutoAWQ/qwen-7b-awq/
Version: gemm
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
| 1 | 32 | 32 | 228.75 | 86.49 | 5.40 GB (5.80%) |
| 1 | 64 | 64 | 3486.72 | 86.38 | 5.40 GB (5.80%) |
| 1 | 128 | 128 | 4590.52 | 85.33 | 5.41 GB (5.81%) |
| 1 | 256 | 256 | 5008.78 | 85.19 | 5.43 GB (5.83%) |
| 1 | 512 | 512 | 5496.49 | 84.98 | 5.54 GB (5.95%) |
| 1 | 1024 | 1024 | 15427.2 | 84.86 | 5.71 GB (6.13%) |
| 1 | 2048 | 2048 | 18722.4 | 84.74 | 6.14 GB (6.60%) |
| 1 | 4096 | 4096 | 20145.6 | 84.65 | 7.01 GB (7.52%) |
vs. the non-quantized FP16 model:
-- Loading model...
-- Warming up...
-- Generating 32 tokens, 32 in context...
** Speed (Prefill): 236.07 tokens/second
** Speed (Decode): 72.46 tokens/second
** Max Memory (device: 0): 14.38 GB (15.45%)
-- Loading model...
-- Warming up...
-- Generating 64 tokens, 64 in context...
** Speed (Prefill): 3610.96 tokens/second
** Speed (Decode): 72.52 tokens/second
** Max Memory (device: 0): 14.38 GB (15.45%)
-- Loading model...
-- Warming up...
-- Generating 128 tokens, 128 in context...
** Speed (Prefill): 7661.59 tokens/second
** Speed (Decode): 72.35 tokens/second
** Max Memory (device: 0): 14.38 GB (15.45%)
-- Loading model...
-- Warming up...
-- Generating 256 tokens, 256 in context...
** Speed (Prefill): 13484.31 tokens/second
** Speed (Decode): 72.53 tokens/second
** Max Memory (device: 0): 14.38 GB (15.45%)
-- Loading model...
-- Warming up...
-- Generating 512 tokens, 512 in context...
** Speed (Prefill): 20993.46 tokens/second
** Speed (Decode): 72.07 tokens/second
** Max Memory (device: 0): 14.43 GB (15.50%)
-- Loading model...
-- Warming up...
-- Generating 1024 tokens, 1024 in context...
** Speed (Prefill): 24013.15 tokens/second
** Speed (Decode): 72.44 tokens/second
** Max Memory (device: 0): 14.61 GB (15.69%)
-- Loading model...
-- Warming up...
-- Generating 2048 tokens, 2048 in context...
** Speed (Prefill): 22595.46 tokens/second
** Speed (Decode): 72.41 tokens/second
** Max Memory (device: 0): 14.97 GB (16.07%)
-- Loading model...
-- Warming up...
-- Generating 4096 tokens, 4096 in context...
** Speed (Prefill): 24222.07 tokens/second
** Speed (Decode): 72.35 tokens/second
** Max Memory (device: 0): 15.67 GB (16.82%)
Device: cuda:0
GPU: NVIDIA H100 NVL
Model: /home/oweller2/my_scratch/AutoAWQ/qwen-7b-custom
Version: FP16
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
| 1 | 32 | 32 | 236.07 | 72.46 | 14.38 GB (15.45%) |
| 1 | 64 | 64 | 3610.96 | 72.52 | 14.38 GB (15.45%) |
| 1 | 128 | 128 | 7661.59 | 72.35 | 14.38 GB (15.45%) |
| 1 | 256 | 256 | 13484.3 | 72.53 | 14.38 GB (15.45%) |
| 1 | 512 | 512 | 20993.5 | 72.07 | 14.43 GB (15.50%) |
| 1 | 1024 | 1024 | 24013.2 | 72.44 | 14.61 GB (15.69%) |
| 1 | 2048 | 2048 | 22595.5 | 72.41 | 14.97 GB (16.07%) |
| 1 | 4096 | 4096 | 24222.1 | 72.35 | 15.67 GB (16.82%) |
Installed versions are:
vllm==0.7.2
autoawq==0.2.8
autoawq_kernels==0.0.9
with the model set up like this in my script:

from vllm import LLM, SamplingParams

self.sampling_params = SamplingParams(
    temperature=0,
    max_tokens=max_output_tokens,
    logprobs=20,
    skip_special_tokens=False,
)
self.model = LLM(
    model=model_name_or_path,
    tensor_parallel_size=int(num_gpus),
    trust_remote_code=True,
    max_model_len=context_size,
    gpu_memory_utilization=0.9,
    quantization="AWQ",
    dtype="float16",
)
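Generation itself is just the standard offline generate() call, roughly like this (a sketch: prompts is a list of strings built elsewhere in the script, and the real code also reads the returned logprobs):

# Sketch of the call site, not the exact script.
outputs = self.model.generate(prompts, self.sampling_params)
completions = [o.outputs[0].text for o in outputs]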
Am I using vLLM incorrectly, or do I need other packages for AWQ to perform well?
Update: removing quantization="AWQ" (per this link) seems to speed things up, but it is still slower than FP16.
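For anyone comparing, the faster variant simply omits the explicit quantization argument so that vLLM infers it from the checkpoint's quantization_config; my understanding is that on recent vLLM versions this lets it pick a faster AWQ kernel (e.g. awq_marlin) when the GPU supports it. A minimal sketch with placeholder path and lengths:

from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/qwen-7b-awq",   # placeholder path to the AWQ checkpoint
    trust_remote_code=True,
    max_model_len=4096,             # placeholder context length
    gpu_memory_utilization=0.9,
    dtype="float16",
    # Note: no quantization="AWQ" here; vLLM reads the method from the
    # checkpoint config and may select a faster kernel than the plain AWQ GEMM.
)
params = SamplingParams(temperature=0, max_tokens=128)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)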
I have the same problem.