Performance issues on AWQ and LoRA
System Info
docker image: ghcr.io/predibase/lorax:07addea (using this tag because the main image isn't working on the latest drivers)
device: Nvidia A100 80GB
models in use: meta-llama/Meta-Llama-3.1-8B-Instruct and hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
LoRAs: fine-tuned with LLaMA-Factory
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
When testing with a batch of 6 concurrent requests (a rough sketch of the test harness follows this list):

- Base model meta-llama/Meta-Llama-3.1-8B-Instruct takes 17-20 ms/token, i.e. ~55 tokens/sec.
- AWQ quantized model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 takes 23-27 ms/token, ~48 tokens/sec.
- With 3 LoRAs (2 requests for each LoRA), the above base model takes 38-46 ms/token, ~25 tokens/sec.
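
For context, the concurrency test was roughly of this shape (a minimal sketch, not the exact harness; it assumes lorax's standard `/generate` REST endpoint, that `"details": true` makes the response include `generated_tokens`, and that LoRA requests are routed by setting `adapter_id` in `parameters`):

```python
# Minimal sketch of the 6-concurrent-request test (assumptions noted above).
import asyncio
import time

import httpx

LORAX_URL = "http://127.0.0.1:8080/generate"
PROMPT = "Write a short story about a robot learning to paint."

async def one_request(client: httpx.AsyncClient, adapter_id: str | None) -> float:
    """Send one generate request and return its latency in ms per generated token."""
    payload = {"inputs": PROMPT, "parameters": {"max_new_tokens": 256, "details": True}}
    if adapter_id is not None:
        payload["parameters"]["adapter_id"] = adapter_id  # route this request to a LoRA
    start = time.perf_counter()
    resp = await client.post(LORAX_URL, json=payload, timeout=300.0)
    elapsed = time.perf_counter() - start
    tokens = resp.json()["details"]["generated_tokens"]
    return elapsed * 1000.0 / tokens

async def main() -> None:
    # 6 concurrent requests against the base model; for the LoRA runs the list
    # becomes 3 adapter ids (placeholders), each repeated twice.
    adapters: list[str | None] = [None] * 6
    async with httpx.AsyncClient() as client:
        per_token_ms = await asyncio.gather(*(one_request(client, a) for a in adapters))
    mean_ms = sum(per_token_ms) / len(per_token_ms)
    print(f"{mean_ms:.1f} ms/token (~{1000.0 / mean_ms:.0f} tokens/sec per request)")

asyncio.run(main())
```

The ms/token figures above are the per-request numbers from runs like this, averaged over the batch.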
Expected behavior
- AWQ quantized model is slower than the base model: I was expecting it to be at least somewhat faster than the base model, but it is actually slower. The memory footprint is smaller, though: the base model took 20.2 GB to load while the AWQ model took 12.2 GB (measured before the rest of the memory was reserved). For the same models (base, AWQ quantized), the throughput on sglang (with cuda graph, radix and the marlin_awq kernel) is 78 tokens/sec and 150 tokens/sec respectively, compared to 55 and 48 tokens/sec here.
- LoRAs are almost twice as slow as the base model: I was expecting them to be slower than the base, but getting ~30 tokens/sec (when the base does ~60 tokens/sec) on an Nvidia A100 for Llama-3.1-8B-Instruct + LoRA was surprising (see the client-side sketch after this list for how each LoRA request is targeted). sglang also added LoRA support recently, but it doesn't support any optimizations and gives an abysmal 10 tokens/sec with LoRAs.
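
For the LoRA runs, each request differs from a base-model request only by the adapter it targets. A minimal sketch using the lorax Python client (`lorax-client` package; the adapter id below is a placeholder, the real ones are the LLaMA-Factory LoRAs mentioned in System Info):

```python
from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "Write a short story about a robot learning to paint."

# Base model request
print(client.generate(prompt, max_new_tokens=64).generated_text)

# Same request routed through one of the LoRA adapters (placeholder id)
print(
    client.generate(
        prompt, max_new_tokens=64, adapter_id="my-org/my-llama-factory-lora"
    ).generated_text
)
```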