Performance issues on AWQ and LoRA
System Info
docker image: ghcr.io/predibase/lorax:07addea (using this tag because the main image isn't working on the latest drivers)
device: Nvidia A100 80GB
models in use: meta-llama/Meta-Llama-3.1-8B-Instruct and hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
LoRAs: fine-tuned with LLaMA-Factory
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
When testing with a batch of 6 concurrent requests (a rough sketch of the test harness follows this list):

- Base model meta-llama/Meta-Llama-3.1-8B-Instruct takes 17-20 ms/token, i.e. ~55 tokens/sec.
- AWQ quantized model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 takes 23-27 ms/token, ~48 tokens/sec.
- With 3 LoRAs (2 requests for each LoRA), the above base model takes 38-46 ms/token, ~25 tokens/sec.
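
For context, the concurrency test was roughly of this shape (a minimal sketch, not the exact harness; it assumes lorax's standard `/generate` REST endpoint, that `"details": true` makes the response include `generated_tokens`, and that LoRA requests are routed by setting `adapter_id` in `parameters`):

```python
# Minimal sketch of the 6-concurrent-request test (assumptions noted above).
import asyncio
import time

import httpx

LORAX_URL = "http://127.0.0.1:8080/generate"
PROMPT = "Write a short story about a robot learning to paint."

async def one_request(client: httpx.AsyncClient, adapter_id: str | None) -> float:
    """Send one generate request and return its latency in ms per generated token."""
    payload = {"inputs": PROMPT, "parameters": {"max_new_tokens": 256, "details": True}}
    if adapter_id is not None:
        payload["parameters"]["adapter_id"] = adapter_id  # route this request to a LoRA
    start = time.perf_counter()
    resp = await client.post(LORAX_URL, json=payload, timeout=300.0)
    elapsed = time.perf_counter() - start
    tokens = resp.json()["details"]["generated_tokens"]
    return elapsed * 1000.0 / tokens

async def main() -> None:
    # 6 concurrent requests against the base model; for the LoRA runs the list
    # becomes 3 adapter ids (placeholders), each repeated twice.
    adapters: list[str | None] = [None] * 6
    async with httpx.AsyncClient() as client:
        per_token_ms = await asyncio.gather(*(one_request(client, a) for a in adapters))
    mean_ms = sum(per_token_ms) / len(per_token_ms)
    print(f"{mean_ms:.1f} ms/token (~{1000.0 / mean_ms:.0f} tokens/sec per request)")

asyncio.run(main())
```

The ms/token figures above are the per-request numbers from runs like this, averaged over the batch.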
Expected behavior
- AWQ quantized model is slower than the base model: I was expecting it to be at least somewhat faster than the base model, but it is actually slower. The memory footprint is smaller, though: the base model took 20.2 GB to load while the AWQ model took 12.2 GB (measured before the rest of the memory was reserved). For the same models (base, AWQ quantized), the throughput on sglang (with cuda graph, radix and the marlin_awq kernel) is 78 tokens/sec and 150 tokens/sec respectively, compared to 55 and 48 tokens/sec here.
- LoRAs are almost twice as slow as the base model: I was expecting them to be slower than the base, but getting ~30 tokens/sec (when the base does ~60 tokens/sec) on an Nvidia A100 for Llama-3.1-8B-Instruct + LoRA was surprising (see the client-side sketch after this list for how each LoRA request is targeted). sglang also added LoRA support recently, but it doesn't support any optimizations and gives an abysmal 10 tokens/sec with LoRAs.
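
For the LoRA runs, each request differs from a base-model request only by the adapter it targets. A minimal sketch using the lorax Python client (`lorax-client` package; the adapter id below is a placeholder, the real ones are the LLaMA-Factory LoRAs mentioned in System Info):

```python
from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "Write a short story about a robot learning to paint."

# Base model request
print(client.generate(prompt, max_new_tokens=64).generated_text)

# Same request routed through one of the LoRA adapters (placeholder id)
print(
    client.generate(
        prompt, max_new_tokens=64, adapter_id="my-org/my-llama-factory-lora"
    ).generated_text
)
```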