Regression on EETQ quantized models
System Info
I have reason to believe that https://github.com/huggingface/text-generation-inference/pull/1729 causes a 2-3x performance regression in the decoding stage when running EETQ-quantized models on multiple shards with CUDA graphs enabled. Supporting experiments are below.
Note: I understand the TGI built-in benchmarker is the preferred way to report such results; I can follow up with those numbers if needed.
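For reference, a minimal sketch of how the built-in benchmarker could be invoked against the running container; the `docker exec` pattern and the `--tokenizer-name` flag are assumptions to be checked against `text-generation-benchmark --help`:

```bash
# Assumed invocation: run the benchmarker inside the already-running TGI container.
docker exec -it <tgi-container> \
  text-generation-benchmark --tokenizer-name mistralai/Mixtral-8x7B-Instruct-v0.1
```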
Hardware used:
NVIDIA-SMI 535.129.03
Driver Version: 535.129.03
CUDA Version: 12.2
[NVIDIA A100-SXM4-40GB | 400W | 40960MiB] x 8
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Experiment 1
TGI image: sha-c2fd35d (from https://github.com/huggingface/text-generation-inference/pull/1716, i.e. before the EETQ upgrade)
Args:
--model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --quantize eetq --sharded true --num-shard 2 --disable-grammar-support
Hardware: 2x A100 40GB
50th percentile of per-token decode latency: ~8 ms
Load: one request at a time to /generate, with input lengths of 128, 256, or 512 tokens and a maximum of 32 output tokens (a launch/request sketch follows below).
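A sketch of the reproduction, assuming the sha-tagged images are pulled from ghcr.io/huggingface/text-generation-inference and the container port is mapped to localhost:8080 (Experiments 2 and 3 below differ only in the image tag, shard count, and quantization flag); the prompt placeholder is illustrative:

```bash
# Launch TGI for Experiment 1 (image path/tag and port mapping are assumptions).
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:sha-c2fd35d \
  --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --quantize eetq --sharded true --num-shard 2 --disable-grammar-support

# Send one request at a time to /generate with a ~128/256/512-token prompt
# and a 32-token output cap.
curl -s http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "<prompt of ~128 tokens>", "parameters": {"max_new_tokens": 32}}'
```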
Experiment 2
TGI image: sha-6c2c44b (the EETQ upgrade, https://github.com/huggingface/text-generation-inference/pull/1729)
Args:
--model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --quantize eetq --sharded true --num-shard 2 --disable-grammar-support
Hardware: 2x A100 40GB
50th percentile of per-token decode latency: ~25 ms
Load: one request at a time to /generate, with input lengths of 128, 256, or 512 tokens and a maximum of 32 output tokens.
Experiment 3
TGI image: 2.0.0
Args:
--model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --sharded true --num-shard 4 --disable-grammar-support
Hardware: 4x A100 40GB
50th percentile of per-token decode latency: ~10 ms
Load: one request at a time to /generate, with input lengths of 128, 256, or 512 tokens and a maximum of 32 output tokens.
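For completeness, a minimal sketch of one way per-token decode latency could be observed from the client side, using the streaming endpoint (assumes the server is reachable on localhost:8080 and GNU coreutils `date` for millisecond timestamps; the prompt placeholder is illustrative):

```bash
# Stream tokens from /generate_stream and timestamp each server-sent event on arrival;
# differences between consecutive timestamps approximate per-token decode latency.
curl -s -N http://localhost:8080/generate_stream \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "<prompt of ~128 tokens>", "parameters": {"max_new_tokens": 32}}' \
| while IFS= read -r line; do
    # Each "data:" line corresponds to one generated token.
    [[ "$line" == data:* ]] && echo "$(date +%s%3N)"
  done
```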
Expected behavior
Experiment 2 shows a ~3x regression in per-token decode latency relative to Experiment 1, which uses the same configuration but a TGI image from before the EETQ upgrade. Experiment 3 shows that when the model is not quantized, per-token decode latency is ~2.5x better than in Experiment 2.
Performance should remain consistent when sharding an EETQ-quantized model.