
Regression on EETQ quantized models


System Info

I have reason to believe that https://github.com/huggingface/text-generation-inference/pull/1729 causes a 2-3x performance regression in the decode stage when running EETQ-quantized models on multiple shards with CUDA graphs enabled. Supporting experiments are below.

Note: I understand that the TGI built-in benchmarker is the preferred way to report such results; I can follow up with benchmarker numbers if needed (a hedged sketch of such a run is included below).
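As an aside, here is a minimal sketch of what that benchmarker run could look like, assuming the server container is already up. The container name and the sequence/decode lengths are my assumptions, and flag names may differ between TGI versions, so check `text-generation-benchmark --help` inside the container:

```bash
# Hypothetical benchmarker invocation inside the running TGI container.
# "tgi" (container name) and the lengths below are assumptions chosen to
# mirror the load described in the experiments, not values from this report.
docker exec -it tgi text-generation-benchmark \
  --tokenizer-name mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --sequence-length 512 \
  --decode-length 32 \
  --batch-size 1
```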

Hardware used:

NVIDIA-SMI 535.129.03  
Driver Version: 535.129.03 
CUDA Version: 12.2
[NVIDIA A100-SXM4-40GB | 400W |  40960MiB] x 8

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

Experiment 1

- TGI image: sha-c2fd35d (from https://github.com/huggingface/text-generation-inference/pull/1716, before the EETQ upgrade)
- Args (see the launch sketch below): `--model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --quantize eetq --sharded true --num-shard 2 --disable-grammar-support`
- Hardware: 2x A100 @ 40GB memory
- 50th percentile of per-token decode latency: ~8ms
- Load: 1 request at a time to `/generate`, inputs of 128|256|512 tokens, max output 32 tokens
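For reference, a minimal sketch of how this configuration can be brought up with the standard TGI Docker image; the port mapping, volume path, and GPU selection are my assumptions, while the image tag and launcher flags come from the experiment:

```bash
# Sketch of the Experiment 1 launch; adjust the volume, port, and GPU list to
# your environment. Only the image tag and launcher flags come from the report.
docker run --gpus '"device=0,1"' --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" \
  ghcr.io/huggingface/text-generation-inference:sha-c2fd35d \
  --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --quantize eetq \
  --sharded true \
  --num-shard 2 \
  --disable-grammar-support
```

Experiments 2 and 3 use the same shape of command, changing only the image tag (sha-6c2c44b and 2.0.0), the shard count, and whether `--quantize eetq` is passed.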

Experiment 2

- TGI image: sha-6c2c44b ("Upgrade EETQ", https://github.com/huggingface/text-generation-inference/pull/1729)
- Args: `--model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --quantize eetq --sharded true --num-shard 2 --disable-grammar-support`
- Hardware: 2x A100 @ 40GB memory
- 50th percentile of per-token decode latency: ~25ms
- Load: 1 request at a time to `/generate`, inputs of 128|256|512 tokens, max output 32 tokens

Experiment 3

- TGI image: 2.0.0
- Args (no quantization): `--model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --sharded true --num-shard 4 --disable-grammar-support`
- Hardware: 4x A100 @ 40GB memory
- 50th percentile of per-token decode latency: ~10ms
- Load: 1 request at a time to `/generate`, inputs of 128|256|512 tokens, max output 32 tokens (the latency-probe sketch below shows how these numbers can be approximated)
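For completeness, a rough sketch of the kind of probe behind the latency numbers above (not the exact script used). It times `/generate` calls against a locally mapped TGI endpoint and reports an approximate p50 per generated token; the port, prompt file, and request count are assumptions, and the timing includes prefill, so it slightly overestimates pure decode latency:

```bash
#!/usr/bin/env bash
# Rough per-token decode latency probe (sketch, not the exact script used).
# Assumes TGI is reachable on localhost:8080 and prompt.txt holds an input of
# the desired length (128/256/512 tokens).
PROMPT=$(jq -Rs . < prompt.txt)   # JSON-encode the prompt text

for i in $(seq 1 30); do
  start=$(date +%s%3N)            # milliseconds (GNU date)
  curl -s http://localhost:8080/generate \
    -H 'Content-Type: application/json' \
    -d "{\"inputs\": ${PROMPT}, \"parameters\": {\"max_new_tokens\": 32}}" \
    > /dev/null
  end=$(date +%s%3N)
  echo "scale=2; ($end - $start) / 32" | bc   # ms per generated token
done | sort -n | awk '{v[NR]=$1} END {print "p50 ms/token:", v[int((NR+1)/2)]}'
```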

Expected behavior

Exp. 2 shows a ~3x regression in per-token decode latency with respect to Exp. 1, which uses the same configuration but a TGI image from before the EETQ upgrade. Exp. 3 shows that when the model is not quantized, per-token decode latency is ~2.5x better.

Performance should be consistent when sharding an EETQ-quantized model.

claudioMontanari · Apr 20 '24 20:04