On-The-Fly Quantization for Inference appears not to be working as per documentation.
System Info
- Platform: Dell 760xa with 4x L40S GPUs
- OS Description: Ubuntu 22.04.5 LTS
- GPU: NVIDIA-SMI 550.90.07, Driver Version: 550.90.07
- CUDA Version: 12.4
- Python: 3.10.12
- Docker: 26.1.5
Model: [Deploy Meta-Llama-3.1-8b-Instruct | Dell Enterprise Hub by Hugging Face](https://dell.huggingface.co/authenticated/models/meta-llama/Meta-Llama-3.1-8b-Instruct/deploy/docker)
Tested with two versions of the model container:
- registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test -> TGI 2.4.0
- registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct -> TGI 2.0.5.dev0
Information
- [ ] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
1. Deploy the model from https://dell.huggingface.co/authenticated/models/meta-llama/Meta-Llama-3.1-8b-Instruct/deploy/docker
2. Run the following command (no quantization). Note: variations were also tried, see 6) below.
   ```
   docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2 -e MAX_BATCH_PREFILL_TOKENS=16182 -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test
   ```
3. From the host where the model is deployed, run `nvidia-smi` and note the GPU memory usage, e.g. 27629 MiB for each GPU.
4. Re-run step 2, this time with quantization specified:
   ```
   docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2 -e MAX_BATCH_PREFILL_TOKENS=16182 -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test --quantize bitsandbytes
   ```
5. From the host, run `nvidia-smi` again and note the GPU memory usage, e.g. 276701 for the 1st GPU and 27027 for the 2nd GPU (MiB, as reported by nvidia-smi).
6. The expected on-the-fly quantization memory savings were tested across the following model configuration combinations:
TGI container versions 2.4.0 & 2.0.5.dev0 (current DEH version):
- registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test -> TGI 2.4.0
- registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct -> TGI 2.0.5.dev0

Single and dual GPUs:
```
docker run -it --shm-size 1g -p 80:80 --gpus 1 -e NUM_SHARD=1
docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2
```

Quantize options: bitsandbytes, bitsandbytes-fp4, bitsandbytes-nf4, fp8, eetq (a sweep over these options is sketched below).
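To make the comparison repeatable, here is a minimal sweep sketch over those options (my own script, not part of the Dell/HF deployment instructions; the container names, the 120 s warmup wait, and the `none` placeholder are assumptions):

```bash
# Launch the container once per quantize option, then record GPU memory and the
# KV-cache log line (the KV-cache line is only printed by the TGI 2.4.0 image).
for q in none bitsandbytes bitsandbytes-fp4 bitsandbytes-nf4 fp8 eetq; do
  args=""
  [ "$q" != "none" ] && args="--quantize $q"
  docker run -d --name "tgi-$q" --shm-size 1g -p 80:80 --gpus 2 \
    -e NUM_SHARD=2 -e MAX_BATCH_PREFILL_TOKENS=16182 \
    -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 \
    registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test $args
  sleep 120   # rough guess for model load + warmup time
  echo "=== quantize=$q ==="
  nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv
  docker logs "tgi-$q" 2>&1 | grep "KV-cache"
  docker rm -f "tgi-$q"
done
```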
Expected behavior
See the results in the attached PDF, HF-TICKET-Quantization-Results.pdf.
Running, e.g., Llama-3.1-8B-Instruct with --quantize bitsandbytes, we see only minor or insignificant differences in GPU memory utilization. Note: both TGI container versions show similar signatures.
On-the-fly quantization for inference doesn't appear to be working as expected: the launcher documentation says that bitsandbytes (8-bit) "will cut the memory requirement in half" and the 4-bit variants "will cut the memory requirement by 4x", but we observe neither.
- Do bitsandbytes-fp4 and bitsandbytes-nf4 work? (A quick functional check is sketched after this list.)
- Does fp8 quantization work for on-the-fly inference quantization?
- Does eetq quantization work for on-the-fly inference quantization?
- Does on-the-fly quantization work with multi-GPU instances?
- Should different input-token configurations be used to see meaningful quantization results? E.g. `{ MAX_BATCH_PREFILL_TOKENS=16182, MAX_INPUT_TOKENS=8000, MAX_TOTAL_TOKENS=8192 }`
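Independently of the memory question, a quick functional check against each deployment can confirm that generation still works under a given quantize option. A minimal sketch using TGI's /generate route (the prompt and token count are arbitrary):

```bash
# Send one request to the deployed container (port 80 per the docker commands above)
# and print the generated text; repeat per quantize option.
curl -s http://localhost:80/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Briefly explain quantization.", "parameters": {"max_new_tokens": 32}}'
```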
Could you check the size of the key-value cache in both cases? The memory freed up by quantization is used to increase the size of the key-value cache, so that more requests can be in flight simultaneously and prefix caching gives larger benefits. See e.g.:
```
❯ text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct --port 8080 | grep "KV-cache"
2024-11-15T14:01:33.338562Z INFO text_generation_launcher: KV-cache blocks: 35481, size: 1
❯ text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct --port 8080 --quantize fp8 | grep "KV-cache"
2024-11-15T14:02:53.123154Z INFO text_generation_launcher: KV-cache blocks: 72561, size: 1
```
@danieldk we launch the container via docker (`docker run ...`). I don't see "KV-cache" in those logs. When I run text-generation-launcher from inside the running container I don't see "KV-cache" on the console either. Can you advise?
That's odd, the KV-cache size is logged unconditionally at the info level during warmup. It's only added in TGI 2.4.0, so the message wouldn't be logged in 2.0.5 (though the same applies, the additional memory is used to make a larger KV cache).
Example run with 2.4.0:
```
❯ model=teknium/OpenHermes-2.5-Mistral-7B docker run --rm --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:2.4.0 \
    --model-id teknium/OpenHermes-2.5-Mistral-7B | grep KV-cache
2024-11-20T09:06:10.449760Z INFO text_generation_launcher: KV-cache blocks: 52722, size: 1
```
Thanks @danieldk, I verified that "KV-cache" is present in the 2.4.0 TGI console logs but not in previous versions. I also noticed that the `KV-cache blocks: 123449, size: 1` entry appears to be the same figure as the inferred MAX_BATCH_TOTAL_TOKENS parameter, i.e. KV-cache blocks: 123449 and MAX_BATCH_TOTAL_TOKENS = 123449. Looking at this HF blog, I'm beginning to get the picture that the TGI inference engine is doing more optimization (prefill & decode) than we originally expected from the TGI launcher documentation.
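To compare the two values for a given run, something like the following can be used (my own sketch; `tgi` is a placeholder container name, and the exact wording of the batch-total-tokens log line may differ between TGI versions):

```bash
# Pull both figures out of a running container's logs for a side-by-side look.
docker logs tgi 2>&1 | grep -iE "KV-cache blocks|max.batch.total.tokens"
```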
> Bitsandbytes 8bit. Can be applied on any model, will cut the memory requirement in half
With a better understanding of the TGI memory optimizations for inference, we still have a few outstanding questions.
- We note that the bitsandbytes entry in 2.4.0 is to be deprecated, with eetq recommended instead. Can you confirm?
- Is the term "on-the-fly quantization" equivalent to the "Continuous Batch Optimization", or are they not related at all?
- Can you confirm that with a multi-GPU config (`--gpus 2 -e NUM_SHARD=2`) the GPU memory should be distributed across the two GPUs, for example? (See the sketch after this list.)
- Can you confirm whether `KV-cache blocks:` and `MAX_BATCH_TOTAL_TOKENS` are one and the same, or how do they differ?
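For the sharding question, a simple check is the per-GPU breakdown from nvidia-smi while the sharded container is serving (my own sketch; the halving expectation is my reading of tensor-parallel sharding, not something stated in the Dell/HF docs):

```bash
# Per-GPU memory while a --gpus 2 -e NUM_SHARD=2 container is running.
# With the weights sharded across two GPUs, each GPU should show roughly half
# of the single-GPU weight footprint plus its own pre-allocated KV-cache share.
nvidia-smi --query-gpu=index,name,memory.used,memory.free,memory.total --format=csv
```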
@danieldk when TGI logs the key-value cache and max batch total tokens, how can I translate these back to memory? For example, do these values relate to the free memory displayed by, for example, `nvidia-smi --query-gpu=gpu_name,memory.used,memory.free,memory.total,utilization.memory --format=csv`?
e.g. output:
```
name, memory.used [MiB], memory.free [MiB], memory.total [MiB], utilization.memory [%]
NVIDIA L40S, 27629 MiB, 17961 MiB, 46068 MiB, 95 %
```
Is the GPU used memory equal to the memory needed to load the model, and is the free memory the available KV-cache space?
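For my own sanity check, here is a rough back-of-the-envelope conversion from KV-cache blocks to memory (my sketch, not TGI output; it assumes Llama-3.1-8B's published architecture, an unquantized fp16/bf16 KV cache, and that "size: 1" in the log means one token per block):

```bash
# Rough KV-cache size estimate for the figures seen in the TGI 2.4.0 logs.
BLOCKS=123449        # "KV-cache blocks" from the log
BLOCK_SIZE=1         # "size: 1" in the same line, read as one token per block
LAYERS=32            # Llama-3.1-8B decoder layers
KV_HEADS=8           # grouped-query-attention KV heads
HEAD_DIM=128
BYTES_PER_ELEM=2     # fp16/bf16 KV cache
# factor of 2 for keys and values
BYTES_PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM))
TOTAL_MIB=$((BLOCKS * BLOCK_SIZE * BYTES_PER_TOKEN / 1024 / 1024))
echo "~${BYTES_PER_TOKEN} bytes per cached token, ~${TOTAL_MIB} MiB of KV cache"
# -> ~131072 bytes per token and ~15431 MiB for 123449 blocks
```

If that estimate is in the right ballpark, then as far as I can tell the pre-allocated KV cache is counted inside nvidia-smi's memory.used together with the weights, which could explain why memory.used barely changes with quantization even though the KV-cache block count (and MAX_BATCH_TOTAL_TOKENS) grows.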
@danieldk note that the TGI launcher documentation states:
> MAX_BATCH_TOTAL_TOKENS: Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded).

So is MAX_BATCH_TOTAL_TOKENS in effect using the free GPU memory as the KV cache?
@danieldk any thoughts on the matter ?
For the total token budget estimation, the engine maps the available memory to a total count of processable tokens. First, the engine calculates 95% of the available VRAM, leaving 5% room for error, where Available VRAM = GPU VRAM - Model VRAM - Prefill KV Cache VRAM. The available memory is then divided by the memory required to process a block of tokens [5], yielding the total number of tokens that can be processed simultaneously. This value is set as the MAX_BATCH_TOTAL_TOKENS, essentially the tokens that fit in a block times the number of blocks that fit into memory.
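A worked instance of that formula, with placeholder numbers rather than measurements from this ticket (the model and prefill VRAM figures and the per-token KV cost are my assumptions for an fp16 Llama-3.1-8B on a single L40S):

```bash
# Illustrative token-budget calculation following the formula above.
GPU_VRAM_MIB=46068     # total VRAM of one L40S as reported by nvidia-smi
MODEL_VRAM_MIB=16000   # assumed fp16 weight footprint
PREFILL_KV_MIB=2000    # assumed KV cache reserved for the prefill pass
AVAILABLE_MIB=$(( (GPU_VRAM_MIB - MODEL_VRAM_MIB - PREFILL_KV_MIB) * 95 / 100 ))
KIB_PER_TOKEN=128      # ~0.125 MiB per cached token for Llama-3.1-8B in fp16
TOKEN_BUDGET=$(( AVAILABLE_MIB * 1024 / KIB_PER_TOKEN ))
echo "available: ${AVAILABLE_MIB} MiB -> token budget ~ ${TOKEN_BUDGET} tokens"
```

Under this formula, quantizing the weights lowers the model VRAM term, which raises the available VRAM and therefore the token budget (the KV-cache blocks), rather than lowering the used memory reported by nvidia-smi.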