text-generation-inference
On-The-Fly Quantization for Inference appears not to be working as per documentation.
System Info
- Platform: Dell 760xa with 4x L40S GPUs
- OS: Ubuntu 22.04.5 LTS
- NVIDIA driver: 550.90.07 (NVIDIA-SMI 550.90.07)
- CUDA: 12.4
- Python: 3.10.12
- Docker: 26.1.5
Model: [Deploy Meta-Llama-3.1-8b-Instruct | Dell Enterprise Hub by Hugging Face](https://dell.huggingface.co/authenticated/models/meta-llama/Meta-Llama-3.1-8b-Instruct/deploy/docker)
Tested with two versions of the model container:
- registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test -> TGI 2.4.0
- registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct -> TGI 2.0.5.dev0
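For reference, one way to confirm which TGI version a given container is actually running (a sketch only, assuming the container is already up and mapped to host port 80 as in the commands below) is to query the server's /info endpoint, which includes the version in its JSON response:

```sh
# Query the running TGI server's /info endpoint; the response JSON includes a
# "version" field. Host port 80 is assumed, matching the docker run commands
# in the Reproduction section below.
curl -s http://localhost:80/info | python3 -m json.tool
```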
Information
- [ ] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
1. Deploy the model from https://dell.huggingface.co/authenticated/models/meta-llama/Meta-Llama-3.1-8b-Instruct/deploy/docker
2. Run the following command (no quantization). (Note: variations were also tried; see 6) below.)
   docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2 -e MAX_BATCH_PREFILL_TOKENS=16182 -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test
3. On the host where the model is deployed, run nvidia-smi and note the GPU memory usage, e.g. 27629 MiB on each GPU.
4. Re-run step 2, this time with quantization specified:
   docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2 -e MAX_BATCH_PREFILL_TOKENS=16182 -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test --quantize bitsandbytes
5. On the host where the model is deployed, run nvidia-smi again and note the GPU memory usage, e.g. 276701 MiB on the 1st GPU and 27027 MiB on the 2nd GPU.
6. The expected on-the-fly memory reduction was tested across several model configuration combinations (a scripted sweep of these options is sketched after the list below):
TGI container versions 2.4.0 & 2.0.5.dev0 (current DEH version):
- registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test -> TGI 2.4.0
- registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct -> TGI 2.0.5.dev0
Single and dual GPU runs (remaining arguments as in steps 2 and 4 above):
- docker run -it --shm-size 1g -p 80:80 --gpus 1 -e NUM_SHARD=1
- docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2
Quantize options: bitsandbytes, bitsandbytes-fp4, bitsandbytes-nf4, fp8, eetq
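The per-option comparison can be scripted; the sketch below is illustrative only (the container name tgi-quant-test, the fixed sleep, and the log file name are arbitrary choices; the image and environment variables are copied from the commands above):

```sh
#!/usr/bin/env bash
# Sketch: start the TGI container once per --quantize option, wait for the
# weights to load, then record per-GPU memory from nvidia-smi.
IMAGE=registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test
for Q in none bitsandbytes bitsandbytes-fp4 bitsandbytes-nf4 fp8 eetq; do
  EXTRA=""
  [ "$Q" != "none" ] && EXTRA="--quantize $Q"
  docker run -d --name tgi-quant-test --shm-size 1g -p 80:80 --gpus 2 \
    -e NUM_SHARD=2 -e MAX_BATCH_PREFILL_TOKENS=16182 \
    -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 \
    "$IMAGE" $EXTRA
  sleep 180   # crude wait for model load; adjust, or poll the /health endpoint
  echo "=== quantize=$Q ===" >> quant-memory.log
  nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv >> quant-memory.log
  docker rm -f tgi-quant-test
done
```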
Expected behavior
See the attached PDF, HF-TICKET-Quantization-Results.pdf, for the full results.
Running e.g. Llama 3.1 8B Instruct with --quantize bitsandbytes, we see only minor or insignificant differences in GPU memory utilization. Note: both TGI container versions show a similar signature.
On-the-fly quantization for inference does not appear to be working as expected: per the documentation, 8-bit quantization (bitsandbytes) will cut the memory requirement in half, and 4-bit quantization (bitsandbytes-fp4 / bitsandbytes-nf4) will cut the memory requirement by 4x, but we do not observe anything close to this.
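As a sanity check on what those ratios should mean for this model, a back-of-the-envelope estimate of weight memory for an 8B-parameter model (weights only; KV cache, activations, and the CUDA context come on top, and these are estimates, not measurements):

```
8e9 params * 2.0 bytes (fp16/bf16)            ≈ 16 GB of weights
8e9 params * 1.0 byte  (8-bit, bitsandbytes)  ≈  8 GB of weights
8e9 params * 0.5 byte  (4-bit, fp4/nf4)       ≈  4 GB of weights
```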
Do bitsandbytes-fp4 and bitsandbytes-nf4 work?
Does fp8 quantization work for on-the-fly inference quantization?
Does eetq quantization work for on-the-fly inference quantization?
Does on-the-fly quantization work with multi-GPU instances?
Should different input-token configurations be used to see meaningful quantization results? E.g.
{ MAX_BATCH_PREFILL_TOKENS=16182, MAX_INPUT_TOKENS=8000, MAX_TOTAL_TOKENS=8192 }