
On-The-Fly Quantization for Inference appears not to be working as per documentation.

Open · colin-byrneireland1 opened this issue 11 months ago · 7 comments

System Info

Platform: Dell 760xa with 4x NVIDIA L40S GPUs
OS: Ubuntu 22.04.5 LTS
GPU: NVIDIA-SMI 550.90.07, Driver Version 550.90.07, CUDA Version 12.4
Python: 3.10.12
Docker: 26.1.5
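For completeness, the GPU and driver details above come from nvidia-smi; an equivalent query (standard nvidia-smi flags, exact output may vary slightly by driver version) is:

    nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv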

Model: [Deploy Meta-Llama-3.1-8b-Instruct | Dell Enterprise Hub by Hugging Face](https://dell.huggingface.co/authenticated/models/meta-llama/Meta-Llama-3.1-8b-Instruct/deploy/docker)

Tested with two versions of the model container:
  • registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test -> TGI 2.4.0
  • registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct -> TGI 2.0.5.dev0
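If it helps triage, the TGI version baked into each image can be checked directly; this sketch assumes the image entrypoint is text-generation-launcher (as the docker run commands below imply) and that the launcher exposes the usual --version flag:

    docker run --rm registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test --version
    docker run --rm registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct --version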

Information

  • [ ] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

  1. Deploy model from https://dell.huggingface.co/authenticated/models/meta-llama/Meta-Llama-3.1-8b-Instruct/deploy/docker
  2. Run the following command (no quantization). Note: variations were also tried; see step 6 below.
     docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2 -e MAX_BATCH_PREFILL_TOKENS=16182 -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test
  3. From the host where the model is deployed, run nvidia-smi and note the GPU memory usage, e.g. 27629 MiB for each GPU (see the measurement sketch after this list).
  4. Re-run step 2, this time with quantization specified:
     docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2 -e MAX_BATCH_PREFILL_TOKENS=16182 -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test --quantize bitsandbytes
  5. From the host where the model is deployed, run nvidia-smi again and note the GPU memory usage, e.g. 276701 MiB for the first GPU and 27027 MiB for the second GPU.
  6. The expected on-the-fly quantization behaviour was tested across several configuration combinations:
     • TGI container versions 2.4.0 and 2.0.5.dev0 (current DEH version):
       registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test -> TGI 2.4.0
       registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct -> TGI 2.0.5.dev0
     • Single- and dual-GPU runs:
       docker run -it --shm-size 1g -p 80:80 --gpus 1 -e NUM_SHARD=1
       docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2
     • Quantize options: bitsandbytes, bitsandbytes-fp4, bitsandbytes-nf4, fp8, eetq
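As a minimal sketch of the measurement procedure in steps 3 and 5 (assuming only nvidia-smi's standard CSV query flags), per-GPU memory can be captured for the baseline and quantized runs and compared directly:

    # after the baseline run (step 2): record per-GPU memory used, in MiB
    nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits > baseline_mem.txt

    # after the quantized run (step 4): record again and compare side by side
    nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits > quantized_mem.txt
    paste baseline_mem.txt quantized_mem.txt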

Expected behavior

See the attached results PDF: HF-TICKET-Quantization-Results.pdf.

Running e.g. Llama-3.1-8B-Instruct with --quantize bitsandbytes, we see only minor or insignificant differences in GPU memory utilization. Note: both TGI container versions show similar signatures.

On-the-fly quantization for inference does not appear to be working as expected.

The documentation's claims that 8-bit quantization "will cut the memory requirement in half" and that 4-bit quantization "will cut the memory requirement by 4x" are not reflected in the observed GPU memory usage.
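For context, a rough back-of-the-envelope calculation of the weight-only footprint of an ~8B-parameter model (ignoring the KV cache, activations and CUDA overhead, all of which nvidia-smi also counts) shows the scale of reduction those claims imply:

    # approximate weight-only memory for ~8.03e9 parameters (Llama-3.1-8B)
    awk 'BEGIN {
      p = 8.03e9;
      printf "bf16 weights        : ~%.1f GiB\n", p * 2   / 2^30;
      printf "int8 (bitsandbytes) : ~%.1f GiB\n", p * 1   / 2^30;
      printf "4-bit (fp4/nf4)     : ~%.1f GiB\n", p * 0.5 / 2^30;
    }'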

  • Do bitsandbytes-fp4 and bitsandbytes-nf4 work?
  • Does fp8 quantization work for on-the-fly inference quantization?
  • Does eetq quantization work for on-the-fly inference quantization?
  • Does on-the-fly quantization work with multi-GPU instances?
  • Should different input-token configurations be used to see meaningful quantization results? e.g. { MAX_BATCH_PREFILL_TOKENS=16182, MAX_INPUT_TOKENS=8000, MAX_TOTAL_TOKENS=8192 }
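One further check that may help triage: confirming from the container logs that the --quantize flag actually reaches the launcher in each run. This assumes the launcher echoes its parsed arguments at startup, which recent TGI versions appear to do:

    # after starting the container with --quantize bitsandbytes
    docker ps                                   # find the running container ID
    docker logs <container-id> 2>&1 | grep -i quantize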

colin-byrneireland1 · Nov 15 '24 10:11