CUDA Out of memory when using the benchmarking tool with batch size greater than 1
System Info
- TGI v3.0.1
- OS: GCP Container-Optimized OS
- 4xL4 GPUs (24GB memory each)
- Model: hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
As soon as I run the TGI benchmarking tool (text-generation-benchmark) with the input length needed for our use case and a batch size of 2, I get a CUDA out-of-memory error and the TGI server stops.
TGI starting command:
docker run -d --network shared_network_no_internet \
--volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
--volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
--volume /mnt/disks/model:/mnt/disks/model \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia1:/dev/nvidia1 \
--device /dev/nvidia2:/dev/nvidia2 \
--device /dev/nvidia3:/dev/nvidia3 \
--device /dev/nvidia-uvm:/dev/nvidia-uvm \
--device /dev/nvidiactl:/dev/nvidiactl \
-e HF_HUB_OFFLINE=true \
-e CUDA_VISIBLE_DEVICES=0,1,2,3 \
--shm-size 1g \
--name tgi-llama \
ghcr.io/huggingface/text-generation-inference:3.0.1 \
--model-id /mnt/disks/model/models--hugging-quants--Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
--port 3000 \
--sharded true \
--num-shard 4 \
--max-input-length 16500 \
--max-total-tokens 17500 \
--max-batch-size 7 \
--max-batch-total-tokens 125000 \
--max-concurrent-requests 30 \
--quantize awq
When starting TGI without --max-batch-total-tokens, the logs showed that I had a max batch total tokens budget of 134902 available. That's why I came up with a config like --max-batch-size 7 --max-batch-total-tokens 125000.
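For reference, my (possibly naive) accounting behind those numbers was simply max batch size times the per-request token cap:

# 7 concurrent requests, each capped at --max-total-tokens 17500
echo $(( 7 * 17500 ))   # 122500, below both 125000 and the reported 134902 budget

The relevant startup logs: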
INFO text_generation_launcher: Using prefill chunking = True
INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
INFO shard-manager: text_generation_launcher: Shard ready in 148.169397034s rank=3
INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
INFO shard-manager: text_generation_launcher: Shard ready in 150.767169532s rank=2
INFO shard-manager: text_generation_launcher: Shard ready in 150.800946226s rank=0
INFO shard-manager: text_generation_launcher: Shard ready in 150.812528352s rank=1
INFO text_generation_launcher: Starting Webserver
INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
INFO text_generation_launcher: Using optimized Triton indexing kernels.
INFO text_generation_launcher: KV-cache blocks: 134902, size: 1
INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 134902
WARN text_generation_router_v3::backend: backends/v3/src/backend.rs:39: Model supports prefill chunking. waiting_served_ratio and max_waiting_tokens will be ignored.
INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
INFO text_generation_router::server: router/src/server.rs:1873: Using config Some(Llama)
This is what the GPU memory usage looks like after server startup:
I then run the benchmarking tool like this and get the OOM error:
text-generation-benchmark \
--tokenizer-name /mnt/disks/model/models--hugging-quants--Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
--sequence-length 6667 \
--decode-length 1000 \
--batch-size 2
Information
- [x] Docker
- [x] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
I run the benchmarking tool like this and get the OOM error:
text-generation-benchmark \
--tokenizer-name /mnt/disks/model/models--hugging-quants--Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
--sequence-length 6667 \
--decode-length 1000 \
--batch-size 2
This is what the error looks like:
2025-01-24T14:15:44.533276Z ERROR prefill{id=0 size=2}:prefill{id=0 size=2}: text_generation_client: backends/client/src/lib.rs:46: Server error: CUDA out of memory. Tried to allocate 182.00 MiB. GPU 1 has a total capacity of 22.06 GiB of which 103.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, with 24.46 MiB allocated in private pools (e.g., CUDA Graphs), and 41.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-01-24T14:15:44.535857Z ERROR prefill{id=0 size=2}:prefill{id=0 size=2}: text_generation_client: backends/client/src/lib.rs:46: Server error: CUDA out of memory. Tried to allocate 182.00 MiB. GPU 3 has a total capacity of 22.06 GiB of which 103.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, with 24.46 MiB allocated in private pools (e.g., CUDA Graphs), and 41.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-01-24T14:15:44.536587Z ERROR prefill{id=0 size=2}:prefill{id=0 size=2}: text_generation_client: backends/client/src/lib.rs:46: Server error: CUDA out of memory. Tried to allocate 182.00 MiB. GPU 0 has a total capacity of 22.06 GiB of which 103.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, with 24.46 MiB allocated in private pools (e.g., CUDA Graphs), and 41.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-01-24T14:15:44.536907Z ERROR prefill{id=0 size=2}:prefill{id=0 size=2}: text_generation_client: backends/client/src/lib.rs:46: Server error: CUDA out of memory. Tried to allocate 182.00 MiB. GPU 2 has a total capacity of 22.06 GiB of which 103.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, with 24.46 MiB allocated in private pools (e.g., CUDA Graphs), and 41.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
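The error message itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. I have not tried that yet; if it is worth testing, I assume it would just be an extra environment variable on the same docker run command, e.g.:

docker run -d \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  ... # rest of the command unchanged

though with only ~41 MiB reported as reserved-but-unallocated above, fragmentation does not look like the main issue.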
Expected behavior
Because of the reported "Setting max batch total tokens to 134902", I expect the TGI server to be able to handle requests of 6667 tokens in batches of 2. Is that not the case? What am I missing here?
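My back-of-the-envelope math (assuming both the prefill and decode tokens of both requests count against that budget) is:

# 2 requests, each with 6667 input tokens and up to 1000 decode tokens
echo $(( 2 * (6667 + 1000) ))   # 13334 prefill + 2000 decode = 15334 tokens, far below 134902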
Is it possible that the benchmarking tool is doing something weird?
Thank you!