Excessive use of VRAM for Llama 3.1 8B
System Info
- text-generation-inference 2.3.0, deployed with Docker
- model info:
```json
{
  "model_id": "meta-llama/Llama-3.1-8B-Instruct",
  "model_sha": "0e9e39f249a16976918f6564b8830bc894c89659",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_tokens": 5000,
  "max_total_tokens": 6024,
  "validation_workers": 2,
  "max_client_batch_size": 4,
  "router": "text-generation-router",
  "version": "2.3.1-dev0",
  "sha": "169178b937d0c4173b0fdcd6bf10a858cfe4f428",
  "docker_label": "sha-169178b"
}
```
- Ubuntu 22.04
- 4x NVIDIA L40S 48 GB, Driver Version: 560.35.03, CUDA Version: 12.6
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
Steps to reproduce:
- Run the following docker compose file:
```yaml
services:
  tgi:
    container_name: tgi
    image: ghcr.io/huggingface/text-generation-inference:2.3.0
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
    shm_size: '192gb'
    ports:
      - 6500:80
    environment:
      - HF_TOKEN=<your-hf-token>
      - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
      - SHARDED=true
      - NUM_SHARD=4
      - MAX_BATCH_SIZE=1
      - CUDA_MEMORY_FRACTION=1
      - MAX_INPUT_TOKENS=5000
      - MAX_TOTAL_TOKENS=6024
```
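Judging by the "Setting max batch total tokens to 1084130" line in the logs below, the KV-cache budget appears to be inferred from free VRAM when it is not set explicitly. As a workaround sketch (assuming the `MAX_BATCH_TOTAL_TOKENS` environment variable maps to the `--max-batch-total-tokens` launcher flag in the same way as the other variables above), capping it explicitly should bound the allocation:

```yaml
    environment:
      # same environment block as above, plus an explicit cap;
      # hypothetical workaround: without it the budget seems to be
      # inferred from whatever VRAM is free after loading the weights
      - MAX_BATCH_TOTAL_TOKENS=6024
```

With MAX_BATCH_SIZE=1 and MAX_TOTAL_TOKENS=6024, no single batch can ever need more than 6024 cached tokens, so this cap should be safe if it is honored.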
Output logs:
```
2024-10-07T06:30:47.292774Z INFO text_generation_launcher: Args {
model_id: "meta-llama/Meta-Llama-3.1-8B-Instruct",
revision: None,
validation_workers: 2,
sharded: Some(
true,
),
num_shard: Some(
4,
),
quantize: None,
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: Some(
5000,
),
max_input_length: None,
max_total_tokens: Some(
6024,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: Some(
1,
),
cuda_graphs: None,
hostname: "eeb1ec72b169",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: Some(
"xxx",
),
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
}
2024-10-07T06:30:47.293479Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-10-07T06:30:47.293484Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 5000
2024-10-07T06:30:47.293487Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-10-07T06:30:47.293489Z INFO text_generation_launcher: Sharding model on 4 processes
2024-10-07T06:30:47.293556Z INFO download: text_generation_launcher: Starting check and download process for meta-llama/Meta-Llama-3.1-8B-Instruct
2024-10-07T06:30:49.757621Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-10-07T06:30:50.197536Z INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Meta-Llama-3.1-8B-Instruct
2024-10-07T06:30:50.197749Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-10-07T06:30:50.197801Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-10-07T06:30:50.198341Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
2024-10-07T06:30:50.198363Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
2024-10-07T06:30:52.534518Z INFO text_generation_launcher: Using prefix caching = True
2024-10-07T06:30:52.534550Z INFO text_generation_launcher: Using Attention = flashinfer
2024-10-07T06:31:00.208692Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-10-07T06:31:00.209210Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-10-07T06:31:00.209701Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-10-07T06:31:00.209782Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-10-07T06:31:03.994463Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-10-07T06:31:04.012487Z INFO shard-manager: text_generation_launcher: Shard ready in 13.812430798s rank=1
2024-10-07T06:31:04.291933Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-10-07T06:31:04.292206Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2024-10-07T06:31:04.292206Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2024-10-07T06:31:04.313117Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113306239s rank=0
2024-10-07T06:31:04.313524Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113391818s rank=3
2024-10-07T06:31:04.313770Z INFO shard-manager: text_generation_launcher: Shard ready in 14.113394813s rank=2
2024-10-07T06:31:04.411975Z INFO text_generation_launcher: Starting Webserver
2024-10-07T06:31:04.490925Z INFO text_generation_router_v3: backends/v3/src/lib.rs:90: Warming up model
2024-10-07T06:31:05.160689Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-10-07T06:31:06.175895Z INFO text_generation_router_v3: backends/v3/src/lib.rs:102: Setting max batch total tokens to 1084130
2024-10-07T06:31:06.175942Z INFO text_generation_router_v3: backends/v3/src/lib.rs:126: Using backend V3
2024-10-07T06:31:06.175988Z INFO text_generation_router::server: router/src/server.rs:1797: Using the Hugging Face API
2024-10-07T06:31:06.908180Z INFO text_generation_router::server: router/src/server.rs:2515: Serving revision 0e9e39f249a16976918f6564b8830bc894c89659 of model meta-llama/Llama-3.1-8B-Instruct
2024-10-07T06:31:08.905070Z INFO text_generation_router::server: router/src/server.rs:1943: Using config Some(Llama)
2024-10-07T06:31:08.905115Z WARN text_generation_router::server: router/src/server.rs:2090: Invalid hostname, defaulting to 0.0.0.0
2024-10-07T06:31:08.954949Z INFO text_generation_router::server: router/src/server.rs:2477: Connected
```
Expected behavior
Since I have specified MAX_TOTAL_TOKENS=6024 and MAX_BATCH_SIZE=1 in the environment variables, I would expect the max batch total tokens to be 6024. Instead, as can be seen in the logs, the inferred max batch total tokens is set to 1,084,130 and VRAM usage goes up to 160 GB! According to my calculations (based on this article), the model should use 16 GB of memory for the weights plus an extra ~3 GB for 6024 tokens (0.5 MiB per token for this particular model); correct me if I'm wrong.
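For reference, a back-of-the-envelope estimate of the KV-cache cost per token, assuming the published Llama 3.1 8B configuration (32 layers, 8 KV heads under GQA, head dimension 128) and fp16 cache entries; if anything this comes out lower than the article's 0.5 MiB figure, since Llama 3.1 uses grouped-query attention:

$$
\underbrace{2}_{K\ \text{and}\ V} \times \underbrace{32}_{\text{layers}} \times \underbrace{8}_{\text{KV heads}} \times \underbrace{128}_{\text{head dim}} \times \underbrace{2}_{\text{bytes, fp16}} = 131{,}072\ \text{bytes} = 128\ \text{KiB per token}
$$

At 128 KiB per token, 6024 tokens would need only about 0.74 GiB of cache, while the inferred budget of 1,084,130 tokens works out to roughly 132 GiB, which would account for most of the 160 GB I am seeing.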
To sum up:
- Expected VRAM usage: 20 GB
- Actual VRAM usage: 160 GB
What could be causing this behavior? Am I doing something wrong, or is this a bug?