"torch.cuda.OutOfMemoryError: CUDA out of memory" when deploy LLM with TGI in Kubernetes cluster
When deploying the LLM with TGI in a k8s cluster using the pod spec below:
apiVersion: v1
kind: Pod
metadata:
  name: text-generation-inference
  labels:
    run: text-generation-inference
spec:
  containers:
    - name: text-generation-inference
      image: ghcr.io/huggingface/text-generation-inference:latest
      env:
        - name: RUST_BACKTRACE
          value: "1"
      command:
        - "text-generation-launcher"
        - "--model-id"
        - "macadeliccc/laser-dolphin-mixtral-2x7b-dpo"
        - "--num-shard"
        - "1"
      ports:
        - containerPort: 80
          name: http
      volumeMounts:
        - name: falcon-40b-instruct
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: falcon-40b-instruct
      persistentVolumeClaim:
        claimName: falcon-40b-instruct
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
  restartPolicy: Never
I get the following error log in the pod:
2024-03-31T18:30:09.175228Z INFO text_generation_launcher: Args { model_id: "macadeliccc/laser-dolphin-mixtral-2x7b-dpo", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, enable_cuda_graphs: false, hostname: "text-generation-inference", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-03-31T18:30:09.175390Z INFO download: text_generation_launcher: Starting download process.
2024-03-31T18:30:12.843982Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-03-31T18:30:13.781674Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-03-31T18:30:13.781977Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-03-31T18:30:18.238592Z WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2
2024-03-31T18:30:23.795409Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-31T18:30:33.806809Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-31T18:30:43.817973Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-31T18:30:53.829077Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-31T18:31:03.840976Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-31T18:31:04.932785Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner model = get_model( File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 472, in get_model return FlashMixtral( File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mixtral.py", line 22, in init super(FlashMixtral, self).init( File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 356, in init model = model_cls(config, weights) File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 814, in init self.model = MixtralModel(config, weights) File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 751, in init [ File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 752, in
MixtralLayer( File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 691, in init self.moe = moe_cls(f"{prefix}.block_sparse_moe", config, weights) File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 362, in init self.w3 = _load_experts(config, f"{prefix}.experts", "w3", weights) File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 184, in _load_experts tensor[i * block_size : (i + 1) * block_size] = expert_slice.to( torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 81.56 MiB is free. Process 166095 has 14.50 GiB memory in use. Of the allocated memory 14.20 GiB is allocated by PyTorch, and 189.40 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2024-03-31T18:31:06.544787Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete return future.result()
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 472, in get_model return FlashMixtral(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mixtral.py", line 22, in init super(FlashMixtral, self).init(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 356, in init model = model_cls(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 814, in init self.model = MixtralModel(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 751, in init [
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 752, in
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 691, in init self.moe = moe_cls(f"{prefix}.block_sparse_moe", config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 362, in init self.w3 = _load_experts(config, f"{prefix}.experts", "w3", weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 184, in _load_experts tensor[i * block_size : (i + 1) * block_size] = expert_slice.to(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 81.56 MiB is free. Process 166095 has 14.50 GiB memory in use. Of the allocated memory 14.20 GiB is allocated by PyTorch, and 189.40 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF rank=0
2024-03-31T18:31:06.635158Z ERROR text_generation_launcher: Shard 0 failed to start
2024-03-31T18:31:06.635190Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
Has anyone seen a similar issue? I am quite new to this area and cannot find the root cause.
What's the GPU?
You are using compute_cap 7.5, so I'll guess a T4. A T4 simply doesn't have enough VRAM to run this model out of the box; you could try using quantization:
https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#quantize
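For rough sizing: the model is a ~13B-parameter MoE, so the fp16 weights alone are on the order of 26 GB, well beyond the T4's ~15 GiB, while 4-bit quantization brings them down to roughly 7 GB. Here is a minimal sketch of the pod change, assuming on-the-fly bitsandbytes NF4 quantization via the launcher's --quantize flag (only the container command from the pod spec above is shown; the last two args are the addition):

      # container command from the pod spec above, with quantization enabled
      command:
        - "text-generation-launcher"
        - "--model-id"
        - "macadeliccc/laser-dolphin-mixtral-2x7b-dpo"
        - "--num-shard"
        - "1"
        - "--quantize"          # see the launcher docs linked above
        - "bitsandbytes-nf4"    # 4-bit NF4; awq/gptq expect pre-quantized weights

Note that bitsandbytes quantization is the easiest drop-in but tends to be slower at inference; if a pre-quantized AWQ or GPTQ checkpoint of this model exists, it would generally be faster.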