"torch.cuda.OutOfMemoryError: CUDA out of memory" when deploy LLM with TGI in Kubernetes cluster
When deploying the LLM with TGI in a k8s cluster using the pod spec below:
apiVersion: v1
kind: Pod
metadata:
  name: text-generation-inference
  labels:
    run: text-generation-inference
spec:
  containers:
    - name: text-generation-inference
      image: ghcr.io/huggingface/text-generation-inference:latest
      env:
        - name: RUST_BACKTRACE
          value: "1"
      command:
        - "text-generation-launcher"
        - "--model-id"
        - "macadeliccc/laser-dolphin-mixtral-2x7b-dpo"
        - "--num-shard"
        - "1"
      ports:
        - containerPort: 80
          name: http
      volumeMounts:
        - name: falcon-40b-instruct
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: falcon-40b-instruct
      persistentVolumeClaim:
        claimName: falcon-40b-instruct
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
  restartPolicy: Never
I get the following error log in the pod:
2024-03-31T18:30:09.175228Z INFO text_generation_launcher: Args { model_id: "macadeliccc/laser-dolphin-mixtral-2x7b-dpo", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, enable_cuda_graphs: false, hostname: "text-generation-inference", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-03-31T18:30:09.175390Z INFO download: text_generation_launcher: Starting download process.
2024-03-31T18:30:12.843982Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-03-31T18:30:13.781674Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-03-31T18:30:13.781977Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-03-31T18:30:18.238592Z WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2
2024-03-31T18:30:23.795409Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-31T18:30:33.806809Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-31T18:30:43.817973Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-31T18:30:53.829077Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-31T18:31:03.840976Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-31T18:31:04.932785Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner model = get_model( File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 472, in get_model return FlashMixtral( File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mixtral.py", line 22, in init super(FlashMixtral, self).init( File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 356, in init model = model_cls(config, weights) File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 814, in init self.model = MixtralModel(config, weights) File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 751, in init [ File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 752, in
MixtralLayer( File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 691, in init self.moe = moe_cls(f"{prefix}.block_sparse_moe", config, weights) File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 362, in init self.w3 = _load_experts(config, f"{prefix}.experts", "w3", weights) File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 184, in _load_experts tensor[i * block_size : (i + 1) * block_size] = expert_slice.to( torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 81.56 MiB is free. Process 166095 has 14.50 GiB memory in use. Of the allocated memory 14.20 GiB is allocated by PyTorch, and 189.40 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2024-03-31T18:31:06.544787Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete return future.result()
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 472, in get_model return FlashMixtral(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mixtral.py", line 22, in init super(FlashMixtral, self).init(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 356, in init model = model_cls(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 814, in init self.model = MixtralModel(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 751, in init [
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 752, in
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 691, in init self.moe = moe_cls(f"{prefix}.block_sparse_moe", config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 362, in init self.w3 = _load_experts(config, f"{prefix}.experts", "w3", weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 184, in _load_experts tensor[i * block_size : (i + 1) * block_size] = expert_slice.to(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 81.56 MiB is free. Process 166095 has 14.50 GiB memory in use. Of the allocated memory 14.20 GiB is allocated by PyTorch, and 189.40 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF rank=0
2024-03-31T18:31:06.635158Z ERROR text_generation_launcher: Shard 0 failed to start
2024-03-31T18:31:06.635190Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
Has anyone seen a similar issue? I am quite new to this area and cannot find the root cause.
What's the GPU?
You are using compute_cap 7.5, so I'll guess a T4. A T4 simply doesn't have enough VRAM to run this model out of the box; you could try using quantization:
https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#quantize
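For rough sizing: the model is a ~13B-parameter MoE, so the fp16 weights alone are on the order of 26 GB, well beyond the T4's ~15 GiB, while 4-bit quantization brings them down to roughly 7 GB. Here is a minimal sketch of the pod change, assuming on-the-fly bitsandbytes NF4 quantization via the launcher's --quantize flag (only the container command from the pod spec above is shown; the last two args are the addition):

      # container command from the pod spec above, with quantization enabled
      command:
        - "text-generation-launcher"
        - "--model-id"
        - "macadeliccc/laser-dolphin-mixtral-2x7b-dpo"
        - "--num-shard"
        - "1"
        - "--quantize"          # see the launcher docs linked above
        - "bitsandbytes-nf4"    # 4-bit NF4; awq/gptq expect pre-quantized weights

Note that bitsandbytes quantization is the easiest drop-in but tends to be slower at inference; if a pre-quantized AWQ or GPTQ checkpoint of this model exists, it would generally be faster.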