
google/gemma-3-27b-it context length issue

Open nskpro-cmd opened this issue 9 months ago • 6 comments

I have deployed the google/gemma-3-27b-it model on 4 H100 GPUs, but it only supports a 23k context length. When I increased this to the 128k context window the model is supposed to support, I ended up with the following errors.

I even tried a 64k context window, and it ran into CUDA out-of-memory errors.

2025-03-13T08:36:37.262517Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.85.0
Commit sha: 411a28288de9218e2684dccbace481a1abdb0cef
Docker label: sha-411a282
nvidia-smi:
Thu Mar 13 08:36:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:45:00.0 Off |                    0 |
| N/A   29C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000001:1B:00.0 Off |                    0 |
| N/A   31C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000001:24:00.0 Off |                    0 |
| N/A   28C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
xpu-smi: N/A
hpu-smi: N/A

2025-03-13T08:36:37.262563Z INFO text_generation_launcher: Args { model_id: "google/gemma-3-27b-it", revision: None, validation_workers: 2, sharded: Some( true, ), num_shard: Some( 4, ), quantize: None, speculate: None, dtype: None, kv_cache_dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: Some( 32000, ), max_input_length: None, max_total_tokens: Some( 64000, ), waiting_served_ratio: 0.3, max_batch_prefill_tokens: Some( 32000, ), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "gemma-3-27b-it-5d7964566c-xnkck", port: 8000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some( "/huggingface/hub", ), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], api_key: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 1, lora_adapters: None, usage_stats: Off, payload_limit: 2000000, enable_prefill_logprobs: false, } 2025-03-13T08:36:40.043396Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching False 2025-03-13T08:36:40.043429Z INFO text_generation_launcher: Sharding model on 4 processes 2025-03-13T08:36:40.043433Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32] 2025-03-13T08:36:40.043785Z INFO download: text_generation_launcher: Starting check and download process for google/gemma-3-27b-it 2025-03-13T08:36:43.498233Z INFO text_generation_launcher: Files are already present on the host. Skipping download. 2025-03-13T08:36:44.060714Z INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-3-27b-it 2025-03-13T08:36:44.061471Z INFO shard-manager: text_generation_launcher: Starting shard rank=0 2025-03-13T08:36:44.590395Z INFO shard-manager: text_generation_launcher: Starting shard rank=1 2025-03-13T08:36:45.196166Z INFO shard-manager: text_generation_launcher: Starting shard rank=2 2025-03-13T08:36:45.867258Z INFO shard-manager: text_generation_launcher: Starting shard rank=3 2025-03-13T08:36:47.973482Z INFO text_generation_launcher: Using prefix caching = False 2025-03-13T08:36:47.973534Z INFO text_generation_launcher: Using Attention = flashinfer 2025-03-13T08:36:54.083888Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-13T08:36:54.609747Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1 2025-03-13T08:36:55.216572Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2 2025-03-13T08:36:55.888966Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3 2025-03-13T08:37:04.091352Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-13T08:37:04.617169Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1 2025-03-13T08:37:05.224253Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2 2025-03-13T08:37:05.896938Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... 
rank=3 2025-03-13T08:37:14.098533Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-13T08:37:14.624769Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1 2025-03-13T08:37:15.231953Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2 2025-03-13T08:37:15.904796Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3 2025-03-13T08:37:24.105963Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-13T08:37:24.632677Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1 2025-03-13T08:37:25.239656Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2 2025-03-13T08:37:25.912803Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3 2025-03-13T08:37:34.113333Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-13T08:37:34.641461Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1 2025-03-13T08:37:35.247092Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2 2025-03-13T08:37:35.920604Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3 2025-03-13T08:37:44.120842Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-13T08:37:44.649364Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1 2025-03-13T08:37:45.254347Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2 2025-03-13T08:37:45.928487Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3 2025-03-13T08:37:54.128489Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-13T08:37:54.657147Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1 2025-03-13T08:37:55.261709Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2 2025-03-13T08:37:55.936555Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3 2025-03-13T08:38:04.135901Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-13T08:38:04.664958Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1 2025-03-13T08:38:05.269205Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2 2025-03-13T08:38:05.944561Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3 2025-03-13T08:38:14.143354Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-13T08:38:14.672706Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1 2025-03-13T08:38:15.276730Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2 2025-03-13T08:38:15.952321Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... 
rank=3 2025-03-13T08:38:18.500055Z INFO text_generation_launcher: Using prefill chunking = False 2025-03-13T08:38:19.085091Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1 2025-03-13T08:38:19.176301Z INFO shard-manager: text_generation_launcher: Shard ready in 94.574638951s rank=1 2025-03-13T08:38:21.300395Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2 2025-03-13T08:38:21.301426Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0 2025-03-13T08:38:21.301937Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3 2025-03-13T08:38:21.348798Z INFO shard-manager: text_generation_launcher: Shard ready in 97.272539231s rank=0 2025-03-13T08:38:21.356498Z INFO shard-manager: text_generation_launcher: Shard ready in 95.475191243s rank=3 2025-03-13T08:38:21.385097Z INFO shard-manager: text_generation_launcher: Shard ready in 96.176034962s rank=2 2025-03-13T08:38:22.958763Z INFO text_generation_launcher: Starting Webserver 2025-03-13T08:38:23.126019Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model 2025-03-13T08:38:23.330948Z INFO text_generation_launcher: Using optimized Triton indexing kernels. 2025-03-13T08:38:25.345859Z ERROR text_generation_launcher: Method Warmup encountered an error. Traceback (most recent call last): File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1585, in warmup _, _batch, _ = self.generate_token(batch) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner return func(*args, **kwds) File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token out, speculative_logits = self.forward(batch, adapter_data) File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward logits, speculative_logits = self.model.forward( File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward hidden_states = self.text_model.model( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward hidden_states, residual = layer( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward attn_output = self.self_attn( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 296, in forward attn_output = F.scaled_dot_product_attention( torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.52 GiB. 
GPU 3 has a total capacity of 79.10 GiB of which 14.37 GiB is free. Process 3342359 has 64.72 GiB memory in use. 79.10 GiB allowed; Of the allocated memory 62.19 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/src/.venv/bin/text-generation-server", line 10, in sys.exit(app()) File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in call return get_command(self)(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in call return self.main(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main return _main( File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main rv = self.invoke(ctx) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke return __callback(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper return callback(**use_params) File "/usr/src/server/text_generation_server/cli.py", line 119, in serve server.serve( File "/usr/src/server/text_generation_server/server.py", line 315, in serve asyncio.run( File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run return runner.run(main) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete self.run_forever() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever self._run_once() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once handle._run() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run self._context.run(self._callback, *self._args) File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method return await self.intercept(

File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept return await response File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor raise error File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor return await behavior(request_or_iterator, context) File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup self.model.warmup(batch, max_input_tokens, max_total_tokens) File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1587, in warmup raise RuntimeError( RuntimeError: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens 2025-03-13T08:38:25.349736Z ERROR text_generation_launcher: Method Warmup encountered an error. Traceback (most recent call last): File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1585, in warmup _, _batch, _ = self.generate_token(batch) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner return func(*args, **kwds) File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token out, speculative_logits = self.forward(batch, adapter_data) File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward logits, speculative_logits = self.model.forward( File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward hidden_states = self.text_model.model( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward hidden_states, residual = layer( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward attn_output = self.self_attn( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 296, in forward attn_output = F.scaled_dot_product_attention( torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.52 GiB. GPU 1 has a total capacity of 79.10 GiB of which 14.37 GiB is free. Process 3342101 has 64.72 GiB memory in use. 79.10 GiB allowed; Of the allocated memory 62.19 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/src/.venv/bin/text-generation-server", line 10, in sys.exit(app()) File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in call return get_command(self)(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in call return self.main(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main return _main( File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main rv = self.invoke(ctx) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke return __callback(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper return callback(**use_params) File "/usr/src/server/text_generation_server/cli.py", line 119, in serve server.serve( File "/usr/src/server/text_generation_server/server.py", line 315, in serve asyncio.run( File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run return runner.run(main) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete self.run_forever() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever self._run_once() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once handle._run() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run self._context.run(self._callback, *self._args) File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method return await self.intercept(

File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept return await response File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor raise error File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor return await behavior(request_or_iterator, context) File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup self.model.warmup(batch, max_input_tokens, max_total_tokens) File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1587, in warmup raise RuntimeError( RuntimeError: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens 2025-03-13T08:38:25.350178Z ERROR text_generation_launcher: Method Warmup encountered an error. Traceback (most recent call last): File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1585, in warmup _, _batch, _ = self.generate_token(batch) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner return func(*args, **kwds) File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token out, speculative_logits = self.forward(batch, adapter_data) File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward logits, speculative_logits = self.model.forward( File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward hidden_states = self.text_model.model( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward hidden_states, residual = layer( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward attn_output = self.self_attn( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 296, in forward attn_output = F.scaled_dot_product_attention( torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.52 GiB. GPU 2 has a total capacity of 79.10 GiB of which 14.37 GiB is free. Process 3342216 has 64.72 GiB memory in use. 79.10 GiB allowed; Of the allocated memory 62.19 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/src/.venv/bin/text-generation-server", line 10, in sys.exit(app()) File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in call return get_command(self)(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in call return self.main(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main return _main( File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main rv = self.invoke(ctx) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke return __callback(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper return callback(**use_params) File "/usr/src/server/text_generation_server/cli.py", line 119, in serve server.serve( File "/usr/src/server/text_generation_server/server.py", line 315, in serve asyncio.run( File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run return runner.run(main) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete self.run_forever() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever self._run_once() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once handle._run() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run self._context.run(self._callback, *self._args) File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method return await self.intercept(

File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept return await response File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor raise error File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor return await behavior(request_or_iterator, context) File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup self.model.warmup(batch, max_input_tokens, max_total_tokens) File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1587, in warmup raise RuntimeError( RuntimeError: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens 2025-03-13T08:38:25.350698Z ERROR text_generation_launcher: Method Warmup encountered an error. Traceback (most recent call last): File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1585, in warmup _, _batch, _ = self.generate_token(batch) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner return func(*args, **kwds) File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token out, speculative_logits = self.forward(batch, adapter_data) File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward logits, speculative_logits = self.model.forward( File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward hidden_states = self.text_model.model( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward hidden_states, residual = layer( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward attn_output = self.self_attn( File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 296, in forward attn_output = F.scaled_dot_product_attention( torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.52 GiB. GPU 0 has a total capacity of 79.10 GiB of which 14.37 GiB is free. Process 3342032 has 64.72 GiB memory in use. 79.10 GiB allowed; Of the allocated memory 62.19 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/src/.venv/bin/text-generation-server", line 10, in sys.exit(app()) File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in call return get_command(self)(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in call return self.main(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main return _main( File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main rv = self.invoke(ctx) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke return __callback(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper return callback(**use_params) File "/usr/src/server/text_generation_server/cli.py", line 119, in serve server.serve( File "/usr/src/server/text_generation_server/server.py", line 315, in serve asyncio.run( File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run return runner.run(main) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete self.run_forever() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever self._run_once() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once handle._run() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run self._context.run(self._callback, *self._args) File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method return await self.intercept(

File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept return await response File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor raise error File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor return await behavior(request_or_iterator, context) File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup self.model.warmup(batch, max_input_tokens, max_total_tokens) File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1587, in warmup raise RuntimeError( RuntimeError: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens 2025-03-13T08:38:25.358791Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens 2025-03-13T08:38:25.370414Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens 2025-03-13T08:38:25.381723Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens 2025-03-13T08:38:25.392642Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens Error: Backend(Warmup(Generation("Not enough memory to handle 32000 prefill tokens. 
You need to decrease --max-batch-prefill-tokens"))) 2025-03-13T08:38:25.403245Z ERROR text_generation_launcher: Webserver Crashed 2025-03-13T08:38:25.403260Z INFO text_generation_launcher: Shutting down shards 2025-03-13T08:38:25.452182Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0 2025-03-13T08:38:25.452239Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0 2025-03-13T08:38:25.459966Z INFO shard-manager: text_generation_launcher: Terminating shard rank=3 2025-03-13T08:38:25.462190Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=3 2025-03-13T08:38:25.481703Z INFO shard-manager: text_generation_launcher: Terminating shard rank=1 2025-03-13T08:38:25.481742Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1 2025-03-13T08:38:25.488581Z INFO shard-manager: text_generation_launcher: Terminating shard rank=2 2025-03-13T08:38:25.488620Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=2 2025-03-13T08:38:25.862773Z INFO shard-manager: text_generation_launcher: shard terminated rank=3 2025-03-13T08:38:27.053688Z INFO shard-manager: text_generation_launcher: shard terminated rank=0 2025-03-13T08:38:27.290200Z INFO shard-manager: text_generation_launcher: shard terminated rank=2 2025-03-13T08:38:27.583555Z INFO shard-manager: text_generation_launcher: shard terminated rank=1 Error: WebserverFailed

nskpro-cmd · Mar 13 '25

@Narsil could you look at this, please?

nskpro-cmd · Mar 13 '25

Thanks for reporting. One of the prefill code paths still uses Torch scaled dot product attention (rather than flashattention/flashinfer), requiring a padded representation of the key/value. 30 GiB is about the size of the key-value representations of a 64k sequence in the 27B model.

danieldk · Mar 13 '25
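
As a back-of-envelope check of the figure above (a sketch for illustration only; the shape constants come from the published Gemma 3 27B config, 62 hidden layers, 16 KV heads, head dim 128, bfloat16, and are assumptions of this sketch rather than values reported in this thread):

```python
# Rough, hedged estimate only. Shape constants below are taken from the public
# Gemma 3 27B config and are assumptions of this sketch, not values from the logs.
num_layers = 62      # num_hidden_layers
num_kv_heads = 16    # num_key_value_heads
head_dim = 128
dtype_bytes = 2      # bfloat16
seq_len = 64_000     # the requested --max-total-tokens

# Padded key + value tensors across all layers for a single 64k sequence.
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len
print(f"{kv_bytes / 2**30:.1f} GiB")  # ~30.3 GiB, the same order as the 30.52 GiB OOM above
```

The flashattention/flashinfer paths avoid materializing padded tensors of this size, which is why the rest of the thread is about switching this prefill path over to them.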

So does this need to be fixed on the code side (the TGI server), or on my deployment side?

nskpro-cmd · Mar 13 '25

@danieldk Is there anything the user can do to resolve this right now, or is this a problem that requires a code fix?

calycekr · Mar 14 '25

Needs a fix on our side.

danieldk · Mar 17 '25

@danieldk Thanks for the confirmation.

nskpro-cmd · Mar 17 '25

Any update on this?

doc · Apr 03 '25

Anything new with this?

n-imas · Apr 09 '25

Is this likely to be resolved anytime soon? @danieldk, if nobody is working on it, would you be able to point me in the direction of the code path that needs to be switched over from Torch SDPA to FA/FI? Thanks.

olliestanley · Apr 11 '25
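
For anyone who wants to dig in while waiting: the failing call in the traceback above is the padded F.scaled_dot_product_attention in flash_gemma3_modeling.py, and the direction discussed here is to route that prefill path through a variable-length flash-attention kernel so that no padded key/value tensors or full attention matrices are materialized. The sketch below is only an illustration of that idea under assumed tensor layouts; it is not TGI's code and not the contents of the draft PR mentioned in the next comment. A real fix also has to preserve Gemma 3's masking (sliding-window layers and image-token attention), which is presumably why this path fell back to SDPA in the first place.

```python
# Illustrative sketch only -- not TGI's implementation. Assumes q/k/v are
# packed as (total_tokens, num_heads, head_dim) with cu_seqlens marking the
# per-request boundaries, as the flash-attn varlen kernels expect.
from flash_attn import flash_attn_varlen_func


def prefill_attention(q, k, v, cu_seqlens, max_seqlen, softmax_scale):
    # GQA is handled natively: k/v may have fewer heads than q.
    # No (seq_len x seq_len) score matrix and no padded K/V are allocated,
    # so memory stays roughly linear in the number of prefill tokens.
    return flash_attn_varlen_func(
        q,
        k,
        v,
        cu_seqlens_q=cu_seqlens,
        cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen,
        max_seqlen_k=max_seqlen,
        softmax_scale=softmax_scale,
        causal=True,
    )
```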

We have a draft PR in #3167 in case anyone wants to try.

danieldk · Apr 11 '25