text-generation-inference
Does TGI support Gemma 3 models?
Model description
I have a problem running Gemma 3 12B-it on my server, which has two GPUs (Quadro RTX 8000). When I launch the model with Docker, I hit the error `window_size_left is only available with flash attn v2`. This is the command I use to run the model:

`docker run -itd --gpus all -p 8090:80 -v /MODEL_PATH/models--google--gemma-3-12b-it:/models ghcr.io/huggingface/text-generation-inference:3.2.0 --model-id /models --trust-remote-code`

- `MODEL_PATH` is my local path.
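For readability, here is the same command split across lines (nothing changed, only line continuations added):

```bash
docker run -itd --gpus all \
  -p 8090:80 \
  -v /MODEL_PATH/models--google--gemma-3-12b-it:/models \
  ghcr.io/huggingface/text-generation-inference:3.2.0 \
  --model-id /models \
  --trust-remote-code
```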
The full stdout/stderr output is below:
2025-03-17T07:49:06.934488Z INFO text_generation_launcher: Args {
model_id: "/models",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: true,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "66b3e932e479",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
2025-03-17T07:49:08.661952Z INFO text_generation_launcher: Disabling prefix caching because of VLM model
2025-03-17T07:49:08.661984Z INFO text_generation_launcher: Forcing attention to 'paged' because head dim is not supported by flashinfer, also disabling prefix caching
2025-03-17T07:49:08.661996Z INFO text_generation_launcher: Using attention paged - Prefix caching 0
2025-03-17T07:49:08.685972Z WARN text_generation_launcher: Unkown compute for card quadro-rtx-8000
2025-03-17T07:49:08.708377Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 8000
2025-03-17T07:49:08.708391Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-03-17T07:49:08.708397Z WARN text_generation_launcher: trust_remote_code is set. Trusting that model /models do not contain malicious code.
2025-03-17T07:49:08.708523Z INFO download: text_generation_launcher: Starting check and download process for /models
2025-03-17T07:49:13.550626Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-03-17T07:49:14.431533Z INFO download: text_generation_launcher: Successfully downloaded weights for /models
2025-03-17T07:49:14.431782Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-03-17T07:49:19.487534Z INFO text_generation_launcher: Using prefix caching = False
2025-03-17T07:49:19.487608Z INFO text_generation_launcher: Using Attention = paged
2025-03-17T07:49:24.455842Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:34.464077Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:44.536315Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:54.609719Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:55.903453Z INFO text_generation_launcher: Using prefill chunking = False
2025-03-17T07:49:56.840458Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-03-17T07:49:56.911755Z INFO shard-manager: text_generation_launcher: Shard ready in 42.466748423s rank=0
2025-03-17T07:49:56.978093Z INFO text_generation_launcher: Starting Webserver
2025-03-17T07:49:57.029434Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-03-17T07:49:57.433717Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-03-17T07:50:33.808299Z INFO text_generation_launcher: KV-cache blocks: 1262, size: 16
2025-03-17T07:50:33.912370Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2025-03-17T07:50:37.271253Z INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 20192
2025-03-17T07:50:37.271292Z INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2025-03-17T07:50:37.271298Z INFO text_generation_router: backends/v3/src/main.rs:162: Maximum input tokens defaulted to 7999
2025-03-17T07:50:37.271303Z INFO text_generation_router: backends/v3/src/main.rs:168: Maximum total tokens defaulted to 8000
2025-03-17T07:50:37.271414Z WARN text_generation_router::server: router/src/server.rs:1648: Tokenizer_config None - Some("/models/tokenizer_config.json")
2025-03-17T07:50:37.273899Z INFO text_generation_router::server: router/src/server.rs:1661: Using chat template from chat_template.json
2025-03-17T07:50:45.068606Z INFO text_generation_router::server: router/src/server.rs:1716: Using config Some(Gemma3(Gemma3 { vision_config: Gemma3VisionConfig { image_size: 896, patch_size: 14 } }))
2025-03-17T07:50:45.068693Z WARN text_generation_router::server: router/src/server.rs:1776: no pipeline tag found for model /models
2025-03-17T07:50:45.068700Z WARN text_generation_router::server: router/src/server.rs:1879: Invalid hostname, defaulting to 0.0.0.0
2025-03-17T07:50:45.310356Z INFO text_generation_router::server: router/src/server.rs:2266: Connected
2025-03-17T07:51:38.198370Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/usr/src/.venv/bin/text-generation-server", line 10, in <module>
  File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/usr/src/server/text_generation_server/server.py", line 183, in Prefill
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward
    logits, speculative_logits = self.model.forward(
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward
    hidden_states = self.text_model.model(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward
    hidden_states, residual = layer(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward
    attn_output = self.self_attn(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 253, in forward
    attn_output = attention(
  File "/usr/src/server/text_generation_server/layers/attention/cuda.py", line 295, in attention
    raise NotImplementedError(
NotImplementedError: window_size_left is only available with flash attn v2
2025-03-17T07:51:38.199268Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: window_size_left is only available with flash attn v2
2025-03-17T07:51:38.200715Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: None, return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }}:async_stream:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:546: Request failed during generation: Server error: window_size_left is only available with flash attn v2
2025-03-17T07:51:40.101659Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2025-03-17 07:49:16.391 | INFO | text_generation_server.utils.import_utils:…
…: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd(cast_inputs=torch.float16)
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Some kwargs in processor config are unused and will not have any effect: image_seq_length.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py:312: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
lengths_tensor = torch.tensor( rank=0
2025-03-17T07:51:40.192591Z ERROR text_generation_launcher: Shard 0 crashed
2025-03-17T07:51:40.192620Z INFO text_generation_launcher: Terminating webserver
2025-03-17T07:51:40.192640Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2025-03-17T07:51:40.192752Z INFO text_generation_router::server: router/src/server.rs:2363: signal received, starting graceful shutdown
2025-03-17T07:51:40.492975Z INFO text_generation_launcher: webserver terminated
2025-03-17T07:51:40.493004Z INFO text_generation_launcher: Shutting down shards
Error: ShardFailed
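For context, the crash originates in the `attention(...)` helper in `text_generation_server/layers/attention/cuda.py` (the last frame of the traceback above). Gemma 3 interleaves sliding-window attention layers, so TGI calls `attention(...)` with `window_size_left` set, and that path is only implemented on top of flash attn v2, which requires an Ampere-or-newer GPU (compute capability >= 8.0); the Quadro RTX 8000 is Turing (7.5). Below is a minimal sketch of what that guard presumably looks like, reconstructed from the error message alone; `HAS_FLASH_ATTN_V2` and the function signature are assumptions, not the actual TGI identifiers.

```python
# Hypothetical reconstruction of the guard that raises in
# text_generation_server/layers/attention/cuda.py -- not the real code.
HAS_FLASH_ATTN_V2 = False  # assumption: v2 kernels unavailable on Turing (sm75)


def attention(query, key, value, *, window_size_left: int = -1, **kwargs):
    # Gemma 3's sliding-window layers pass a finite window_size_left;
    # without the flash attn v2 kernels there is no sliding-window
    # implementation to dispatch to, so prefill aborts here.
    if window_size_left != -1 and not HAS_FLASH_ATTN_V2:
        raise NotImplementedError(
            "window_size_left is only available with flash attn v2"
        )
    # ... otherwise dispatch to flash attn v1 / paged attention,
    # which only support full (non-windowed) attention.
```

Thank you for your attention.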
Open source status
- [ ] The model implementation is available
- [ ] The model weights are available
Provide useful links for the implementation
No response