
Does TGI support Gemma 3 models?

Open · Behnamhb opened this issue 7 months ago · 1 comment

Model description

I have a problem running Gemma 3 12B-it on my server. I have two GPUs (Quadro RTX 8000). When I try to run the model on the server with Docker, I get this error: "window_size_left is only available with flash attn v2". This is my command to run the model (a sketch of the same launch with explicit sharding is included below):

docker run -itd --gpus all -p 8090:80 -v /MODEL_PATH/models--google--gemma-3-12b-it:/models ghcr.io/huggingface/text-generation-inference:3.2.0 --model-id /models --trust-remote-code

  • MODEL_PATH is my local path.
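For context, here is a sketch of the same launch with the sharding across the two GPUs and the sequence-length limits spelled out explicitly. The --num-shard, --max-input-tokens, and --max-total-tokens flags correspond to the num_shard, max_input_tokens, and max_total_tokens fields in the launcher Args dump below; the values here are assumptions for illustration, not settings I have verified:

```bash
# Sketch only: same image and model mount as above, but with explicit sharding
# across both Quadro RTX 8000 cards and explicit token limits.
# The 4096/8192 values are assumptions, not verified settings.
docker run -itd --gpus all -p 8090:80 \
  -v /MODEL_PATH/models--google--gemma-3-12b-it:/models \
  ghcr.io/huggingface/text-generation-inference:3.2.0 \
  --model-id /models \
  --num-shard 2 \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --trust-remote-code
```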

The full stdout log is here:

2025-03-17T07:49:06.934488Z INFO text_generation_launcher: Args { model_id: "/models", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, kv_cache_dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "66b3e932e479", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], api_key: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, usage_stats: On, payload_limit: 2000000, enable_prefill_logprobs: false, }
2025-03-17T07:49:08.661952Z INFO text_generation_launcher: Disabling prefix caching because of VLM model
2025-03-17T07:49:08.661984Z INFO text_generation_launcher: Forcing attention to 'paged' because head dim is not supported by flashinfer, also disabling prefix caching
2025-03-17T07:49:08.661996Z INFO text_generation_launcher: Using attention paged - Prefix caching 0
2025-03-17T07:49:08.685972Z WARN text_generation_launcher: Unkown compute for card quadro-rtx-8000
2025-03-17T07:49:08.708377Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 8000
2025-03-17T07:49:08.708391Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-03-17T07:49:08.708397Z WARN text_generation_launcher: trust_remote_code is set. Trusting that model /models do not contain malicious code.
2025-03-17T07:49:08.708523Z INFO download: text_generation_launcher: Starting check and download process for /models
2025-03-17T07:49:13.550626Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-03-17T07:49:14.431533Z INFO download: text_generation_launcher: Successfully downloaded weights for /models
2025-03-17T07:49:14.431782Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-03-17T07:49:19.487534Z INFO text_generation_launcher: Using prefix caching = False
2025-03-17T07:49:19.487608Z INFO text_generation_launcher: Using Attention = paged
2025-03-17T07:49:24.455842Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:34.464077Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:44.536315Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:54.609719Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:55.903453Z INFO text_generation_launcher: Using prefill chunking = False
2025-03-17T07:49:56.840458Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-03-17T07:49:56.911755Z INFO shard-manager: text_generation_launcher: Shard ready in 42.466748423s rank=0
2025-03-17T07:49:56.978093Z INFO text_generation_launcher: Starting Webserver
2025-03-17T07:49:57.029434Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-03-17T07:49:57.433717Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-03-17T07:50:33.808299Z INFO text_generation_launcher: KV-cache blocks: 1262, size: 16
2025-03-17T07:50:33.912370Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2025-03-17T07:50:37.271253Z INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 20192
2025-03-17T07:50:37.271292Z INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2025-03-17T07:50:37.271298Z INFO text_generation_router: backends/v3/src/main.rs:162: Maximum input tokens defaulted to 7999
2025-03-17T07:50:37.271303Z INFO text_generation_router: backends/v3/src/main.rs:168: Maximum total tokens defaulted to 8000
2025-03-17T07:50:37.271414Z WARN text_generation_router::server: router/src/server.rs:1648: Tokenizer_config None - Some("/models/tokenizer_config.json")
2025-03-17T07:50:37.273899Z INFO text_generation_router::server: router/src/server.rs:1661: Using chat template from chat_template.json
2025-03-17T07:50:45.068606Z INFO text_generation_router::server: router/src/server.rs:1716: Using config Some(Gemma3(Gemma3 { vision_config: Gemma3VisionConfig { image_size: 896, patch_size: 14 } }))
2025-03-17T07:50:45.068693Z WARN text_generation_router::server: router/src/server.rs:1776: no pipeline tag found for model /models
2025-03-17T07:50:45.068700Z WARN text_generation_router::server: router/src/server.rs:1879: Invalid hostname, defaulting to 0.0.0.0
2025-03-17T07:50:45.310356Z INFO text_generation_router::server: router/src/server.rs:2266: Connected
2025-03-17T07:51:38.198370Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/usr/src/.venv/bin/text-generation-server", line 10, in sys.exit(app())
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in call return get_command(self)(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in call return self.main(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main return _main(
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main rv = self.invoke(ctx)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke return ctx.invoke(self.callback, **ctx.params)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke return __callback(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper return callback(**use_params)
  File "/usr/src/server/text_generation_server/cli.py", line 119, in serve server.serve(
  File "/usr/src/server/text_generation_server/server.py", line 315, in serve asyncio.run(
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run return runner.run(main)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete self.run_forever()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever self._run_once()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once handle._run()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run self._context.run(self._callback, *self._args)
  File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method return await self.intercept(
  File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept return await response
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor raise error
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor return await behavior(request_or_iterator, context)
  File "/usr/src/server/text_generation_server/server.py", line 183, in Prefill generations, next_batch, timings = self.model.generate_token(batch)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner return func(*args, **kwds)
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token out, speculative_logits = self.forward(batch, adapter_data)
  File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward logits, speculative_logits = self.model.forward(
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward hidden_states = self.text_model.model(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward hidden_states, residual = layer(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward attn_output = self.self_attn(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 253, in forward attn_output = attention(
  File "/usr/src/server/text_generation_server/layers/attention/cuda.py", line 295, in attention raise NotImplementedError(
NotImplementedError: window_size_left is only available with flash attn v2
2025-03-17T07:51:38.199268Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: window_size_left is only available with flash attn v2
2025-03-17T07:51:38.200715Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: None, return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }}:async_stream:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:546: Request failed during generation: Server error: window_size_left is only available with flash attn v2
2025-03-17T07:51:40.101659Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2025-03-17 07:49:16.391 | INFO | text_generation_server.utils.import_utils::80 - Detected system cuda
/usr/src/server/text_generation_server/layers/gptq/triton.py:242: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead. @custom_fwd(cast_inputs=torch.float16)
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead. @custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead. @custom_bwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead. @custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead. @custom_bwd
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Some kwargs in processor config are unused and will not have any effect: image_seq_length.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py:312: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor). lengths_tensor = torch.tensor(
rank=0
2025-03-17T07:51:40.192591Z ERROR text_generation_launcher: Shard 0 crashed
2025-03-17T07:51:40.192620Z INFO text_generation_launcher: Terminating webserver
2025-03-17T07:51:40.192640Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2025-03-17T07:51:40.192752Z INFO text_generation_router::server: router/src/server.rs:2363: signal received, starting graceful shutdown
2025-03-17T07:51:40.492975Z INFO text_generation_launcher: webserver terminated
2025-03-17T07:51:40.493004Z INFO text_generation_launcher: Shutting down shards
Error: ShardFailed

Thank you for your attention.
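One more detail in case it helps reproduce this: the crash happens as soon as the first generation request reaches the server. Below is a minimal sketch of the kind of request I mean; the endpoint and JSON shape follow the standard TGI /generate_stream API, while the prompt text and max_new_tokens value are placeholders rather than the exact request I sent:

```bash
# Sketch of a request that triggers the Prefill error above.
# Port 8090 matches the -p 8090:80 mapping in the docker command;
# the prompt and max_new_tokens are placeholder values.
curl -N http://localhost:8090/generate_stream \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello, who are you?", "parameters": {"max_new_tokens": 64}}'
```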

Open source status

  • [ ] The model implementation is available
  • [ ] The model weights are available

Provide useful links for the implementation

No response

Behnamhb · Mar 17 '25 08:03