[Bug]: 5090 gemma-3-12b-it using FP8/INT8/FP16 quantization for concurrent requests (Docker)

Open lukaLLM opened this issue 8 months ago • 0 comments

Your current environment

I use Docker only, to host the OpenAI-compatible server.

🐛 Describe the bug

I am trying to use the vLLM Docker image as my backend to run Gemma 3 models with concurrency and dynamic batching.
My setup is Ubuntu 22.04 with NVIDIA driver 570.153.02 (CUDA Version: 12.8). I know I should update to CUDA 12.9 and Ubuntu 24.04, but so far everything has worked; for example, I have used the faster-whisper Docker image, llama servers, etc.

I followed these issues: https://github.com/vllm-project/vllm/issues/17587 https://github.com/vllm-project/vllm/pull/14766 https://github.com/vllm-project/vllm/issues/14452

The only thing that seemed to work for me was hongbo-miao's solution from issue 14452. Server:

docker run --gpus=all \
  --volume="$HOME/.cache/huggingface:/root/.cache/huggingface" \
  --publish=8000:8000 \
  nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model=Qwen/Qwen2.5-0.5B-Instruct \
    --port=8000 \
    --gpu-memory-utilization=0.75 \
    --max_model_len=8192 \
    --tensor-parallel-size=1 \
    --max_num_seqs=128 \
    --enforce-eager

Client:

curl http://localhost:8000/v1/chat/completions \
  --header "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me a joke."}
    ]
  }'

which returns

{
  "id": "chat-260127bb79b74e3786b810ffa6f592ed",
  "object": "chat.completion",
  "created": 1749898826,
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Sure! Here's one for you: Why did the tomato turn red? Because it saw the salad dressing!",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "total_tokens": 47,
    "completion_tokens": 23
  },
  "prompt_logprobs": null
}
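Since the goal is concurrent requests, the same call can also be issued in parallel from Python. This is only a minimal sketch using the standard library; the helper names (`build_payload`, `ask`, `ask_many`) and the worker count are my own assumptions, while the URL, model name and message format are taken from the curl call above.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Endpoint and model taken from the curl example; adjust for other models.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"


def build_payload(prompt: str, model: str = MODEL) -> dict:
    """Build the same request body as the curl example, with a custom user prompt."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }


def ask(prompt: str, url: str = BASE_URL, timeout: float = 120.0) -> str:
    """POST one chat completion request and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def ask_many(prompts: list, workers: int = 8) -> list:
    """Send the prompts from parallel threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))


if __name__ == "__main__":
    # Requires the server from the docker command above to be running.
    for reply in ask_many(["Tell me a joke."] * 4):
        print(reply)
```

Threads are enough here because each request just waits on the network; the batching itself happens on the server side.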

This works on my PC, but not when I try models such as MISHANM/google-gemma-3-12b-it-fp8, JamAndTeaStudios/gemma-3-12b-it-FP8-Dynamic, or RedHatAI/gemma-3-12b-it-FP8-dynamic.

For example, when I run

docker run --gpus=all \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -p 8000:8000 \
  --name gemma3_12b_fp8 \
  tritonserver-bnb \
  python3 -m vllm.entrypoints.openai.api_server \
    --model=RedHatAI/gemma-3-12b-it-FP8-dynamic \
    --port=8000 \
    --gpu-memory-utilization=0.90 \
    --max_model_len=8192 \
    --tensor-parallel-size=1 \
    --max_num_seqs=128 \
    --enforce-eager

I get this error:

============================= == Triton Inference Server ==

NVIDIA Release 25.05 (build 172940304) Triton Server Version 2.58.0

Copyright (c) 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the Product-Specific Terms for NVIDIA AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).

WARNING: CUDA Minor Version Compatibility mode ENABLED. Using driver version 570.153.02 which has support for CUDA 12.8. This container was built with CUDA 12.9 and will be run in Minor Version Compatibility mode. CUDA Forward Compatibility is preferred over Minor Version Compatibility for use with this container but was unavailable: [[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]] See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

INFO 06-19 13:40:04 [init.py:239] Automatically detected platform cuda. INFO 06-19 13:40:05 [api_server.py:1034] vLLM API server version 0.8.4+dc1a3e10.nv25.05 INFO 06-19 13:40:05 [api_server.py:1035] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='RedHatAI/gemma-3-12b-it-FP8-dynamic', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, 
max_num_seqs=128, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False) INFO 06-19 13:40:10 [config.py:689] This model supports multiple tasks: {'score', 'reward', 'classify', 'generate', 'embed'}. Defaulting to 'generate'. INFO 06-19 13:40:10 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048. WARNING 06-19 13:40:10 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. 
Since, enforce-eager is enabled, async output processor cannot be used INFO 06-19 13:40:13 [init.py:239] Automatically detected platform cuda. INFO 06-19 13:40:15 [core.py:61] Initializing a V1 LLM engine (v0.8.4+dc1a3e10.nv25.05) with config: model='RedHatAI/gemma-3-12b-it-FP8-dynamic', speculative_config=None, tokenizer='RedHatAI/gemma-3-12b-it-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=RedHatAI/gemma-3-12b-it-FP8-dynamic, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0} 2025-06-19 13:40:15,592 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend WARNING 06-19 13:40:15 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7a9b659bf4d0> [W619 13:40:16.689999241 ProcessGroupNCCL.cpp:959] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator()) [Gloo] Rank 0 is connected to 0 peer ranks. 
Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 INFO 06-19 13:40:16 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0 INFO 06-19 13:40:16 [cuda.py:221] Using Flash Attention backend on V1 engine. Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False. INFO 06-19 13:40:20 [gpu_model_runner.py:1276] Starting to load model RedHatAI/gemma-3-12b-it-FP8-dynamic... INFO 06-19 13:40:20 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config [] INFO 06-19 13:40:20 [topk_topp_sampler.py:44] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling. INFO 06-19 13:40:21 [weight_utils.py:265] Using model weights format ['.safetensors'] Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:02<00:04, 2.12s/it] Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:04<00:02, 2.14s/it] Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.88s/it] Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.95s/it]

INFO 06-19 13:40:27 [loader.py:458] Loading weights took 5.92 seconds INFO 06-19 13:40:27 [gpu_model_runner.py:1291] Model loading took 13.2955 GiB and 6.692338 seconds INFO 06-19 13:40:27 [gpu_model_runner.py:1560] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size. ERROR 06-19 13:40:30 [core.py:387] EngineCore hit an exception: Traceback (most recent call last): ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 378, in run_engine_core ERROR 06-19 13:40:30 [core.py:387] engine_core = EngineCoreProc(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 320, in init ERROR 06-19 13:40:30 [core.py:387] super().init(vllm_config, executor_class, log_stats) ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in init ERROR 06-19 13:40:30 [core.py:387] self._initialize_kv_caches(vllm_config) ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches ERROR 06-19 13:40:30 [core.py:387] available_gpu_memory = self.model_executor.determine_available_memory() ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory ERROR 06-19 13:40:30 [core.py:387] output = self.collective_rpc("determine_available_memory") ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc ERROR 06-19 13:40:30 [core.py:387] answer = 
run_method(self.driver_worker, method, args, kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2378, in run_method ERROR 06-19 13:40:30 [core.py:387] return func(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context ERROR 06-19 13:40:30 [core.py:387] return func(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory ERROR 06-19 13:40:30 [core.py:387] self.model_runner.profile_run() ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1591, in profile_run ERROR 06-19 13:40:30 [core.py:387] hidden_states = self._dummy_run(self.max_num_tokens) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context ERROR 06-19 13:40:30 [core.py:387] return func(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1441, in _dummy_run ERROR 06-19 13:40:30 [core.py:387] hidden_states = model( ERROR 06-19 13:40:30 [core.py:387] ^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl ERROR 06-19 13:40:30 [core.py:387] return self._call_impl(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File 
"/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl ERROR 06-19 13:40:30 [core.py:387] return forward_call(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 630, in forward ERROR 06-19 13:40:30 [core.py:387] hidden_states = self.language_model.model(input_ids, ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in call ERROR 06-19 13:40:30 [core.py:387] return self.forward(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 400, in forward ERROR 06-19 13:40:30 [core.py:387] hidden_states, residual = layer( ERROR 06-19 13:40:30 [core.py:387] ^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl ERROR 06-19 13:40:30 [core.py:387] return self._call_impl(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl ERROR 06-19 13:40:30 [core.py:387] return forward_call(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 329, in forward ERROR 06-19 13:40:30 [core.py:387] hidden_states = self.self_attn( ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in 
_wrapped_call_impl ERROR 06-19 13:40:30 [core.py:387] return self._call_impl(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl ERROR 06-19 13:40:30 [core.py:387] return forward_call(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 188, in forward ERROR 06-19 13:40:30 [core.py:387] qkv, _ = self.qkv_proj(hidden_states) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl ERROR 06-19 13:40:30 [core.py:387] return self._call_impl(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in call_impl ERROR 06-19 13:40:30 [core.py:387] return forward_call(*args, **kwargs) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 474, in forward ERROR 06-19 13:40:30 [core.py:387] output_parallel = self.quant_method.apply(self, input, bias) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 580, in apply ERROR 06-19 13:40:30 [core.py:387] return scheme.apply_weights(layer, x, bias=bias) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File 
"/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 144, in apply_weights ERROR 06-19 13:40:30 [core.py:387] return self.fp8_linear.apply(input=x, ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 200, in apply ERROR 06-19 13:40:30 [core.py:387] output = ops.cutlass_scaled_mm(qinput, ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 568, in cutlass_scaled_mm ERROR 06-19 13:40:30 [core.py:387] torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias) ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in call ERROR 06-19 13:40:30 [core.py:387] return self._op(*args, **(kwargs or {})) ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 06-19 13:40:30 [core.py:387] RuntimeError: Error Internal ERROR 06-19 13:40:30 [core.py:387] CRITICAL 06-19 13:40:30 [core_client.py:359] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue. 
Traceback (most recent call last): File "", line 198, in _run_module_as_main File "", line 88, in _run_code File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in uvloop.run(run_server(args)) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 109, in run return __asyncio.run( ^^^^^^^^^^^^^^ File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run return runner.run(main) ^^^^^^^^^^^^^^^^ File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 61, in wrapper return await main ^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.12/contextlib.py", line 210, in aenter return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib/python3.12/contextlib.py", line 210, in aenter return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args async_llm = AsyncLLM.from_vllm_config( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config return cls( ^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 102, in init self.engine_core = EngineCoreClient.make_client( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 71, in make_client return 
AsyncMPClient(vllm_config, executor_class, log_stats) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 604, in init super().init( File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 404, in init self._wait_for_engine_startup() File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 426, in _wait_for_engine_startup raise RuntimeError("Engine core initialization failed. " RuntimeError: Engine core initialization failed. See root cause above.
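The log above is long, and the actual failure is a single entry near the end of the EngineCore traceback. As a rough sketch for digging it out of a saved log, the snippet below splits the text on the `ERROR <date> <time> [file:line]` prefix and returns the first entry that looks like an exception. The function name `root_cause` and both regular expressions are my own assumptions, written against the log format shown here rather than any vLLM API.

```python
import re
from typing import Optional

# Prefix in front of each traceback entry in the log above,
# e.g. "ERROR 06-19 13:40:30 [core.py:387] ".
LOG_PREFIX = re.compile(r"ERROR \d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[^\]]+\] ?")

# An entry that starts with an exception name, e.g. "RuntimeError: Error Internal".
EXCEPTION_LINE = re.compile(r"^(\w+(?:\.\w+)*(?:Error|Exception)): (.*)")


def root_cause(log_text: str) -> Optional[str]:
    """Return the first 'SomeError: message' entry found in the engine log."""
    # The prefix doubles as a record separator, so this also works when the
    # log's newlines were lost in copy and paste.
    for record in LOG_PREFIX.split(log_text):
        match = EXCEPTION_LINE.match(record.strip())
        if match:
            return f"{match.group(1)}: {match.group(2)}"
    return None
```

Applied to the log above, this should pick out the `RuntimeError: Error Internal` entry raised from `cutlass_scaled_mm`.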

When I run RedHatAI/gemma-3-12b-it-quantized.w8a8, I get:

RuntimeError: Currently, only fp8 gemm is implemented for Blackwell
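Because this message is about the GPU architecture, it may help to confirm what the container actually sees. The sketch below is an assumption on my side, not something from the vLLM docs: it uses `torch.cuda.get_device_capability` and formats the result as an `sm_XY` name. The helper names are hypothetical, and as far as I know an RTX 5090 should report compute capability 12.0.

```python
from typing import Tuple


def describe_capability(major: int, minor: int) -> str:
    """Format a CUDA compute capability as the 'sm_XY' name used for kernel build targets."""
    return f"sm_{major}{minor}"


def local_gpu_capability() -> Tuple[int, int]:
    """Query the first visible GPU; needs PyTorch with CUDA, e.g. inside the container."""
    import torch  # imported lazily so describe_capability works without PyTorch

    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    # Run this inside the same image that is used for the vLLM server.
    print(describe_capability(*local_gpu_capability()))
```

Running it inside the container shows which architecture the INT8 and FP8 kernels would need to be built for.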

For RedHatAI/gemma-3-12b-it-FP8-dynamic, I get:

WARNING: CUDA Minor Version Compatibility mode ENABLED. Using driver version 570.153.02 which has support for CUDA 12.8. This container was built with CUDA 12.9 and will be run in Minor Version Compatibility mode. CUDA Forward Compatibility is preferred over Minor Version Compatibility for use with this container but was unavailable: [[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]] See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

INFO 06-19 13:46:11 [init.py:239] Automatically detected platform cuda. INFO 06-19 13:46:11 [api_server.py:1034] vLLM API server version 0.8.4+dc1a3e10.nv25.05 INFO 06-19 13:46:11 [api_server.py:1035] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='RedHatAI/gemma-3-12b-it-FP8-dynamic', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, 
max_num_seqs=128, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False) INFO 06-19 13:46:16 [config.py:689] This model supports multiple tasks: {'score', 'generate', 'embed', 'classify', 'reward'}. Defaulting to 'generate'. INFO 06-19 13:46:17 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048. WARNING 06-19 13:46:17 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. 
Since, enforce-eager is enabled, async output processor cannot be used INFO 06-19 13:46:20 [init.py:239] Automatically detected platform cuda. INFO 06-19 13:46:21 [core.py:61] Initializing a V1 LLM engine (v0.8.4+dc1a3e10.nv25.05) with config: model='RedHatAI/gemma-3-12b-it-FP8-dynamic', speculative_config=None, tokenizer='RedHatAI/gemma-3-12b-it-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=RedHatAI/gemma-3-12b-it-FP8-dynamic, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0} 2025-06-19 13:46:21,834 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend WARNING 06-19 13:46:22 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x773ce02bdee0> [W619 13:46:22.908698114 ProcessGroupNCCL.cpp:959] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator()) [Gloo] Rank 0 is connected to 0 peer ranks. 
Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 INFO 06-19 13:46:22 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0 INFO 06-19 13:46:22 [cuda.py:221] Using Flash Attention backend on V1 engine. Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False. INFO 06-19 13:46:26 [gpu_model_runner.py:1276] Starting to load model RedHatAI/gemma-3-12b-it-FP8-dynamic... INFO 06-19 13:46:26 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config [] INFO 06-19 13:46:26 [topk_topp_sampler.py:44] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling. INFO 06-19 13:46:26 [weight_utils.py:265] Using model weights format ['.safetensors'] Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.71it/s] Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.52it/s] Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 1.68it/s] Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 1.65it/s]

INFO 06-19 13:46:29 [loader.py:458] Loading weights took 1.89 seconds
INFO 06-19 13:46:29 [gpu_model_runner.py:1291] Model loading took 13.2955 GiB and 2.631252 seconds
INFO 06-19 13:46:29 [gpu_model_runner.py:1560] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
ERROR 06-19 13:46:31 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 06-19 13:46:31 [core.py:387] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 320, in __init__
ERROR 06-19 13:46:31 [core.py:387] super().__init__(vllm_config, executor_class, log_stats)
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 06-19 13:46:31 [core.py:387] self._initialize_kv_caches(vllm_config)
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
ERROR 06-19 13:46:31 [core.py:387] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 06-19 13:46:31 [core.py:387] output = self.collective_rpc("determine_available_memory")
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 06-19 13:46:31 [core.py:387] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2378, in run_method
ERROR 06-19 13:46:31 [core.py:387] return func(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 13:46:31 [core.py:387] return func(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
ERROR 06-19 13:46:31 [core.py:387] self.model_runner.profile_run()
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1591, in profile_run
ERROR 06-19 13:46:31 [core.py:387] hidden_states = self._dummy_run(self.max_num_tokens)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 13:46:31 [core.py:387] return func(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1441, in _dummy_run
ERROR 06-19 13:46:31 [core.py:387] hidden_states = model(
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:46:31 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:46:31 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 630, in forward
ERROR 06-19 13:46:31 [core.py:387] hidden_states = self.language_model.model(input_ids,
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 06-19 13:46:31 [core.py:387] return self.forward(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 400, in forward
ERROR 06-19 13:46:31 [core.py:387] hidden_states, residual = layer(
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:46:31 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:46:31 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 329, in forward
ERROR 06-19 13:46:31 [core.py:387] hidden_states = self.self_attn(
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:46:31 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:46:31 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 188, in forward
ERROR 06-19 13:46:31 [core.py:387] qkv, _ = self.qkv_proj(hidden_states)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:46:31 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:46:31 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 474, in forward
ERROR 06-19 13:46:31 [core.py:387] output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 580, in apply
ERROR 06-19 13:46:31 [core.py:387] return scheme.apply_weights(layer, x, bias=bias)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 144, in apply_weights
ERROR 06-19 13:46:31 [core.py:387] return self.fp8_linear.apply(input=x,
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 200, in apply
ERROR 06-19 13:46:31 [core.py:387] output = ops.cutlass_scaled_mm(qinput,
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 568, in cutlass_scaled_mm
ERROR 06-19 13:46:31 [core.py:387] torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
ERROR 06-19 13:46:31 [core.py:387] return self._op(*args, **(kwargs or {}))
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] RuntimeError: Error Internal
ERROR 06-19 13:46:31 [core.py:387]
CRITICAL 06-19 13:46:31 [core_client.py:359] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 102, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 71, in make_client
return AsyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 604, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 404, in __init__
self._wait_for_engine_startup()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 426, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
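
The second traceback only says "See root cause above", and the real failure is buried in the prefixed ERROR lines. Here is a minimal sketch for pulling it out of a raw log with one record per line (for example, output saved from `docker logs`). It assumes the `LEVEL MM-DD HH:MM:SS [file:line] message` prefix seen in this log; `find_root_cause` and the two helper patterns are hypothetical names for illustration, not part of vLLM.

```python
import re

# Assumed record prefix, taken from the log above, e.g.
#   ERROR 06-19 13:46:31 [core.py:387] <message>
PREFIX = re.compile(
    r"^(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL) "
    r"\d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"\[(?P<src>[^\]]+)\] ?(?P<msg>.*)$"
)
# A traceback frame line: File "<path>", line <n>, in <function>
FRAME = re.compile(r'^File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')
# The final exception line, e.g. "RuntimeError: Error Internal"
EXC = re.compile(r"^(?P<type>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<text>.*)$")


def find_root_cause(log_text):
    """Return (exception_line, innermost_frame) from the ERROR-level records.

    innermost_frame is (path, line_number, function) of the last frame seen
    before the exception line, or None if no frame was found.
    """
    last_frame = None
    for raw in log_text.splitlines():
        record = PREFIX.match(raw.strip())
        if not record or record.group("level") != "ERROR":
            continue  # skip INFO/WARNING records and unprefixed lines
        msg = record.group("msg").strip()
        frame = FRAME.match(msg)
        if frame:
            last_frame = (frame.group("path"), int(frame.group("line")), frame.group("func"))
            continue
        exc = EXC.match(msg)
        if exc:
            return f"{exc.group('type')}: {exc.group('text')}", last_frame
    return None, last_frame


sample = """\
ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 568, in cutlass_scaled_mm
ERROR 06-19 13:46:31 [core.py:387]     torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
ERROR 06-19 13:46:31 [core.py:387] RuntimeError: Error Internal
"""
exception, frame = find_root_cause(sample)
print(exception)
print(frame)
```

On the three sample records this reports `RuntimeError: Error Internal` with `cutlass_scaled_mm` in `_custom_ops.py` as the innermost frame. In the log above that call is reached through the FP8 linear path (`compressed_tensors_w8a8_fp8.py`).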

Is there anything I can do? I am trying to build an updated image as described in these issues, but the build fails with errors and takes a long time.

Before submitting a new issue...

[x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Jun 19 '25 13:06 lukaLLM