[Bug]: Llama 4 Scout Instruct does not support torch.compile
### Your current environment
- CUDA 12.6
- Python 3.12
- vllm 0.8.3
- transformers 4.5.1
- 2 x H100
### 🐛 Describe the bug
I'm running vLLM v0.8.3 and transformers v4.5.1.
I'm trying to bootstrap meta-llama/Llama-4-Scout-17B-16E-Instruct with "fp8" quantization and a 128K context length on 2 x H100, but I keep receiving the following error:
[dckr]: (VllmWorker rank=0 pid=1004) WARNING 04-06 10:20:04 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E does not support it. Please open an issue on GitHub if you want it to be supported.
[dckr]: (VllmWorker rank=1 pid=1031) WARNING 04-06 10:20:04 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E does not support it. Please open an issue on GitHub if you want it to be supported.
[dckr]: (VllmWorker rank=1 pid=1031) Process SpawnProcess-1:2:
[dckr]: CRITICAL 04-06 10:20:04 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
### Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
Can you show me the command you used to launch vllm? Thanks!
Because the weights themselves aren't fp8 quantized yet, Scout can only run with bf16 weights. However, we do support dynamic quantization of the KV cache via --kv-cache-dtype fp8. The team from Red Hat is working on post-training quantization of the models (cc @mgoin @eldarkurtic).
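For example, something along these lines (the tensor-parallel size and context length are placeholders; adjust them for your hardware):
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tensor-parallel-size NUM_GPUS --max-model-len 131072 --kv-cache-dtype fp8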
tyvm, this solved the issue. Unfortunately this also means their model card on HF isn't accurate (it claims on-the-fly quantization to fp8). It also means the model needs at least 2 x H200 to run at this precision and 128k context length; a 256k context causes a CUDA OOM when mapping the KV cache.
@rdodev Is it possible for you to share the vLLM arguments you used for your deployment?
Hello everyone,
I tried this on 4xA100 (80GB):
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tensor-parallel-size 4 --max-model-len 128000 --kv-cache-dtype fp8
and got the weird error message "__init__() got an unexpected keyword argument 'use_irope'". My vLLM setup:
- vllm ==0.8.3
- transformers == 4.51.0
Does anyone know the reason? Thanks!
I tried this on 8xRTX8000 (48GB):
vllm serve /input0/ --host 0.0.0.0 --port 8080 --dtype half --tensor-parallel-size 8 --gpu-memory-utilization 0.9 --max-model-len 65535 --served-model-name llama4-scout
- vllm ==0.8.3
- transformers == 4.51.0
TypeError: XFormersImpl.__init__() got an unexpected keyword argument 'use_irope'
I am facing similar problems, even though I am disabling the compile cache as the blog suggests.
Command used
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tokenizer meta-llama/Llama-4-Scout-17B-16E-Instruct --host "0.0.0.0" --port 5000 --gpu-memory-utilization 0.99 --served-model-name "Llama-4-Scout-17B-16E-Instruct" --max-num-batched-tokens 8192 --max-num-seqs 32 --max-model-len 8192 --limit-mm-per-prompt image=10 --quantization bitsandbytes --load-format bitsandbytes
Error gotten
torch.compile is turned on, but the model meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
Logs
INFO 04-08 07:27:37 [__init__.py:239] Automatically detected platform cuda.
INFO 04-08 07:27:37 [api_server.py:1034] vLLM API server version 0.8.3
INFO 04-08 07:27:37 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='meta-llama/Llama-4-Scout-17B-16E-Instruct', config='', host='0.0.0.0', port=5000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-4-Scout-17B-16E-Instruct', task='auto', tokenizer='meta-llama/Llama-4-Scout-17B-16E-Instruct', hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='bitsandbytes', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.99, num_gpu_blocks_override=None, max_num_batched_tokens=8192, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=32, max_logprobs=20, disable_log_stats=False, quantization='bitsandbytes', rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 10}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Llama-4-Scout-17B-16E-Instruct'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, 
disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x724921b96b60>)
INFO 04-08 07:27:42 [config.py:600] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
WARNING 04-08 07:27:42 [config.py:679] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-08 07:27:42 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 04-08 07:27:45 [__init__.py:239] Automatically detected platform cuda.
INFO 04-08 07:27:46 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='meta-llama/Llama-4-Scout-17B-16E-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-4-Scout-17B-16E-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Llama-4-Scout-17B-16E-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-08 07:27:46 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x716447e1d070>
INFO 04-08 07:27:46 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-08 07:27:46 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 04-08 07:27:50 [gpu_model_runner.py:1258] Starting to load model meta-llama/Llama-4-Scout-17B-16E-Instruct...
INFO 04-08 07:27:50 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
WARNING 04-08 07:27:50 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
WARNING 04-08 07:27:50 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
ERROR 04-08 07:27:50 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-08 07:27:50 [core.py:390] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 319, in __init__
ERROR 04-08 07:27:50 [core.py:390] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 67, in __init__
ERROR 04-08 07:27:50 [core.py:390] self.model_executor = executor_class(vllm_config)
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-08 07:27:50 [core.py:390] self._init_executor()
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 04-08 07:27:50 [core.py:390] self.collective_rpc("load_model")
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-08 07:27:50 [core.py:390] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2347, in run_method
ERROR 04-08 07:27:50 [core.py:390] return func(*args, **kwargs)
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 136, in load_model
ERROR 04-08 07:27:50 [core.py:390] self.model_runner.load_model()
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1261, in load_model
ERROR 04-08 07:27:50 [core.py:390] self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-08 07:27:50 [core.py:390] return loader.load_model(vllm_config=vllm_config)
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1278, in load_model
ERROR 04-08 07:27:50 [core.py:390] model = _initialize_model(vllm_config=vllm_config)
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 127, in _initialize_model
ERROR 04-08 07:27:50 [core.py:390] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama4.py", line 713, in __init__
ERROR 04-08 07:27:50 [core.py:390] self.language_model = init_vllm_registered_model(
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
ERROR 04-08 07:27:50 [core.py:390] return _initialize_model(vllm_config=vllm_config, prefix=prefix)
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 127, in _initialize_model
ERROR 04-08 07:27:50 [core.py:390] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 479, in __init__
ERROR 04-08 07:27:50 [core.py:390] LlamaForCausalLM.__init__(self,
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 486, in __init__
ERROR 04-08 07:27:50 [core.py:390] self.model = self._init_model(vllm_config=vllm_config,
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 488, in _init_model
ERROR 04-08 07:27:50 [core.py:390] return Llama4Model(vllm_config=vllm_config,
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in __init__
ERROR 04-08 07:27:50 [core.py:390] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 334, in __init__
ERROR 04-08 07:27:50 [core.py:390] super().__init__(vllm_config=vllm_config,
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in __init__
ERROR 04-08 07:27:50 [core.py:390] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 321, in __init__
ERROR 04-08 07:27:50 [core.py:390] self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 610, in make_layers
ERROR 04-08 07:27:50 [core.py:390] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 323, in <lambda>
ERROR 04-08 07:27:50 [core.py:390] lambda prefix: layer_type(config=config,
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 283, in __init__
ERROR 04-08 07:27:50 [core.py:390] self.feed_forward = Llama4MoE(
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 73, in __init__
ERROR 04-08 07:27:50 [core.py:390] self.experts = FusedMoE(
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 502, in __init__
ERROR 04-08 07:27:50 [core.py:390] assert self.quant_method is not None
ERROR 04-08 07:27:50 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] AssertionError
ERROR 04-08 07:27:50 [core.py:390]
CRITICAL 04-08 07:27:50 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed
Remove the quantization and quant format flags from the command.
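That is, keep the same arguments as above but drop --quantization bitsandbytes and --load-format bitsandbytes (the model will then load in bf16, so it needs correspondingly more GPU memory), e.g.:
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tokenizer meta-llama/Llama-4-Scout-17B-16E-Instruct --host "0.0.0.0" --port 5000 --gpu-memory-utilization 0.99 --served-model-name "Llama-4-Scout-17B-16E-Instruct" --max-num-batched-tokens 8192 --max-num-seqs 32 --max-model-len 8192 --limit-mm-per-prompt image=10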
@duyvuleo @Uhao-P For the use_irope error, my recent PR #16212 should have fixed it. Could you rebase on top of this PR?
Remove the quantization and quant format flags from the command.
If I do this, the model will be loaded in full precision, so I will need a bigger GPU. This is not a solution to the problem I am facing; vLLM should already support this quantization, right?
It is the only solution until quantized weights are released. It's not a vLLM issue; it's a Llama 4 issue.
Here is FP8 quantized Llama-4-Scout model: https://huggingface.co/RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
Make sure to be on the latest version of vLLM, since there were some recent bug fixes for the Scout model specifically (e.g., ec7da6fcf32fc05efe5d7ba30d01d3d940f12a3c).
Facing the same issue using this command:
CUDA_VISIBLE_DEVICE=2,3,4,5 VLLM_DISABLE_COMPILE_CACHE=1 vllm serve /model/Llama-4-Scout-17B-16E-Instruct/ --device cuda --override-generation-config='{"attn_temperature_tuning": true}' --max-model-len 524288 --tensor-parallel-size 4 --host 0.0.0.0 --port 8006
Specs: 4 x A100 (80GB). vLLM version: v0.8.3
Here is FP8 quantized Llama-4-Scout model: https://huggingface.co/RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
Hello, were you able to get this working? If so, what was your vllm server command?
For example:
vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic -tp NUM_GPUS --max-model-len 16384
works fine on my end (make sure to use the latest vLLM).
@rdodev could you confirm the issue is resolved on v0.8.5?
@yeqcharlotte the issue has been resolved.