
[Bug]: llama 4 scout instruct does not support torch.compile

Open rdodev opened this issue 8 months ago • 16 comments

Your current environment

  • CUDA 12.6
  • Python 3.12
  • vllm 0.8.3
  • transformers 4.5.1
  • 2 x H100

🐛 Describe the bug

I'm running vLLM v0.8.3 and transformers v4.5.1.

I'm trying to bootstrap meta-llama/Llama-4-Scout-17B-16E-Instruct with "fp8" quantization and a 128K context length on 2 x H100, and I keep receiving the following error:

[dckr]: (VllmWorker rank=0 pid=1004) WARNING 04-06 10:20:04 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E does not support it. Please open an issue on GitHub if you want it to be supported.
[dckr]: (VllmWorker rank=0 pid=1004) WARNING 04-06 10:20:04 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E does not support it. Please open an issue on GitHub if you want it to be supported.
[dckr]: (VllmWorker rank=1 pid=1031) WARNING 04-06 10:20:04 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E does not support it. Please open an issue on GitHub if you want it to be supported.
[dckr]: (VllmWorker rank=1 pid=1031) WARNING 04-06 10:20:04 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E does not support it. Please open an issue on GitHub if you want it to be supported.
[dckr]: (VllmWorker rank=1 pid=1031) Process SpawnProcess-1:2:
[dckr]: CRITICAL 04-06 10:20:04 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

rdodev avatar Apr 06 '25 19:04 rdodev

Can you show me the command you used to launch vllm? Thanks!

wukaixingxp avatar Apr 06 '25 21:04 wukaixingxp

Because the weights themselves aren't fp8-quantized yet, Scout can only run with bf16 weights. However, we do support dynamic quantization of the KV cache via --kv-cache-dtype fp8. The team from Red Hat is working on post-training quantization of the models (cc @mgoin @eldarkurtic).
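
For reference, a minimal sketch of a launch along those lines (bf16 weights plus a dynamically fp8-quantized KV cache); the tensor-parallel size and context length below are illustrative and assume enough aggregate GPU memory for the bf16 weights:

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tensor-parallel-size 8 --max-model-len 131072 --kv-cache-dtype fp8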

simon-mo avatar Apr 06 '25 22:04 simon-mo

Because the weights themselves aren't fp8-quantized yet, Scout can only run with bf16 weights. However, we do support dynamic quantization of the KV cache via --kv-cache-dtype fp8. The team from Red Hat is working on post-training quantization of the models (cc @mgoin @eldarkurtic).

tyvm, this solved the issue. Unfortunately, it also means the model card on HF isn't accurate (it claims on-the-fly quantization to fp8). It also means you need at least 2 x H200 to run at this precision with a 128K context length; a 256K context causes a CUDA OOM when allocating the KV cache.
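
For a rough sense of the scale, a back-of-the-envelope KV-cache estimate; the layer count, KV-head count, and head dimension below are assumed values, so check them against the model's config.json:

# per-sequence KV cache ≈ 2 (K and V) x layers x kv_heads x head_dim x bytes/elem x tokens
python3 -c "layers, kv_heads, head_dim, tokens = 48, 8, 128, 131072; print(2 * layers * kv_heads * head_dim * 2 * tokens / 2**30, 'GiB per 128K-token sequence at bf16')"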

rdodev avatar Apr 07 '25 00:04 rdodev

@rdodev could you share the vLLM arguments you used for your deployment?

rabaja avatar Apr 07 '25 16:04 rabaja

Hello everyone,

I tried this on 4xA100 (80GB):

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tensor-parallel-size 4 --max-model-len 128000 --kv-cache-dtype fp8

and got this odd error message: "__init__() got an unexpected keyword argument 'use_irope'". My vLLM setup:

  • vllm ==0.8.3
  • transformers == 4.51.0

Does anyone know the reason? Thanks!

duyvuleo avatar Apr 08 '25 06:04 duyvuleo

I tried this on 8xRTX8000 (48GB):

vllm serve /input0/ --host 0.0.0.0 --port 8080 --dtype half --tensor-parallel-size 8 --gpu-memory-utilization 0.9 --max-model-len 65535 --served-model-name llama4-scout

  • vllm ==0.8.3
  • transformers == 4.51.0
TypeError: XFormersImpl.__init__() got an unexpected keyword argument 'use_irope'

Uhao-P avatar Apr 08 '25 12:04 Uhao-P

I am facing a similar problem, even though I am disabling the compile cache as the blog suggests.

Command used

VLLM_DISABLE_COMPILE_CACHE=1  vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct     --tokenizer meta-llama/Llama-4-Scout-17B-16E-Instruct     --host "0.0.0.0"     --port 5000     --gpu-memory-utilization 0.99     --served-model-name "Llama-4-Scout-17B-16E-Instruct"     --max-num-batched-tokens 8192     --max-num-seqs 32     --max-model-len 8192   --limit-mm-per-prompt image=10   --quantization bitsandbytes   --load-format bitsandbytes

Error gotten

torch.compile is turned on, but the model meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.

Logs

INFO 04-08 07:27:37 [__init__.py:239] Automatically detected platform cuda.
INFO 04-08 07:27:37 [api_server.py:1034] vLLM API server version 0.8.3
INFO 04-08 07:27:37 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='meta-llama/Llama-4-Scout-17B-16E-Instruct', config='', host='0.0.0.0', port=5000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-4-Scout-17B-16E-Instruct', task='auto', tokenizer='meta-llama/Llama-4-Scout-17B-16E-Instruct', hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='bitsandbytes', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.99, num_gpu_blocks_override=None, max_num_batched_tokens=8192, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=32, max_logprobs=20, disable_log_stats=False, quantization='bitsandbytes', rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 10}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Llama-4-Scout-17B-16E-Instruct'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, 
disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x724921b96b60>)
INFO 04-08 07:27:42 [config.py:600] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
WARNING 04-08 07:27:42 [config.py:679] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-08 07:27:42 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 04-08 07:27:45 [__init__.py:239] Automatically detected platform cuda.
INFO 04-08 07:27:46 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='meta-llama/Llama-4-Scout-17B-16E-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-4-Scout-17B-16E-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Llama-4-Scout-17B-16E-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-08 07:27:46 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x716447e1d070>
INFO 04-08 07:27:46 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-08 07:27:46 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 04-08 07:27:50 [gpu_model_runner.py:1258] Starting to load model meta-llama/Llama-4-Scout-17B-16E-Instruct...
INFO 04-08 07:27:50 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
WARNING 04-08 07:27:50 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
WARNING 04-08 07:27:50 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
ERROR 04-08 07:27:50 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-08 07:27:50 [core.py:390]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-08 07:27:50 [core.py:390]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 319, in __init__
ERROR 04-08 07:27:50 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 67, in __init__
ERROR 04-08 07:27:50 [core.py:390]     self.model_executor = executor_class(vllm_config)
ERROR 04-08 07:27:50 [core.py:390]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-08 07:27:50 [core.py:390]     self._init_executor()
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 04-08 07:27:50 [core.py:390]     self.collective_rpc("load_model")
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-08 07:27:50 [core.py:390]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-08 07:27:50 [core.py:390]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2347, in run_method
ERROR 04-08 07:27:50 [core.py:390]     return func(*args, **kwargs)
ERROR 04-08 07:27:50 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 136, in load_model
ERROR 04-08 07:27:50 [core.py:390]     self.model_runner.load_model()
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1261, in load_model
ERROR 04-08 07:27:50 [core.py:390]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-08 07:27:50 [core.py:390]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-08 07:27:50 [core.py:390]     return loader.load_model(vllm_config=vllm_config)
ERROR 04-08 07:27:50 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1278, in load_model
ERROR 04-08 07:27:50 [core.py:390]     model = _initialize_model(vllm_config=vllm_config)
ERROR 04-08 07:27:50 [core.py:390]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 127, in _initialize_model
ERROR 04-08 07:27:50 [core.py:390]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-08 07:27:50 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama4.py", line 713, in __init__
ERROR 04-08 07:27:50 [core.py:390]     self.language_model = init_vllm_registered_model(
ERROR 04-08 07:27:50 [core.py:390]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
ERROR 04-08 07:27:50 [core.py:390]     return _initialize_model(vllm_config=vllm_config, prefix=prefix)
ERROR 04-08 07:27:50 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 127, in _initialize_model
ERROR 04-08 07:27:50 [core.py:390]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-08 07:27:50 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 479, in __init__
ERROR 04-08 07:27:50 [core.py:390]     LlamaForCausalLM.__init__(self,
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 486, in __init__
ERROR 04-08 07:27:50 [core.py:390]     self.model = self._init_model(vllm_config=vllm_config,
ERROR 04-08 07:27:50 [core.py:390]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 488, in _init_model
ERROR 04-08 07:27:50 [core.py:390]     return Llama4Model(vllm_config=vllm_config,
ERROR 04-08 07:27:50 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in __init__
ERROR 04-08 07:27:50 [core.py:390]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 334, in __init__
ERROR 04-08 07:27:50 [core.py:390]     super().__init__(vllm_config=vllm_config,
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in __init__
ERROR 04-08 07:27:50 [core.py:390]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 321, in __init__
ERROR 04-08 07:27:50 [core.py:390]     self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 04-08 07:27:50 [core.py:390]                                                     ^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 610, in make_layers
ERROR 04-08 07:27:50 [core.py:390]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 04-08 07:27:50 [core.py:390]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 323, in <lambda>
ERROR 04-08 07:27:50 [core.py:390]     lambda prefix: layer_type(config=config,
ERROR 04-08 07:27:50 [core.py:390]                    ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 283, in __init__
ERROR 04-08 07:27:50 [core.py:390]     self.feed_forward = Llama4MoE(
ERROR 04-08 07:27:50 [core.py:390]                         ^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 73, in __init__
ERROR 04-08 07:27:50 [core.py:390]     self.experts = FusedMoE(
ERROR 04-08 07:27:50 [core.py:390]                    ^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 502, in __init__
ERROR 04-08 07:27:50 [core.py:390]     assert self.quant_method is not None
ERROR 04-08 07:27:50 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 07:27:50 [core.py:390] AssertionError
ERROR 04-08 07:27:50 [core.py:390] 
CRITICAL 04-08 07:27:50 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed

hdnh2006 avatar Apr 08 '25 14:04 hdnh2006

I am facing a similar problem, even though I am disabling the compile cache as the blog suggests.

Command used

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tokenizer meta-llama/Llama-4-Scout-17B-16E-Instruct --host "0.0.0.0" --port 5000 --gpu-memory-utilization 0.99 --served-model-name "Llama-4-Scout-17B-16E-Instruct" --max-num-batched-tokens 8192 --max-num-seqs 32 --max-model-len 8192 --limit-mm-per-prompt image=10 --quantization bitsandbytes --load-format bitsandbytes

Error gotten

torch.compile is turned on, but the model meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.

Remove the quantization and load-format flags from the command.
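
i.e. the same command as above with the bitsandbytes flags dropped (everything else unchanged):

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tokenizer meta-llama/Llama-4-Scout-17B-16E-Instruct --host "0.0.0.0" --port 5000 --gpu-memory-utilization 0.99 --served-model-name "Llama-4-Scout-17B-16E-Instruct" --max-num-batched-tokens 8192 --max-num-seqs 32 --max-model-len 8192 --limit-mm-per-prompt image=10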

rdodev avatar Apr 08 '25 15:04 rdodev

@duyvuleo @Uhao-P For the use_irope error, my recent PR #16212 should have fixed it. Could you update to a build that includes this PR?
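
One way to pick that up before it lands in a release is to build vLLM from source; a sketch (an editable install compiles the CUDA kernels, which can take a while):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .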

sarckk avatar Apr 09 '25 03:04 sarckk

Remove the quantization and load-format flags from the command.

If I do this, the model will be loaded in full precision, so I will need a bigger GPU. This is not a solution to the problem I am facing; vLLM should already support this quantization, right?

hdnh2006 avatar Apr 09 '25 08:04 hdnh2006

Remove the quantization and load-format flags from the command.

If I do this, the model will be loaded in full precision, so I will need a bigger GPU. This is not a solution to the problem I am facing; vLLM should already support this quantization, right?

It is the only solution until quantized weights are released. It's not a vLLM issue; it's a Llama 4 issue.

rdodev avatar Apr 09 '25 13:04 rdodev

Here is an FP8-quantized Llama-4-Scout model: https://huggingface.co/RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic

eldarkurtic avatar Apr 10 '25 11:04 eldarkurtic

Make sure you are on the latest version of vLLM, since there were some recent bug fixes specifically for the Scout model (e.g. ec7da6fcf32fc05efe5d7ba30d01d3d940f12a3c).
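
For instance (assuming a released version already contains that commit; otherwise build from source as sketched above):

pip install --upgrade vllm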

eldarkurtic avatar Apr 10 '25 11:04 eldarkurtic

Facing the same issue with this command:

CUDA_VISIBLE_DEVICES=2,3,4,5 VLLM_DISABLE_COMPILE_CACHE=1 vllm serve /model/Llama-4-Scout-17B-16E-Instruct/ --device cuda --override-generation-config='{"attn_temperature_tuning": true}' --max-model-len 524288 --tensor-parallel-size 4 --host 0.0.0.0 --port 8006

Specs: 4 x A100 (80GB), vLLM version: v0.8.3

harishd1998 avatar Apr 11 '25 05:04 harishd1998

Here is an FP8-quantized Llama-4-Scout model: https://huggingface.co/RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic

Hello, were you able to get this working? If so, what was your vllm server command?

Loc8888 avatar Apr 11 '25 17:04 Loc8888

For example:

vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic -tp NUM_GPUS --max-model-len 16384

works fine on my end (make sure to use the latest vLLM).
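
Once it is up, a quick smoke test against the OpenAI-compatible endpoint (assuming the default port 8000 and the default served model name):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic", "messages": [{"role": "user", "content": "Say hello"}]}'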

eldarkurtic avatar Apr 11 '25 18:04 eldarkurtic

@rdodev could you confirm the issue is resolved on v0.8.5?

yeqcharlotte avatar May 10 '25 04:05 yeqcharlotte

@yeqcharlotte the issue has been resolved.

rdodev avatar May 10 '25 13:05 rdodev