[Feature]: torch compile for llama4 unsloth

Open ErykCh opened this issue 8 months ago • 8 comments

🚀 The feature, motivation and pitch

Hi,

When serving unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-dynamic-bnb-4bit, vLLM prints the warning below, so I'm opening the issue it asks for:

`torch.compile` is turned on, but the model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-dynamic-bnb-4bit does not support it. Please open an issue on GitHub if you want it to be supported.
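For reference, the warning shows up with a launch command along these lines (a sketch based on the docker invocations later in this thread, not necessarily the exact flags used):

vllm serve unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-dynamic-bnb-4bit --load-format bitsandbytes --quantization bitsandbytes --served-model-name llm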

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

ErykCh avatar Apr 07 '25 07:04 ErykCh

@houseroad, how would you suggest we quantize Scout to FP8? Using --quantization fp8 doesn't work.

varad0309 avatar Apr 07 '25 21:04 varad0309

@varad0309 can you try --quantization fp8?

sarckk avatar Apr 07 '25 23:04 sarckk

@varad0309 can you try --quantization fp8?

Hey @sarckk, sorry, I did mean quantization; that was a typo.

varad0309 avatar Apr 07 '25 23:04 varad0309
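
For reference, the failing run below corresponds to a command roughly like this, reconstructed from the Namespace args in the log (a sketch, not the verbatim invocation):

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --quantization fp8 --tensor-parallel-size 2 --max-model-len 128000 --max-num-seqs 32 --max-num-batched-tokens 2048 --trust-remote-code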

INFO 04-07 16:45:57 [__init__.py:239] Automatically detected platform cuda.
INFO 04-07 16:46:00 [api_server.py:1034] vLLM API server version 0.8.3
INFO 04-07 16:46:00 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='meta-llama/Llama-4-Scout-17B-16E-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-4-Scout-17B-16E-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=128000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=2048, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=32, max_logprobs=20, disable_log_stats=False, quantization='fp8', rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, 
dispatch_function=<function ServeSubcommand.cmd at 0x7f266794d1c0>)
INFO 04-07 16:46:07 [config.py:600] This model supports multiple tasks: {'generate', 'score', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 04-07 16:46:07 [config.py:1600] Defaulting to use mp for distributed inference
INFO 04-07 16:46:07 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-07 16:46:13 [__init__.py:239] Automatically detected platform cuda.
INFO 04-07 16:46:16 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='meta-llama/Llama-4-Scout-17B-16E-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-4-Scout-17B-16E-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=meta-llama/Llama-4-Scout-17B-16E-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-07 16:46:16 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-07 16:46:16 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_d60c8d11'), local_subscribe_addr='ipc:///tmp/d312e283-588c-46fd-8e19-1c2a2a18118d', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-07 16:46:19 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-07 16:46:22 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f4694faa120>
(VllmWorker rank=0 pid=223) INFO 04-07 16:46:22 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_53e235ca'), local_subscribe_addr='ipc:///tmp/41f585f2-ca18-4a3e-8020-05ebceec4eef', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-07 16:46:25 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-07 16:46:28 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f10c14b7950>
(VllmWorker rank=1 pid=240) INFO 04-07 16:46:28 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_95676dd1'), local_subscribe_addr='ipc:///tmp/765a74a3-0999-44ea-8d3f-f05417b23471', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=240) INFO 04-07 16:46:28 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=240) INFO 04-07 16:46:28 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=223) INFO 04-07 16:46:28 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=223) INFO 04-07 16:46:28 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=223) INFO 04-07 16:46:28 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=223) INFO 04-07 16:46:42 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=240) INFO 04-07 16:46:42 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=223) INFO 04-07 16:46:42 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_c4b17dfb'), local_subscribe_addr='ipc:///tmp/16606fd1-6cc5-4a5e-a75f-1692326876fd', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=240) INFO 04-07 16:46:42 [parallel_state.py:957] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=0 pid=223) INFO 04-07 16:46:42 [parallel_state.py:957] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=1 pid=240) INFO 04-07 16:46:42 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=223) INFO 04-07 16:46:42 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=223) INFO 04-07 16:46:48 [gpu_model_runner.py:1258] Starting to load model meta-llama/Llama-4-Scout-17B-16E-Instruct...
(VllmWorker rank=1 pid=240) INFO 04-07 16:46:48 [gpu_model_runner.py:1258] Starting to load model meta-llama/Llama-4-Scout-17B-16E-Instruct...
(VllmWorker rank=0 pid=223) INFO 04-07 16:46:48 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=1 pid=240) INFO 04-07 16:46:48 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=0 pid=223) WARNING 04-07 16:46:48 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=223) WARNING 04-07 16:46:48 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=240) WARNING 04-07 16:46:48 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=240) WARNING 04-07 16:46:48 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=223) Process SpawnProcess-1:1:
CRITICAL 04-07 16:46:49 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-07 16:46:49 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed

varad0309 avatar Apr 07 '25 23:04 varad0309

You can try removing the torch.compile cache and see if it makes any difference. Or try VLLM_DISABLE_COMPILE_CACHE=1 to disable the torch.compile cache.
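
Concretely, that amounts to something like the following (the cache path assumes vLLM's default ~/.cache/vllm cache root, which matches the volume mounts in the docker commands later in this thread; a sketch, untested here):

# remove vLLM's on-disk torch.compile artifacts (default cache root assumed)
rm -rf ~/.cache/vllm/torch_compile_cache
# or disable the compile cache for a single run
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --quantization fp8 --tensor-parallel-size 2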

Likely it's not due to torch.compile but to something else; it seems the stack trace got lost in the middle.

houseroad avatar Apr 08 '25 05:04 houseroad

I have the same question:

INFO 04-08 14:24:51 [__init__.py:239] Automatically detected platform cuda. INFO 04-08 14:24:59 [api_server.py:1034] vLLM API server version 0.8.3 INFO 04-08 14:24:59 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='/mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct', config='', host=None, port=10000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, 
reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f0644cf2ca0>) INFO 04-08 14:25:40 [config.py:600] This model supports multiple tasks: {'classify', 'generate', 'score', 'embed', 'reward'}. Defaulting to 'generate'. INFO 04-08 14:25:41 [config.py:1600] Defaulting to use mp for distributed inference INFO 04-08 14:25:41 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=2048. INFO 04-08 14:26:12 [__init__.py:239] Automatically detected platform cuda. INFO 04-08 14:26:24 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct', speculative_config=None, tokenizer='/mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10485760, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512} WARNING 04-08 14:26:24 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 04-08 14:26:24 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_8aabf5a8'), local_subscribe_addr='ipc:///tmp/b2d88675-6615-4206-96a7-6d8b58ffe05d', remote_subscribe_addr=None, remote_addr_ipv6=False) INFO 04-08 14:26:44 [__init__.py:239] Automatically detected platform cuda. 
WARNING 04-08 14:26:50 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ffaecffe480> (VllmWorker rank=0 pid=170935) INFO 04-08 14:26:51 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1fcba8d2'), local_subscribe_addr='ipc:///tmp/86271b7d-e816-41b1-96f2-dec446ba88c4', remote_subscribe_addr=None, remote_addr_ipv6=False) INFO 04-08 14:27:03 [__init__.py:239] Automatically detected platform cuda. WARNING 04-08 14:27:09 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fbf91af28d0> (VllmWorker rank=1 pid=171494) INFO 04-08 14:27:09 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_d1dd3dcb'), local_subscribe_addr='ipc:///tmp/f5f546f3-cd7a-4e8f-9044-15db8b5e1db9', remote_subscribe_addr=None, remote_addr_ipv6=False) (VllmWorker rank=0 pid=170935) INFO 04-08 14:27:10 [utils.py:990] Found nccl from library libnccl.so.2 (VllmWorker rank=0 pid=170935) INFO 04-08 14:27:10 [pynccl.py:69] vLLM is using nccl==2.21.5 (VllmWorker rank=1 pid=171494) INFO 04-08 14:27:10 [utils.py:990] Found nccl from library libnccl.so.2 (VllmWorker rank=1 pid=171494) INFO 04-08 14:27:10 [pynccl.py:69] vLLM is using nccl==2.21.5 (VllmWorker rank=1 pid=171494) INFO 04-08 14:27:10 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /mnt/petrelfs/zhupengyu1/.cache/vllm/gpu_p2p_access_cache_for_4,5,6,7.json (VllmWorker rank=0 pid=170935) INFO 04-08 14:27:10 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /mnt/petrelfs/zhupengyu1/.cache/vllm/gpu_p2p_access_cache_for_4,5,6,7.json (VllmWorker rank=0 pid=170935) INFO 04-08 14:27:10 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_bb129e18'), local_subscribe_addr='ipc:///tmp/bf594ce2-2c21-4d4c-831a-17f698071651', remote_subscribe_addr=None, remote_addr_ipv6=False) (VllmWorker rank=0 pid=170935) INFO 04-08 14:27:10 [parallel_state.py:957] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0 (VllmWorker rank=0 pid=170935) INFO 04-08 14:27:10 [cuda.py:221] Using Flash Attention backend on V1 engine. (VllmWorker rank=1 pid=171494) INFO 04-08 14:27:10 [parallel_state.py:957] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1 (VllmWorker rank=1 pid=171494) INFO 04-08 14:27:10 [cuda.py:221] Using Flash Attention backend on V1 engine. (VllmWorker rank=0 pid=170935) INFO 04-08 14:27:29 [gpu_model_runner.py:1258] Starting to load model /mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct... (VllmWorker rank=1 pid=171494) INFO 04-08 14:27:29 [gpu_model_runner.py:1258] Starting to load model /mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct... 
(VllmWorker rank=1 pid=171494) INFO 04-08 14:27:29 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=0 pid=170935) INFO 04-08 14:27:29 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=1 pid=171494) WARNING 04-08 14:27:29 [config.py:3785] `torch.compile` is turned on, but the model /mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=170935) WARNING 04-08 14:27:29 [config.py:3785] `torch.compile` is turned on, but the model /mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=171494) WARNING 04-08 14:27:29 [config.py:3785] `torch.compile` is turned on, but the model /mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=170935) WARNING 04-08 14:27:29 [config.py:3785] `torch.compile` is turned on, but the model /mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=171494) Process SpawnProcess-1:2:
CRITICAL 04-08 14:27:29 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-08 14:27:29 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
run.sh: line 57: 160437 Killed vllm serve /mnt/hwfile/trustai/share/models/meta-llama/Llama-4-Scout-17B-16E-Instruct --port 10000 --tensor-parallel-size 2

whfeLingYu avatar Apr 08 '25 06:04 whfeLingYu

Or try VLLM_DISABLE_COMPILE_CACHE=1 to disable torch compile cache.

I tried that yesterday, but it doesn't work.

This is my configuration; it didn't work with other flag combinations either, so I think it's a bigger problem.

docker run --runtime nvidia --gpus all -d --name vllm-Llama-Scout --restart unless-stopped -v ~/.cache/vllm:/root/.cache/vllm -v ~/.cache/huggingface:/root/.cache/huggingface -e VLLM_DISABLE_COMPILE_CACHE=1 -e RAY_ROTATION_MAX_BYTES=10241024 -e RAY_ROTATION_BACKUP_COUNT=1 -p 8000:8000 vllm/vllm-openai:v0.8.3 --model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-dynamic-bnb-4bit --served-model-name llm --compilation-config 0

docker run --runtime nvidia --gpus all -d --name vllm-Llama-Scout --restart unless-stopped -v ~/.cache/vllm:/root/.cache/vllm -v ~/.cache/huggingface:/root/.cache/huggingface -e VLLM_DISABLE_COMPILE_CACHE=1 -e RAY_ROTATION_MAX_BYTES=10241024 -e RAY_ROTATION_BACKUP_COUNT=1 -p 8000:8000 vllm/vllm-openai:v0.8.3 --model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-dynamic-bnb-4bit --served-model-name llm --compilation-config 0 --load-format bitsandbytes --quantization bitsandbytes

And I don't have a GPU that can run it in fp16, which is probably the only version that works; from what I've seen, people using the fp8 versions from unsloth hit the same problem.
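
For completeness, a variant that takes torch.compile out of the picture entirely, assuming eager mode is acceptable (--enforce-eager disables compilation and CUDA graph capture; an untested sketch):

docker run --runtime nvidia --gpus all -v ~/.cache/vllm:/root/.cache/vllm -v ~/.cache/huggingface:/root/.cache/huggingface -e VLLM_DISABLE_COMPILE_CACHE=1 -p 8000:8000 vllm/vllm-openai:v0.8.3 --model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-dynamic-bnb-4bit --served-model-name llm --load-format bitsandbytes --quantization bitsandbytes --enforce-eager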

ErykCh avatar Apr 08 '25 10:04 ErykCh

Sorry, I didn't get a chance to test this until today! Thanks for the suggestions, @houseroad! As you guessed, it didn't work. For context, I was testing this (--quantization fp8) on A100s with 2 GPUs. With 4 GPUs I don't hit this issue anymore, as it goes ahead with loading the checkpoints, but then I run into a new error (trace below).

So I'm assuming the previous error had something to do with OOM, even though the stack trace didn't reflect it. Following this thread, https://github.com/vllm-project/vllm/issues/16114#issuecomment-2786493126, am I right that FP8 support for Scout (MoE architectures) on A100 isn't available yet?

Some noob questions:

  • Is it fair to assume that, for now, we have to make do with bf16 on A100s?
  • If so, what would it take to enable fp8 compatibility on A100s, assuming that's feasible? If someone can point out some directions to dig into, I'd love to help.

cc: @sarckk @mgoin

ERROR 04-12 01:16:44 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-12 01:16:44 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-12 01:16:44 [core.py:390]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-12 01:16:44 [core.py:390]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-12 01:16:44 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 319, in __init__
ERROR 04-12 01:16:44 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-12 01:16:44 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 04-12 01:16:44 [core.py:390]     self._initialize_kv_caches(vllm_config)
ERROR 04-12 01:16:44 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 132, in _initialize_kv_caches
ERROR 04-12 01:16:44 [core.py:390]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-12 01:16:44 [core.py:390]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-12 01:16:44 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 04-12 01:16:44 [core.py:390]     output = self.collective_rpc("determine_available_memory")
ERROR 04-12 01:16:44 [core.py:390]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-12 01:16:44 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 134, in collective_rpc
ERROR 04-12 01:16:44 [core.py:390]     raise e
ERROR 04-12 01:16:44 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 123, in collective_rpc
ERROR 04-12 01:16:44 [core.py:390]     raise result
ERROR 04-12 01:16:44 [core.py:390] triton.compiler.errors.CompilationError: at 1:0:
ERROR 04-12 01:16:44 [core.py:390] def fused_moe_kernel(
ERROR 04-12 01:16:44 [core.py:390] ^
ERROR 04-12 01:16:44 [core.py:390] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
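
The fp8e4nv failure points at an architecture limitation rather than anything model-specific: Triton's fp8e4nv (FP8 E4M3) type needs native FP8 hardware, which, depending on the Triton version, means compute capability 8.9 (Ada) or 9.0 (Hopper), while the A100 is 8.0. A quick way to confirm what the GPU reports, using standard PyTorch calls (nothing vLLM-specific):

python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"

An A100 reports capability (8, 0), consistent with the unsupported-dtype error above.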

varad0309 avatar Apr 12 '25 08:04 varad0309

I get this error on an L40S, i.e. the Ada Lovelace architecture.

ErykCh avatar Apr 14 '25 11:04 ErykCh

As the stack trace notes, this is unfortunately an issue with Triton type support. I recommend opening an issue there asking whether type conversion support can be expanded.

mgoin avatar Apr 14 '25 14:04 mgoin

I fixed the confusing warning, but it looks like this issue is really about "type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')". Should we create a new issue for that or retitle the current one? (I don't have permissions to retitle.)

zou3519 avatar Apr 15 '25 13:04 zou3519

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jul 15 '25 02:07 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Aug 14 '25 02:08 github-actions[bot]