async rollout fails with 'NoneType' object has no attribute 'result' error
System Info
root@pool0-01705:~/src/verl_main_fp8/verl# python scripts/diagnose.py
----------Python Info----------
Version : 3.12.12
Compiler : GCC 11.4.0
Build : ('main', 'Oct 10 2025 08:52:57')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 25.3
Directory : /usr/local/lib/python3.12/dist-packages/pip
vllm : 0.11.2+cu129
sglang : not found.
ray : 2.51.1
torch : 2.9.0+cu129
----------verl Info-----------
Version : 0.7.0.dev
Directory : /lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl
Commit Hash : 152ca54fa7e68ce03b3b960c793d2a7a5becf8de
----------Platform Info----------
Platform : Linux-5.15.0-1063-nvidia-x86_64-with-glibc2.35
system : Linux
node : pool0-01705
release : 5.15.0-1063-nvidia
version : #64-Ubuntu SMP Fri Aug 9 17:13:45 UTC 2024
----------Environment----------
CUDA Runtime : 12.9
CUDA Compiler : Cuda compilation tools, release 12.9, V12.9.86
----------System Info----------
CPU Memory : 2015.54 GB
GPU Count : 8
GPU 1 Type : NVIDIA H100 80GB HBM3
GPU 1 Memory : 79.65 GB
GPU 2 Type : NVIDIA H100 80GB HBM3
GPU 2 Memory : 79.65 GB
GPU 3 Type : NVIDIA H100 80GB HBM3
GPU 3 Memory : 79.65 GB
GPU 4 Type : NVIDIA H100 80GB HBM3
GPU 4 Memory : 79.65 GB
GPU 5 Type : NVIDIA H100 80GB HBM3
GPU 5 Memory : 79.65 GB
GPU 6 Type : NVIDIA H100 80GB HBM3
GPU 6 Memory : 79.65 GB
GPU 7 Type : NVIDIA H100 80GB HBM3
GPU 7 Memory : 79.65 GB
GPU 8 Type : NVIDIA H100 80GB HBM3
GPU 8 Memory : 79.65 GB
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I encountered the following problem with the verlai/verl:vllm011.2.dev3 image from https://hub.docker.com/r/verlai/verl/tags
The problem can be reproduced with the examples/grpo_trainer/run_qwen2-7b.sh example using the latest code from main. Async rollout fails with the following error:
(vLLMHttpServer pid=956140) vllm version is 0.11.1 or higher, call init_app_state with 3 parameters
(vLLMHttpServer pid=956140) WARNING 11-26 05:13:30 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(vLLMHttpServer pid=956139) INFO:2025-11-26 05:13:32,072:Initializing a V1 LLM engine with config: model='Qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1536, download_dir=None, load_format=dummy, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
(TaskRunner pid=950199) AgentLoopManager: ['10.65.31.151:27177', '10.65.31.151:24475', '10.65.31.151:30833', '10.65.31.151:15975']
(TaskRunner pid=950199) Checkpoint tracker file does not exist: /lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/checkpoints/verl_grpo_example_gsm8k/qwen2_7b_function_rm/latest_checkpointed_iteration.txt
(TaskRunner pid=950199) Training from scratch
(TaskRunner pid=950199) test_gen_batch meta info: {'eos_token_id': 151645, 'pad_token_id': 151643, 'recompute_log_prob': False, 'do_sample': False, 'validate': True, 'global_steps': 0}
(pid=957185) W1126 05:13:38.672000 957185 torch/utils/cpp_extension.py:117] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
(pid=957185) WARNING:2025-11-26 05:13:38,703:fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) Process EngineCore_DP0:
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) Traceback (most recent call last):
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) self.run()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) self._target(*self._args, **self._kwargs)
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 846, in run_engine_core
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) raise e
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 835, in run_engine_core
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) engine_core.run_busy_loop()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 862, in run_busy_loop
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) self._process_engine_step()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 891, in _process_engine_step
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) outputs, model_executed = self.step_fn()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ^^^^^^^^^^^^^^
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 342, in step
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) model_output = future.result()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ^^^^^^^^^^^^^
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) AttributeError: 'NoneType' object has no attribute 'result'
(pid=957192) W1126 05:13:39.556000 957192 torch/utils/cpp_extension.py:117] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda' [repeated 7x across cluster]
(pid=957192) WARNING:2025-11-26 05:13:39,679:fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function. [repeated 7x across cluster]
(vLLMHttpServer pid=956139) WARNING 11-26 05:13:44 [async_llm.py:288] Processor has been moved under OpenAIServing and will be removed from AsyncLLM in v0.13.
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.2) with config: model='Qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1536, download_dir=None, load_format=dummy, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None},
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=705a35de25e240fb8669f1e540c9b8e5,prompt_token_ids_len=145,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1391, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={705a35de25e240fb8669f1e540c9b8e5: 145}, total_num_scheduled_tokens=145, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[10], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] EngineCore encountered a fatal error.
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] Traceback (most recent call last):
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 835, in run_engine_core
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] engine_core.run_busy_loop()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 862, in run_busy_loop
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] self._process_engine_step()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 891, in _process_engine_step
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] outputs, model_executed = self.step_fn()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] ^^^^^^^^^^^^^^
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 342, in step
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] model_output = future.result()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] ^^^^^^^^^^^^^
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] AttributeError: 'NoneType' object has no attribute 'result'
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] AsyncLLM output_handler failed.
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] Traceback (most recent call last):
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 477, in output_handler
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] outputs = await engine_core.get_output_async()
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 883, in get_output_async
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] raise self._format_exception(outputs) from None
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(vLLMHttpServer pid=956139) WARNING 11-26 05:13:31 [api_server.py:1567] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development! [repeated 3x across cluster]
(vLLMHttpServer pid=956139) vllm version is 0.11.1 or higher, call init_app_state with 3 parameters [repeated 3x across cluster]
(vLLMHttpServer pid=956142) WARNING 11-26 05:13:31 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`. [repeated 3x across cluster]
(TaskRunner pid=950199) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::AgentLoopWorker.generate_sequences() (pid=957191, ip=10.65.31.151, actor_id=ae48089452b05295ea426df401000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x15230ce970b0>)
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(TaskRunner pid=950199) return self.__get_result()
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=950199) raise self._exception
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/transferqueue_utils.py", line 191, in dummy_async_inner
(TaskRunner pid=950199) return await func(*args, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 386, in generate_sequences
(TaskRunner pid=950199) outputs = await asyncio.gather(*tasks)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 421, in _run_agent_loop
(TaskRunner pid=950199) output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) output = await self.server_manager.generate(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/rollout_trace.py", line 188, in async_wrapper
(TaskRunner pid=950199) return await func(self, *args, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 110, in generate
(TaskRunner pid=950199) output = await server.generate.remote(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) ray.exceptions.RayTaskError(EngineDeadError): ray::vLLMHttpServer.generate() (pid=956141, ip=10.65.31.151, actor_id=b5c328a6f98e7f44f22392cc01000000, repr=<verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMHttpServer object at 0x1522e7853530>)
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(TaskRunner pid=950199) return self.__get_result()
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=950199) raise self._exception
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 456, in generate
(TaskRunner pid=950199) async for output in generator:
(TaskRunner pid=950199) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 405, in generate
(TaskRunner pid=950199) q = await self.add_request(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 276, in add_request
(TaskRunner pid=950199) raise EngineDeadError()
(TaskRunner pid=950199) vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(TaskRunner pid=950199) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::AgentLoopWorker.generate_sequences() (pid=957193, ip=10.65.31.151, actor_id=322b0fe0a409833940308b2201000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x15230cddf290>)
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
(TaskRunner pid=950199) return self.__get_result()
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=950199) raise self._exception
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/transferqueue_utils.py", line 191, in dummy_async_inner
(TaskRunner pid=950199) return await func(*args, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 386, in generate_sequences
(TaskRunner pid=950199) outputs = await asyncio.gather(*tasks)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 421, in _run_agent_loop
(TaskRunner pid=950199) output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) output = await self.server_manager.generate(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/rollout_trace.py", line 188, in async_wrapper
(TaskRunner pid=950199) return await func(self, *args, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 110, in generate
(TaskRunner pid=950199) output = await server.generate.remote(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) ray.exceptions.RayTaskError(EngineDeadError): ray::vLLMHttpServer.generate() (pid=956141, ip=10.65.31.151, actor_id=b5c328a6f98e7f44f22392cc01000000, repr=<verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMHttpServer object at 0x1522e7853530>)
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(TaskRunner pid=950199) return self.__get_result()
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=950199) raise self._exception
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 456, in generate
(TaskRunner pid=950199) async for output in generator:
(TaskRunner pid=950199) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 405, in generate
(TaskRunner pid=950199) q = await self.add_request(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 276, in add_request
(TaskRunner pid=950199) raise EngineDeadError()
(TaskRunner pid=950199) vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/lustre/fsw/portfolios/coreai/users/shuangy/data/gsm8k/train.parquet', 'data.val_files=/lustre/fsw/portfolios/coreai/users/shuangy/data/gsm8k/test.parquet', 'data.train_batch_size=1024', 'data.max_prompt_length=512', 'data.max_response_length=1024', 'data.filter_overlong_prompts=True', 'data.truncation=error', 'actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.actor.entropy_coeff=0', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.n=5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.use_kl_in_reward=False', 'trainer.critic_warmup=0', 'trainer.logger=["console"]', 'trainer.project_name=verl_grpo_example_gsm8k', 'trainer.experiment_name=qwen2_7b_function_rm', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=20', 'trainer.test_freq=5', 'trainer.total_epochs=15']
Traceback (most recent call last):
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/trainer/main_ppo.py", line 43, in main
run_ppo(config)
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/trainer/main_ppo.py", line 97, in run_ppo
ray.get(runner.run.remote(config))
File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2961, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1026, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(EngineDeadError): ray::TaskRunner.run() (pid=950199, ip=10.65.31.151, actor_id=5cf383b09c5d96b8658917ec01000000, repr=<main_ppo.TaskRunner object at 0x155551284c50>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/trainer/main_ppo.py", line 366, in run
trainer.fit()
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/trainer/ppo/ray_trainer.py", line 996, in fit
val_metrics = self._validate()
^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/trainer/ppo/ray_trainer.py", line 593, in _validate
test_output_gen_batch_padded = self.async_rollout_manager.generate_sequences(test_gen_batch_padded)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 800, in generate_sequences
outputs = ray.get(
^^^^^^^^
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(EngineDeadError): ray::AgentLoopWorker.generate_sequences() (pid=957185, ip=10.65.31.151, actor_id=67d7ba05eb11c8796a0b3b6b01000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x15230ce83290>)
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/transferqueue_utils.py", line 191, in dummy_async_inner
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 386, in generate_sequences
outputs = await asyncio.gather(*tasks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 421, in _run_agent_loop
output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/single_turn_agent_loop.py", line 66, in run
output = await self.server_manager.generate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/rollout_trace.py", line 188, in async_wrapper
return await func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 110, in generate
output = await server.generate.remote(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(EngineDeadError): ray::vLLMHttpServer.generate() (pid=956141, ip=10.65.31.151, actor_id=b5c328a6f98e7f44f22392cc01000000, repr=<verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMHttpServer object at 0x1522e7853530>)
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 456, in generate
async for output in generator:
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 405, in generate
q = await self.add_request(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 276, in add_request
raise EngineDeadError()
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(vLLMHttpServer pid=956141) WARNING 11-26 05:13:44 [async_llm.py:288] Processor has been moved under OpenAIServing and will be removed from AsyncLLM in v0.13. [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.2) with config: model='Qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1536, download_dir=None, load_format=dummy, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}, [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=34e88a2459c1456bae942095edd523a0,prompt_token_ids_len=60,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1476, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([1, 2, 3, 4],),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={34e88a2459c1456bae942095edd523a0: 60}, total_num_scheduled_tokens=60, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[4], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null) [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] EngineCore encountered a fatal error. [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] Traceback (most recent call last): [repeated 6x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 835, in run_engine_core [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] engine_core.run_busy_loop() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 862, in run_busy_loop [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] self._process_engine_step() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 891, in _process_engine_step [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] outputs, model_executed = self.step_fn() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] ^^^^^^^^^^^^^^ [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 342, in step [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] model_output = future.result() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] ^^^^^^^^^^^^^ [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] AttributeError: 'NoneType' object has no attribute 'result' [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] AsyncLLM output_handler failed. [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 477, in output_handler [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] outputs = await engine_core.get_output_async() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 883, in get_output_async [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] raise self._format_exception(outputs) from None [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause. [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) Process EngineCore_DP0: [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) Traceback (most recent call last): [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) self.run() [repeated 3x across cluster]
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/single_turn_agent_loop.py", line 66, in run [repeated 5x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) self._target(*self._args, **self._kwargs) [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 835, in run_engine_core [repeated 6x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) raise e [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) engine_core.run_busy_loop() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 862, in run_busy_loop [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) self._process_engine_step() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 891, in _process_engine_step [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) outputs, model_executed = self.step_fn() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ^^^^^^^^^^^^^^ [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 342, in step [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) model_output = future.result() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ^^^^^^^^^^^^^ [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) AttributeError: 'NoneType' object has no attribute 'result' [repeated 3x across cluster]
Could anyone advise on the root cause and solution? Thanks!
Expected behavior
Training runs successfully.
The same error occurs when running with the verlai/verl:vllm011.2.dev image.
SGLang is not found in that image either.
The problem appears to be caused by an interface change in vLLM: newer versions (0.11.1+) call the executor's collective_rpc with non_block=True and expect a list of Future objects back, on which the engine core then calls .result() (the `model_output = future.result()` line in the traceback). verl's override in vllm_async_server.py did not implement non_block, so the engine ended up calling .result() on None. Updating the collective_rpc function in vllm_async_server.py solves the problem:
```python
# Imports used by this snippet (most should already be present in
# vllm_async_server.py):
import pickle
import threading
from collections.abc import Callable
from concurrent.futures import Future
from typing import Any, Optional

import zmq


def collective_rpc(
    self,
    method: str | Callable,
    timeout: Optional[float] = None,
    args: tuple = (),
    kwargs: Optional[dict[str, Any]] = None,
    non_block: bool = False,
    **kwargs_extra: Any,
) -> list[Any]:
    """Execute RPC call on all workers via ZeroMQ.

    Args:
        method: Method name or callable to execute.
        timeout: Timeout for the operation (currently unused).
        args: Positional arguments.
        kwargs: Keyword arguments.
        non_block: If True, return Future objects for async execution.
            If False (default), block until completion for backward compatibility.
        **kwargs_extra: Additional keyword arguments (for compatibility).

    Returns:
        List of results from all workers, or list of Futures if non_block=True.
    """
    if isinstance(method, str):
        sent_method = method
    else:
        sent_method = pickle.dumps(method)
    del method

    message = pickle.dumps((sent_method, args, kwargs or {}))
    for socket in self.sockets:
        socket.send(message, zmq.DONTWAIT)

    if non_block:
        # For async execution, return Future objects that a background
        # thread resolves once each worker replies.
        futures = []
        for socket in self.sockets:
            future = Future()

            def _recv_async(sock, fut):
                try:
                    output = pickle.loads(sock.recv())
                    if isinstance(output, Exception):
                        fut.set_exception(output)
                    else:
                        fut.set_result(output)
                except Exception as e:
                    fut.set_exception(e)

            # Start a thread to receive the result asynchronously.
            thread = threading.Thread(target=_recv_async, args=(socket, future))
            thread.daemon = True
            thread.start()
            futures.append(future)
        return futures
    else:
        # Blocking execution: maintain backward compatibility with vllm 0.11.0.
        outputs = []
        for socket in self.sockets:
            outputs.append(pickle.loads(socket.recv()))
        for output in outputs:
            if isinstance(output, Exception):
                raise output
        return outputs
```
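For context, here is a minimal sketch of how the engine core exercises this method after the fix. The `executor` and `scheduler_output` names are hypothetical stand-ins inferred from the core.py traceback above, not exact vLLM or verl API:

```python
# Hypothetical illustration of the vLLM >= 0.11.1 call pattern visible in the
# traceback (vllm/v1/engine/core.py, step). Names here are stand-ins.
futures = executor.collective_rpc(
    "execute_model", args=(scheduler_output,), non_block=True
)
model_output = futures[0].result()  # was .result() on None -> AttributeError
```

With non_block=True the patched method returns real Future objects, so .result() blocks until the worker replies instead of raising the AttributeError.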