async rollout fails with 'NoneType' object has no attribute 'result' error
System Info
root@pool0-01705:~/src/verl_main_fp8/verl# python scripts/diagnose.py
----------Python Info----------
Version : 3.12.12
Compiler : GCC 11.4.0
Build : ('main', 'Oct 10 2025 08:52:57')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 25.3
Directory : /usr/local/lib/python3.12/dist-packages/pip
vllm : 0.11.2+cu129
sglang : not found.
ray : 2.51.1
torch : 2.9.0+cu129
----------verl Info-----------
Version : 0.7.0.dev
Directory : /lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl
Commit Hash : 152ca54fa7e68ce03b3b960c793d2a7a5becf8de
----------Platform Info----------
Platform : Linux-5.15.0-1063-nvidia-x86_64-with-glibc2.35
system : Linux
node : pool0-01705
release : 5.15.0-1063-nvidia
version : #64-Ubuntu SMP Fri Aug 9 17:13:45 UTC 2024
----------Environment----------
CUDA Runtime : 12.9
CUDA Compiler : Cuda compilation tools, release 12.9, V12.9.86
----------System Info----------
CPU Memory : 2015.54 GB
GPU Count : 8
GPU 1 Type : NVIDIA H100 80GB HBM3
GPU 1 Memory : 79.65 GB
GPU 2 Type : NVIDIA H100 80GB HBM3
GPU 2 Memory : 79.65 GB
GPU 3 Type : NVIDIA H100 80GB HBM3
GPU 3 Memory : 79.65 GB
GPU 4 Type : NVIDIA H100 80GB HBM3
GPU 4 Memory : 79.65 GB
GPU 5 Type : NVIDIA H100 80GB HBM3
GPU 5 Memory : 79.65 GB
GPU 6 Type : NVIDIA H100 80GB HBM3
GPU 6 Memory : 79.65 GB
GPU 7 Type : NVIDIA H100 80GB HBM3
GPU 7 Memory : 79.65 GB
GPU 8 Type : NVIDIA H100 80GB HBM3
GPU 8 Memory : 79.65 GB
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I encountered the following problem with the verlai/verl:vllm011.2.dev3 image from https://hub.docker.com/r/verlai/verl/tags
The problem can be reproduced with the examples/grpo_trainer/run_qwen2-7b.sh example using the latest code from main. Async rollout fails with the following error:
(vLLMHttpServer pid=956140) vllm version is 0.11.1 or higher, call init_app_state with 3 parameters
(vLLMHttpServer pid=956140) WARNING 11-26 05:13:30 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(vLLMHttpServer pid=956139) INFO:2025-11-26 05:13:32,072:Initializing a V1 LLM engine with config: model='Qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1536, download_dir=None, load_format=dummy, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
(TaskRunner pid=950199) AgentLoopManager: ['10.65.31.151:27177', '10.65.31.151:24475', '10.65.31.151:30833', '10.65.31.151:15975']
(TaskRunner pid=950199) Checkpoint tracker file does not exist: /lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/checkpoints/verl_grpo_example_gsm8k/qwen2_7b_function_rm/latest_checkpointed_iteration.txt
(TaskRunner pid=950199) Training from scratch
(TaskRunner pid=950199) test_gen_batch meta info: {'eos_token_id': 151645, 'pad_token_id': 151643, 'recompute_log_prob': False, 'do_sample': False, 'validate': True, 'global_steps': 0}
(pid=957185) W1126 05:13:38.672000 957185 torch/utils/cpp_extension.py:117] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
(pid=957185) WARNING:2025-11-26 05:13:38,703:fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) Process EngineCore_DP0:
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) Traceback (most recent call last):
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) self.run()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) self._target(*self._args, **self._kwargs)
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 846, in run_engine_core
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) raise e
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 835, in run_engine_core
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) engine_core.run_busy_loop()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 862, in run_busy_loop
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) self._process_engine_step()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 891, in _process_engine_step
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) outputs, model_executed = self.step_fn()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ^^^^^^^^^^^^^^
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 342, in step
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) model_output = future.result()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ^^^^^^^^^^^^^
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) AttributeError: 'NoneType' object has no attribute 'result'
(pid=957192) W1126 05:13:39.556000 957192 torch/utils/cpp_extension.py:117] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda' [repeated 7x across cluster]
(pid=957192) WARNING:2025-11-26 05:13:39,679:fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function. [repeated 7x across cluster]
(vLLMHttpServer pid=956139) WARNING 11-26 05:13:44 [async_llm.py:288] Processor has been moved under OpenAIServing and will be removed from AsyncLLM in v0.13.
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.2) with config: model='Qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1536, download_dir=None, load_format=dummy, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None},
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=705a35de25e240fb8669f1e540c9b8e5,prompt_token_ids_len=145,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1391, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={705a35de25e240fb8669f1e540c9b8e5: 145}, total_num_scheduled_tokens=145, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[10], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] EngineCore encountered a fatal error.
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] Traceback (most recent call last):
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 835, in run_engine_core
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] engine_core.run_busy_loop()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 862, in run_busy_loop
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] self._process_engine_step()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 891, in _process_engine_step
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] outputs, model_executed = self.step_fn()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] ^^^^^^^^^^^^^^
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 342, in step
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] model_output = future.result()
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] ^^^^^^^^^^^^^
(vLLMHttpServer pid=956139) (EngineCore_DP0 pid=956602) ERROR 11-26 05:13:44 [core.py:844] AttributeError: 'NoneType' object has no attribute 'result'
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] AsyncLLM output_handler failed.
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] Traceback (most recent call last):
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 477, in output_handler
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] outputs = await engine_core.get_output_async()
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 883, in get_output_async
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] raise self._format_exception(outputs) from None
(vLLMHttpServer pid=956139) ERROR 11-26 05:13:44 [async_llm.py:525] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(vLLMHttpServer pid=956139) WARNING 11-26 05:13:31 [api_server.py:1567] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development! [repeated 3x across cluster]
(vLLMHttpServer pid=956139) vllm version is 0.11.1 or higher, call init_app_state with 3 parameters [repeated 3x across cluster]
(vLLMHttpServer pid=956142) WARNING 11-26 05:13:31 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`. [repeated 3x across cluster]
(TaskRunner pid=950199) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::AgentLoopWorker.generate_sequences() (pid=957191, ip=10.65.31.151, actor_id=ae48089452b05295ea426df401000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x15230ce970b0>)
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(TaskRunner pid=950199) return self.__get_result()
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=950199) raise self._exception
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/transferqueue_utils.py", line 191, in dummy_async_inner
(TaskRunner pid=950199) return await func(*args, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 386, in generate_sequences
(TaskRunner pid=950199) outputs = await asyncio.gather(*tasks)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 421, in _run_agent_loop
(TaskRunner pid=950199) output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) output = await self.server_manager.generate(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/rollout_trace.py", line 188, in async_wrapper
(TaskRunner pid=950199) return await func(self, *args, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 110, in generate
(TaskRunner pid=950199) output = await server.generate.remote(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) ray.exceptions.RayTaskError(EngineDeadError): ray::vLLMHttpServer.generate() (pid=956141, ip=10.65.31.151, actor_id=b5c328a6f98e7f44f22392cc01000000, repr=<verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMHttpServer object at 0x1522e7853530>)
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(TaskRunner pid=950199) return self.__get_result()
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=950199) raise self._exception
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 456, in generate
(TaskRunner pid=950199) async for output in generator:
(TaskRunner pid=950199) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 405, in generate
(TaskRunner pid=950199) q = await self.add_request(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 276, in add_request
(TaskRunner pid=950199) raise EngineDeadError()
(TaskRunner pid=950199) vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(TaskRunner pid=950199) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::AgentLoopWorker.generate_sequences() (pid=957193, ip=10.65.31.151, actor_id=322b0fe0a409833940308b2201000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x15230cddf290>)
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
(TaskRunner pid=950199) return self.__get_result()
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=950199) raise self._exception
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/transferqueue_utils.py", line 191, in dummy_async_inner
(TaskRunner pid=950199) return await func(*args, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 386, in generate_sequences
(TaskRunner pid=950199) outputs = await asyncio.gather(*tasks)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 421, in _run_agent_loop
(TaskRunner pid=950199) output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) output = await self.server_manager.generate(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/rollout_trace.py", line 188, in async_wrapper
(TaskRunner pid=950199) return await func(self, *args, **kwargs)
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 110, in generate
(TaskRunner pid=950199) output = await server.generate.remote(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) ray.exceptions.RayTaskError(EngineDeadError): ray::vLLMHttpServer.generate() (pid=956141, ip=10.65.31.151, actor_id=b5c328a6f98e7f44f22392cc01000000, repr=<verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMHttpServer object at 0x1522e7853530>)
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(TaskRunner pid=950199) return self.__get_result()
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=950199) raise self._exception
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 456, in generate
(TaskRunner pid=950199) async for output in generator:
(TaskRunner pid=950199) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 405, in generate
(TaskRunner pid=950199) q = await self.add_request(
(TaskRunner pid=950199) ^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=950199) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 276, in add_request
(TaskRunner pid=950199) raise EngineDeadError()
(TaskRunner pid=950199) vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/lustre/fsw/portfolios/coreai/users/shuangy/data/gsm8k/train.parquet', 'data.val_files=/lustre/fsw/portfolios/coreai/users/shuangy/data/gsm8k/test.parquet', 'data.train_batch_size=1024', 'data.max_prompt_length=512', 'data.max_response_length=1024', 'data.filter_overlong_prompts=True', 'data.truncation=error', 'actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.actor.entropy_coeff=0', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.n=5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.use_kl_in_reward=False', 'trainer.critic_warmup=0', 'trainer.logger=["console"]', 'trainer.project_name=verl_grpo_example_gsm8k', 'trainer.experiment_name=qwen2_7b_function_rm', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=20', 'trainer.test_freq=5', 'trainer.total_epochs=15']
Traceback (most recent call last):
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/trainer/main_ppo.py", line 43, in main
run_ppo(config)
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/trainer/main_ppo.py", line 97, in run_ppo
ray.get(runner.run.remote(config))
File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2961, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1026, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(EngineDeadError): ray::TaskRunner.run() (pid=950199, ip=10.65.31.151, actor_id=5cf383b09c5d96b8658917ec01000000, repr=<main_ppo.TaskRunner object at 0x155551284c50>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/trainer/main_ppo.py", line 366, in run
trainer.fit()
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/trainer/ppo/ray_trainer.py", line 996, in fit
val_metrics = self._validate()
^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/trainer/ppo/ray_trainer.py", line 593, in _validate
test_output_gen_batch_padded = self.async_rollout_manager.generate_sequences(test_gen_batch_padded)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 800, in generate_sequences
outputs = ray.get(
^^^^^^^^
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(EngineDeadError): ray::AgentLoopWorker.generate_sequences() (pid=957185, ip=10.65.31.151, actor_id=67d7ba05eb11c8796a0b3b6b01000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x15230ce83290>)
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/transferqueue_utils.py", line 191, in dummy_async_inner
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 386, in generate_sequences
outputs = await asyncio.gather(*tasks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 421, in _run_agent_loop
output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/single_turn_agent_loop.py", line 66, in run
output = await self.server_manager.generate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/utils/rollout_trace.py", line 188, in async_wrapper
return await func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/agent_loop.py", line 110, in generate
output = await server.generate.remote(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(EngineDeadError): ray::vLLMHttpServer.generate() (pid=956141, ip=10.65.31.151, actor_id=b5c328a6f98e7f44f22392cc01000000, repr=<verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMHttpServer object at 0x1522e7853530>)
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 456, in generate
async for output in generator:
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 405, in generate
q = await self.add_request(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 276, in add_request
raise EngineDeadError()
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(vLLMHttpServer pid=956141) WARNING 11-26 05:13:44 [async_llm.py:288] Processor has been moved under OpenAIServing and will be removed from AsyncLLM in v0.13. [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.2) with config: model='Qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1536, download_dir=None, load_format=dummy, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}, [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=34e88a2459c1456bae942095edd523a0,prompt_token_ids_len=60,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1476, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([1, 2, 3, 4],),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={34e88a2459c1456bae942095edd523a0: 60}, total_num_scheduled_tokens=60, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[4], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null) [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] EngineCore encountered a fatal error. [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] Traceback (most recent call last): [repeated 6x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 835, in run_engine_core [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] engine_core.run_busy_loop() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 862, in run_busy_loop [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] self._process_engine_step() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 891, in _process_engine_step [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] outputs, model_executed = self.step_fn() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] ^^^^^^^^^^^^^^ [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 342, in step [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] model_output = future.result() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] ^^^^^^^^^^^^^ [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ERROR 11-26 05:13:44 [core.py:844] AttributeError: 'NoneType' object has no attribute 'result' [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] AsyncLLM output_handler failed. [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 477, in output_handler [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] outputs = await engine_core.get_output_async() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 883, in get_output_async [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] raise self._format_exception(outputs) from None [repeated 3x across cluster]
(vLLMHttpServer pid=956141) ERROR 11-26 05:13:44 [async_llm.py:525] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause. [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) Process EngineCore_DP0: [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) Traceback (most recent call last): [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) self.run() [repeated 3x across cluster]
(TaskRunner pid=950199) File "/lustre/fsw/portfolios/coreai/users/shuangy/src/verl_main_fp8/verl/verl/experimental/agent_loop/single_turn_agent_loop.py", line 66, in run [repeated 5x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) self._target(*self._args, **self._kwargs) [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 835, in run_engine_core [repeated 6x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) raise e [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) engine_core.run_busy_loop() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 862, in run_busy_loop [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) self._process_engine_step() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 891, in _process_engine_step [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) outputs, model_executed = self.step_fn() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ^^^^^^^^^^^^^^ [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 342, in step [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) model_output = future.result() [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) ^^^^^^^^^^^^^ [repeated 3x across cluster]
(vLLMHttpServer pid=956141) (EngineCore_DP0 pid=956620) AttributeError: 'NoneType' object has no attribute 'result' [repeated 3x across cluster]
Could anyone advise on the root cause and solution? Thanks!
Expected behavior
Training runs successfully.
The same error occurs when running with the verlai/verl:vllm011.2.dev image.
SGLang is not found in that image either.
The problem appears to be caused by an interface change in vLLM: newer versions (0.11.1+) call the executor's collective_rpc with non_block=True and expect a list of Future objects back, on which the engine core then calls .result() (the `model_output = future.result()` line in the traceback). verl's override in vllm_async_server.py did not implement non_block, so the engine ended up calling .result() on None. Updating the collective_rpc function in vllm_async_server.py solves the problem:
```python
# Imports used by this snippet (most should already be present in
# vllm_async_server.py):
import pickle
import threading
from collections.abc import Callable
from concurrent.futures import Future
from typing import Any, Optional

import zmq


def collective_rpc(
    self,
    method: str | Callable,
    timeout: Optional[float] = None,
    args: tuple = (),
    kwargs: Optional[dict[str, Any]] = None,
    non_block: bool = False,
    **kwargs_extra: Any,
) -> list[Any]:
    """Execute RPC call on all workers via ZeroMQ.

    Args:
        method: Method name or callable to execute.
        timeout: Timeout for the operation (currently unused).
        args: Positional arguments.
        kwargs: Keyword arguments.
        non_block: If True, return Future objects for async execution.
            If False (default), block until completion for backward compatibility.
        **kwargs_extra: Additional keyword arguments (for compatibility).

    Returns:
        List of results from all workers, or list of Futures if non_block=True.
    """
    if isinstance(method, str):
        sent_method = method
    else:
        sent_method = pickle.dumps(method)
    del method

    message = pickle.dumps((sent_method, args, kwargs or {}))
    for socket in self.sockets:
        socket.send(message, zmq.DONTWAIT)

    if non_block:
        # For async execution, return Future objects that a background
        # thread resolves once each worker replies.
        futures = []
        for socket in self.sockets:
            future = Future()

            def _recv_async(sock, fut):
                try:
                    output = pickle.loads(sock.recv())
                    if isinstance(output, Exception):
                        fut.set_exception(output)
                    else:
                        fut.set_result(output)
                except Exception as e:
                    fut.set_exception(e)

            # Start a thread to receive the result asynchronously.
            thread = threading.Thread(target=_recv_async, args=(socket, future))
            thread.daemon = True
            thread.start()
            futures.append(future)
        return futures
    else:
        # Blocking execution: maintain backward compatibility with vllm 0.11.0.
        outputs = []
        for socket in self.sockets:
            outputs.append(pickle.loads(socket.recv()))
        for output in outputs:
            if isinstance(output, Exception):
                raise output
        return outputs
```
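For context, here is a minimal sketch of how the engine core exercises this method after the fix. The `executor` and `scheduler_output` names are hypothetical stand-ins inferred from the core.py traceback above, not exact vLLM or verl API:

```python
# Hypothetical illustration of the vLLM >= 0.11.1 call pattern visible in the
# traceback (vllm/v1/engine/core.py, step). Names here are stand-ins.
futures = executor.collective_rpc(
    "execute_model", args=(scheduler_output,), non_block=True
)
model_output = futures[0].result()  # was .result() on None -> AttributeError
```

With non_block=True the patched method returns real Future objects, so .result() blocks until the worker replies instead of raising the AttributeError.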