[trainer] fix: PPO reward model resource pool and worker creation with reward loop
What does this PR do?
This PR fixes a bug in the main_ppo entrypoint regarding reward model initialization when using the reward loop.
In the current main branch, the reward model resource pool and workers are assigned the wrong worker classes (i.e., the non-reward-loop ones). The reward loop then assumes the workers are of the rollout-replica class, causing a crash.
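The failure mode can be sketched in plain Python, without Ray or verl: the reward loop calls a rollout-replica method on whatever worker it is handed, so a worker of the wrong class raises `AttributeError`. The class names below are illustrative stand-ins; `get_zeromq_address` is the real method named in the crash log later in this thread.

```python
# Plain-Python sketch of the bug (no Ray, no verl). Class names are
# illustrative; only get_zeromq_address matches the actual crash log.

class FSDPRewardModelWorker:
    """Stand-in for the non-reward-loop worker class main_ppo wires up."""
    def compute_rm_score(self, batch):
        return [0.0 for _ in batch]

class AsyncRolloutWorker:
    """Stand-in for the rollout-replica worker class the reward loop expects."""
    def get_zeromq_address(self):
        return "tcp://127.0.0.1:5555"

def launch_server(worker):
    # The reward loop assumes the rollout-replica interface.
    return worker.get_zeromq_address()

print(launch_server(AsyncRolloutWorker()))   # works with the right class
try:
    launch_server(FSDPRewardModelWorker())   # wrong class wired up by main_ppo
except AttributeError as exc:
    print(f"crash: {exc}")
```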
Checklist Before Starting
- [x] Search for similar PRs. Paste at least one query link here: link
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
Test
Tested by running main_ppo with `config.actor_rollout_ref.rollout.mode="async"`, `config.reward_model.enable=True`, and `config.reward_model.enable_resource_pool=True`. The runs fail to start on main but succeed on this branch.
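For reference, a reproduction would look roughly like the following Hydra-style invocation. Only the three overrides above are taken from the test description; the entrypoint module path and everything else are assumptions.

```shell
# Hypothetical invocation sketch; only the three overrides are from the
# test description, the module path and any omitted flags are assumed.
python -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.mode=async \
    reward_model.enable=True \
    reward_model.enable_resource_pool=True
```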
API and Usage Example
No API changes.
Design & Code Changes
The PR stops main_ppo from creating the reward model resource pool and assigning the reward model worker classes, so that both steps can be performed later, during reward loop initialization.
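A minimal sketch of the control flow described above; the function and config key names here are hypothetical, not verl's actual API. The entrypoint only wires up the legacy reward model worker when the reward loop will not perform its own initialization later.

```python
# Hypothetical sketch of the deferral logic; names do not match verl's API.

def maybe_add_reward_model_workers(config: dict, role_worker_mapping: dict) -> dict:
    uses_reward_loop = (
        config.get("rollout_mode") == "async" and config.get("reward_model_enable")
    )
    if uses_reward_loop:
        # Defer: reward loop initialization will create the resource pool
        # and pick the rollout-replica worker class itself.
        return role_worker_mapping
    # Legacy path: wire up the FSDP/Megatron reward model worker here.
    role_worker_mapping["reward_model"] = "RewardModelWorker"
    return role_worker_mapping
```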
Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [X] Read the Contribute Guide.
- [X] Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [X] Add / Update the documentation.
- [X] Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
The current implementation of the reward loop creates rollout servers colocated with the resources of `worker_group`, without invoking any methods of `worker_group` itself, so it should not cause errors. Could you provide some crash details or scripts for further investigation?
This is a compromise implementation intended to maintain compatibility with the previous FSDP/Megatron reward model setup. However, it will be removed in future versions, as it is not entirely safe. The ideal approach would be to initialize reward loop models with a `resource_pool` instead of a `worker_group`.
> without invoking any methods of worker_group itself
This doesn't seem to be the case for vLLM rollouts, as shown in this line.
> Could you provide some crash details or scripts for further investigation?
I will try to provide a script to reproduce this later, but here is a crash log in the meantime (you can see that it happens at the line linked above):
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/verl/trainer/main_ppo.py", line 42, in main
    run_ppo(config)
  File "/usr/local/lib/python3.10/dist-packages/verl/trainer/main_ppo.py", line 96, in run_ppo
    ray.get(runner.run.remote(config))
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::TaskRunner.run() (pid=[redacted], ip=[redacted], actor_id=591aa4c7af47d5e5b202b79102000000, repr=<main_ppo.TaskRunner object at 0x7f443091b280>)
  File "/usr/local/lib/python3.10/dist-packages/verl/trainer/main_ppo.py", line 338, in run
    trainer.init_workers()
  File "/usr/local/lib/python3.10/dist-packages/verl/trainer/ppo/ray_trainer.py", line 774, in init_workers
    self.async_rollout_manager = AgentLoopManager(
  File "/usr/local/lib/python3.10/dist-packages/verl/experimental/agent_loop/agent_loop.py", line 674, in __init__
    self.reward_model_manager = RewardModelManager(config.reward_model, rm_wg)
  File "/usr/local/lib/python3.10/dist-packages/verl/experimental/reward/reward_model.py", line 45, in __init__
    self._initialize_llm_servers()
  File "/usr/local/lib/python3.10/dist-packages/verl/experimental/reward/reward_model.py", line 78, in _initialize_llm_servers
    self._run_all([server.init_colocated(self.worker_group) for server in self.rollout_replicas])
  File "/usr/local/lib/python3.10/dist-packages/verl/experimental/reward/reward_model.py", line 109, in _run_all
    return asyncio.run(run_all())
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/verl/experimental/reward/reward_model.py", line 107, in run_all
    return await asyncio.gather(*tasks)
  File "/usr/local/lib/python3.10/dist-packages/verl/workers/rollout/replica.py", line 133, in init_colocated
    await self.launch_servers()
  File "/usr/local/lib/python3.10/dist-packages/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 530, in launch_servers
    await asyncio.gather(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(AttributeError): ray::vLLMHttpServer.launch_server() (pid=[redacted], ip=[redacted], actor_id=f3f6cfc29255eea5d4e81e0502000000, repr=<verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMHttpServer object at 0x7f05ebe78460>)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/dist-packages/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 300, in launch_server
    zmq_addresses = ray.get([worker.get_zeromq_address.remote() for worker in self.workers])
  File "/usr/local/lib/python3.10/dist-packages/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 300, in <listcomp>
    zmq_addresses = ray.get([worker.get_zeromq_address.remote() for worker in self.workers])
  File "/usr/local/lib/python3.10/dist-packages/ray/actor.py", line 1562, in __getattr__
    raise AttributeError(
AttributeError: 'ActorHandle' object has no attribute 'get_zeromq_address'
```
(The workers are initialized as the previous FSDP reward model workers rather than `vLLMAsyncRollout`.)
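For context on the final frame: Ray's `ActorHandle.__getattr__` raises `AttributeError` when the requested remote method does not exist on the wrapped worker class. A toy mimic (not Ray's actual implementation) of that behavior:

```python
# Toy mimic (NOT Ray's real code) of the last traceback frame: the handle
# itself raises AttributeError for methods absent from the worker class.

class ToyActorHandle:
    def __init__(self, method_names):
        self._methods = set(method_names)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name not in self._methods:
            raise AttributeError(
                f"'ActorHandle' object has no attribute '{name}'"
            )
        return lambda: f"remote call: {name}"

# A handle wrapping the FSDP reward worker exposes only that worker's
# methods, so the reward loop's get_zeromq_address lookup fails.
rm_handle = ToyActorHandle({"compute_rm_score"})
```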
Got it. The vLLM replica does call methods from the worker class; this was missed in the previous CI tests.