[trainer] fix: PPO reward model resource pool and worker creation with reward loop
What does this PR do?
This PR fixes a bug in the main_ppo entrypoint regarding reward model initialization when using the reward loop.
In the current main branch, the reward model resource pool and workers are assigned the wrong worker classes (i.e., the non-reward-loop ones). The reward loop then assumes the workers are of the rollout-replica class, causing a crash.
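The failure mode can be sketched in plain Python, without Ray or verl: the reward loop calls a rollout-replica method on whatever worker it is handed, so a worker of the wrong class raises `AttributeError`. The class names below are illustrative stand-ins; `get_zeromq_address` is the real method named in the crash log later in this thread.

```python
# Plain-Python sketch of the bug (no Ray, no verl). Class names are
# illustrative; only get_zeromq_address matches the actual crash log.

class FSDPRewardModelWorker:
    """Stand-in for the non-reward-loop worker class main_ppo wires up."""
    def compute_rm_score(self, batch):
        return [0.0 for _ in batch]

class AsyncRolloutWorker:
    """Stand-in for the rollout-replica worker class the reward loop expects."""
    def get_zeromq_address(self):
        return "tcp://127.0.0.1:5555"

def launch_server(worker):
    # The reward loop assumes the rollout-replica interface.
    return worker.get_zeromq_address()

print(launch_server(AsyncRolloutWorker()))   # works with the right class
try:
    launch_server(FSDPRewardModelWorker())   # wrong class wired up by main_ppo
except AttributeError as exc:
    print(f"crash: {exc}")
```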
Checklist Before Starting
- [x] Search for similar PRs. Paste at least one query link here: link
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
Test
Tested by running main_ppo with `config.actor_rollout_ref.rollout.mode="async"`, `config.reward_model.enable=True`, and `config.reward_model.enable_resource_pool=True`. The runs fail to start on main but succeed on this branch.
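For reference, a reproduction would look roughly like the following Hydra-style invocation. Only the three overrides above are taken from the test description; the entrypoint module path and everything else are assumptions.

```shell
# Hypothetical invocation sketch; only the three overrides are from the
# test description, the module path and any omitted flags are assumed.
python -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.mode=async \
    reward_model.enable=True \
    reward_model.enable_resource_pool=True
```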
API and Usage Example
No API changes.
Design & Code Changes
The PR stops main_ppo from creating the reward model resource pool and assigning the reward model worker classes, so that both steps can be performed later, during reward loop initialization.
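A minimal sketch of the control flow described above; the function and config key names here are hypothetical, not verl's actual API. The entrypoint only wires up the legacy reward model worker when the reward loop will not perform its own initialization later.

```python
# Hypothetical sketch of the deferral logic; names do not match verl's API.

def maybe_add_reward_model_workers(config: dict, role_worker_mapping: dict) -> dict:
    uses_reward_loop = (
        config.get("rollout_mode") == "async" and config.get("reward_model_enable")
    )
    if uses_reward_loop:
        # Defer: reward loop initialization will create the resource pool
        # and pick the rollout-replica worker class itself.
        return role_worker_mapping
    # Legacy path: wire up the FSDP/Megatron reward model worker here.
    role_worker_mapping["reward_model"] = "RewardModelWorker"
    return role_worker_mapping
```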
Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [X] Read the Contribute Guide.
- [X] Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [X] Add / Update the documentation.
- [X] Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
The current implementation of the reward loop creates rollout servers colocated with the resources of `worker_group`, without invoking any methods of `worker_group` itself, so it should not cause errors. Could you provide some crash details or scripts for further investigation?
This is a compromise implementation intended to maintain compatibility with the previous FSDP/Megatron reward model setup. However, it will be removed in future versions, as it is not entirely safe. The ideal approach would be to initialize reward loop models with a `resource_pool` instead of a `worker_group`.
> without invoking any methods of worker_group itself
This doesn't seem to be the case for vLLM rollouts, as shown in this line.
> Could you provide some crash details or scripts for further investigation?
I will try to provide a script to reproduce this later, but here is a crash log in the meantime (you can see that it happens at the line linked above):
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/verl/trainer/main_ppo.py", line 42, in main
    run_ppo(config)
  File "/usr/local/lib/python3.10/dist-packages/verl/trainer/main_ppo.py", line 96, in run_ppo
    ray.get(runner.run.remote(config))
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::TaskRunner.run() (pid=[redacted], ip=[redacted], actor_id=591aa4c7af47d5e5b202b79102000000, repr=<main_ppo.TaskRunner object at 0x7f443091b280>)
  File "/usr/local/lib/python3.10/dist-packages/verl/trainer/main_ppo.py", line 338, in run
    trainer.init_workers()
  File "/usr/local/lib/python3.10/dist-packages/verl/trainer/ppo/ray_trainer.py", line 774, in init_workers
    self.async_rollout_manager = AgentLoopManager(
  File "/usr/local/lib/python3.10/dist-packages/verl/experimental/agent_loop/agent_loop.py", line 674, in __init__
    self.reward_model_manager = RewardModelManager(config.reward_model, rm_wg)
  File "/usr/local/lib/python3.10/dist-packages/verl/experimental/reward/reward_model.py", line 45, in __init__
    self._initialize_llm_servers()
  File "/usr/local/lib/python3.10/dist-packages/verl/experimental/reward/reward_model.py", line 78, in _initialize_llm_servers
    self._run_all([server.init_colocated(self.worker_group) for server in self.rollout_replicas])
  File "/usr/local/lib/python3.10/dist-packages/verl/experimental/reward/reward_model.py", line 109, in _run_all
    return asyncio.run(run_all())
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/verl/experimental/reward/reward_model.py", line 107, in run_all
    return await asyncio.gather(*tasks)
  File "/usr/local/lib/python3.10/dist-packages/verl/workers/rollout/replica.py", line 133, in init_colocated
    await self.launch_servers()
  File "/usr/local/lib/python3.10/dist-packages/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 530, in launch_servers
    await asyncio.gather(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(AttributeError): ray::vLLMHttpServer.launch_server() (pid=[redacted], ip=[redacted], actor_id=f3f6cfc29255eea5d4e81e0502000000, repr=<verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMHttpServer object at 0x7f05ebe78460>)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/dist-packages/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 300, in launch_server
    zmq_addresses = ray.get([worker.get_zeromq_address.remote() for worker in self.workers])
  File "/usr/local/lib/python3.10/dist-packages/verl/workers/rollout/vllm_rollout/vllm_async_server.py", line 300, in <listcomp>
    zmq_addresses = ray.get([worker.get_zeromq_address.remote() for worker in self.workers])
  File "/usr/local/lib/python3.10/dist-packages/ray/actor.py", line 1562, in __getattr__
    raise AttributeError(
AttributeError: 'ActorHandle' object has no attribute 'get_zeromq_address'
```
(The workers are initialized as the previous FSDP reward model workers rather than `vLLMAsyncRollout`.)
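For context on the final frame: Ray's `ActorHandle.__getattr__` raises `AttributeError` when the requested remote method does not exist on the wrapped worker class. A toy mimic (not Ray's actual implementation) of that behavior:

```python
# Toy mimic (NOT Ray's real code) of the last traceback frame: the handle
# itself raises AttributeError for methods absent from the worker class.

class ToyActorHandle:
    def __init__(self, method_names):
        self._methods = set(method_names)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name not in self._methods:
            raise AttributeError(
                f"'ActorHandle' object has no attribute '{name}'"
            )
        return lambda: f"remote call: {name}"

# A handle wrapping the FSDP reward worker exposes only that worker's
# methods, so the reward loop's get_zeromq_address lookup fails.
rm_handle = ToyActorHandle({"compute_rm_score"})
```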
Got it. The vLLM replica does call methods from the worker class; this was missed in the previous CI tests.