[trainer] fix: DataProto concat bug when generate_sequences timing is slightly different
What does this PR do?
When DataProto.concat merges a list of data, it asserts that overlapping meta_info keys have identical values. However, we've observed that data['timing']['generate_sequences'] may differ slightly across nodes, even though derived metrics like generation_timing/min, generation_timing/max, and generation_timing/topk_ratio are identical:
```
merged_meta_info[k]:
{'generate_sequences': 73.94637298583984, 'generation_timing/max': 75.57780456542969, 'generation_timing/min': 71.02742767333984, 'generation_timing/topk_ratio': 0.1666666716337204}
v:
{'generate_sequences': 73.94638061523438, 'generation_timing/max': 75.57780456542969, 'generation_timing/min': 71.02742767333984, 'generation_timing/topk_ratio': 0.1666666716337204}
```
We suspect that generate_sequences represents the average generation time across nodes, and minor floating-point differences are expected due to non-deterministic execution or rounding.
To address this, we propose tolerating the difference in generate_sequences when the values agree after round(...), allowing the merge to proceed instead of raising a spurious assertion error.
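A minimal sketch of the idea (hypothetical helper name, not the exact verl implementation): treat two meta_info floats as equal when they are within a tolerance, and fall back to exact equality for everything else.

```python
import math


def meta_values_match(a, b, rel_tol=1e-6):
    """Treat two meta_info values as equal if they are floats within a
    relative tolerance; otherwise require exact equality."""
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


# The two observed per-node timings differ only past the 5th decimal place:
t_node0 = 73.94637298583984
t_node1 = 73.94638061523438
print(meta_values_match(t_node0, t_node1))  # True
print(meta_values_match(1.0, 2.0))          # False
```

The tolerance here is illustrative; the actual value used in the fix is discussed later in the thread.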
This issue has been reported by multiple users, as seen in volcengine/verl#3909.
This PR avoids the following error when running the example script:
```
Traceback (most recent call last):
  File "/home/ma-user/verl/verl/trainer/main_ppo.py", line 42, in main
    run_ppo(config)
  File "/home/ma-user/verl/verl/trainer/main_ppo.py", line 85, in run_ppo
    ray.get(runner.run.remote(config))
  File "/home/ma-user/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ma-user/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ma-user/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ma-user/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::TaskRunner.run() (pid=85558, ip=100.100.135.153, actor_id=b5884daf94341d43888419ed01000000, repr=<main_ppo.TaskRunner object at 0xffcfa94595a0>)
  File "/home/ma-user/verl/verl/trainer/main_ppo.py", line 317, in run
    trainer.fit()
  File "/home/ma-user/verl/verl/trainer/ppo/ray_trainer.py", line 991, in fit
    val_metrics = self._validate()
  File "/home/ma-user/verl/verl/trainer/ppo/ray_trainer.py", line 579, in _validate
    test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded)
  File "/home/ma-user/verl/verl/single_controller/ray/base.py", line 49, in __call__
    output = collect_fn(self, output)
  File "/home/ma-user/verl/verl/single_controller/base/decorator.py", line 303, in collect_lazy_compute_data_proto
    return collect_nd_compute_dataproto(collect_mask, worker_group, *args, **kwargs)
  File "/home/ma-user/verl/verl/single_controller/base/decorator.py", line 269, in collect_nd_compute_dataproto
    return _concat_data_proto_or_future(output)
  File "/home/ma-user/verl/verl/single_controller/base/decorator.py", line 146, in _concat_data_proto_or_future
    return DataProto.concat(output)
  File "/home/ma-user/verl/verl/protocol.py", line 959, in concat
    assert merged_meta_info[k] == v, f"Conflicting values for meta_info key '{k}'"
AssertionError: Conflicting values for meta_info key 'timing'
```
Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
Test
For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use this
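The public DataProto API is unchanged; only the meta_info consistency check in concat is relaxed. As an illustration (hypothetical helper, not verl code), two per-node timing dicts that differ only by float noise now compare as consistent:

```python
import math


def timing_dicts_close(d_a, d_b, rel_tol=0.5):
    """Illustrative only: compare two timing dicts key-by-key, tolerating
    small float drift between nodes (the PR uses rel_tol=0.5)."""
    if d_a.keys() != d_b.keys():
        return False
    return all(math.isclose(d_a[k], d_b[k], rel_tol=rel_tol) for k in d_a)


node0 = {"generate_sequences": 73.94637298583984, "generation_timing/max": 75.57780456542969}
node1 = {"generate_sequences": 73.94638061523438, "generation_timing/max": 75.57780456542969}
print(timing_dicts_close(node0, node1))  # True -> concat proceeds instead of asserting
```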
Design & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [ ] Read the Contribute Guide.
- [ ] Apply pre-commit checks:
`pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update the documentation.
- [ ] Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
Replaced `round()` with `math.isclose(val_a, val_b, rel_tol=0.5)` in commit 566ca6ce724684726ffaa4f0a6617ddc3cede79e.
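For context, round() can disagree for two nearly identical values that straddle a rounding boundary, which math.isclose avoids; a small sketch:

```python
import math

# Nearly identical values on either side of the 0.5 rounding boundary:
a, b = 0.49999999, 0.50000001

print(round(a), round(b))               # 0 1 -> a round()-based check would flag a conflict
print(math.isclose(a, b, rel_tol=0.5))  # True -> an isclose-based check accepts them
```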
@FightingZhen @ji-huazhong please review this PR
@jingshenghang the pre-commit check failed
passed, now ^_^
@quancs @ji-huazhong Hi, I have re-committed with corrected author information. Please help review, thanks.
@quancs @ji-huazhong Hi, the CI failure doesn't seem to have been introduced by this PR. Could you help review and re-trigger the CI?
Hi @FightingZhen @ji-huazhong @quancs, all CI checks have passed. Please help review and merge, thanks.