verl icon indicating copy to clipboard operation
verl copied to clipboard

[trainer] fix: DataProto concat bug when timing generate_sequences is sightly different

Open jingshenghang opened this issue 3 months ago • 6 comments

What does this PR do?

During DataProto concat a list of data, it should ensure consistency for overlapping non-matric keys. However, we've observed that data['timing']['generate_sequences'] may differ slightly across nodes, even though derived metrics like generation_timing/min, generation_timing/max, and generation_timing/topk_ratio are identical.

merged_meta_info[k]
{'generate_sequences': 73.94637298583984, 'generation_timing/max': 75.57780456542969, 'generation_timing/min': 71.02742767333984, 'generation_timing/topk_ratio': 0.1666666716337204}

v
{'generate_sequences': 73.94638061523438, 'generation_timing/max': 75.57780456542969, 'generation_timing/min': 71.02742767333984, 'generation_timing/topk_ratio': 0.1666666716337204}

We suspect that generate_sequences represents the average generation time across nodes, and minor floating-point differences are expected due to non-deterministic execution or rounding.

To address this, we propose ignoring the difference in generate_sequences when round(...) values are the same, allowing the merge to proceed without raising unnecessary inconsistencies.

This issue has been reported by multiple users, as seen in volcengine/verl#3909.

This PR can avoid such error when running example script:

Traceback (most recent call last):
  File "/home/ma-user/verl/verl/trainer/main_ppo.py", line 42, in main
    run_ppo(config)
  File "/home/ma-user/verl/verl/trainer/main_ppo.py", line 85, in run_ppo
    ray.get(runner.run.remote(config))
  File "/home/ma-user/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ma-user/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ma-user/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ma-user/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::TaskRunner.run() (pid=85558, ip=100.100.135.153, actor_id=b5884daf94341d43888419ed01000000, repr=<main_ppo.TaskRunner object at 0xffcfa94595a0>)
  File "/home/ma-user/verl/verl/trainer/main_ppo.py", line 317, in run
    trainer.fit()
  File "/home/ma-user/verl/verl/trainer/ppo/ray_trainer.py", line 991, in fit
    val_metrics = self._validate()
  File "/home/ma-user/verl/verl/trainer/ppo/ray_trainer.py", line 579, in _validate
    test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded)
  File "/home/ma-user/verl/verl/single_controller/ray/base.py", line 49, in __call__
    output = collect_fn(self, output)
  File "/home/ma-user/verl/verl/single_controller/base/decorator.py", line 303, in collect_lazy_compute_data_proto
    return collect_nd_compute_dataproto(collect_mask, worker_group, *args, **kwargs)
  File "/home/ma-user/verl/verl/single_controller/base/decorator.py", line 269, in collect_nd_compute_dataproto
    return _concat_data_proto_or_future(output)
  File "/home/ma-user/verl/verl/single_controller/base/decorator.py", line 146, in _concat_data_proto_or_future
    return DataProto.concat(output)
  File "/home/ma-user/verl/verl/protocol.py", line 959, in concat
    assert merged_meta_info[k] == v, f"Conflicting values for meta_info key '{k}'"
AssertionError: Conflicting values for meta_info key 'timing'

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • [ ] Search for similar PRs. Paste at least one query link here: ...
  • [ ] Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

jingshenghang avatar Nov 06 '25 02:11 jingshenghang

Replace round() with math.isclose(val_a, val_b, rel_tol=0.5) in commit 566ca6ce724684726ffaa4f0a6617ddc3cede79e

jingshenghang avatar Nov 06 '25 03:11 jingshenghang

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Nov 21 '25 01:11 CLAassistant

@FightingZhen @ji-huazhong please review this PR

quancs avatar Nov 21 '25 01:11 quancs

@jingshenghang the pre-commit check failed

ji-huazhong avatar Nov 21 '25 02:11 ji-huazhong

@jingshenghang the pre-commit check failed

passed, now ^_^

quancs avatar Nov 23 '25 11:11 quancs

@quancs @ji-huazhong Hi, I have re-committed and modified author information. Please help review, thanks.

jingshenghang avatar Nov 24 '25 02:11 jingshenghang

@quancs @ji-huazhong Hi, The CI failure doesn't seem to have been introduced in this PR. Could you help review and re-trigger the CI?

jingshenghang avatar Dec 09 '25 02:12 jingshenghang

Hi @FightingZhen @ji-huazhong @quancs , All CI has passed, please help review and merge, thanks.

jingshenghang avatar Dec 10 '25 07:12 jingshenghang