dapo reward_manager
System Info
In the new version of the code, the reward is computed as part of sequence generation, which causes the overlong_buffer_cfg logic (the length penalty) to be skipped.
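For context, the length penalty being skipped is DAPO's soft overlong punishment, which the reward_model.overlong_buffer.* options in the script below are meant to control. The following is a minimal illustrative sketch of that penalty (following the DAPO paper's formulation), not verl's actual implementation; the function name, signature, and the numbers in the example are assumptions.

```python
# Sketch (not verl's code) of the soft overlong penalty gated by
# reward_model.overlong_buffer.enable / .len / .penalty_factor.

def overlong_penalty(response_len: int,
                     max_response_len: int,
                     overlong_buffer_len: int,
                     penalty_factor: float) -> float:
    """Return a non-positive penalty to add to the task reward.

    Responses shorter than (max_response_len - overlong_buffer_len) are not
    penalized; inside the buffer the penalty grows linearly, reaching
    -penalty_factor at max_response_len.
    """
    expected_len = max_response_len - overlong_buffer_len
    exceed_len = response_len - expected_len
    if exceed_len <= 0:
        return 0.0
    return -min(exceed_len / overlong_buffer_len, 1.0) * penalty_factor


# Illustrative numbers: max_response_length=20480, overlong_buffer.len=4096,
# penalty_factor=1.0. A 19456-token response exceeds the expected length
# (16384) by 3072 tokens, so the penalty is -3072 / 4096 = -0.75.
print(overlong_penalty(19456, 20480, 4096, 1.0))
```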
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
ray job submit --runtime-env="${RUNTIME_ENV}" \
--working-dir "${WORKING_DIR}" \
-- python3 -m recipe.dapo.main_dapo \
data.train_files="${TRAIN_FILE}" \
data.val_files="${TEST_FILE}" \
data.prompt_key=prompt \
data.truncation='left' \
data.seed=42 \
data.max_prompt_length=${max_prompt_length} \
data.max_response_length=${max_response_length} \
data.train_batch_size=${train_prompt_bsz} \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.rollout.dtype=${dtype} \
+actor_rollout_ref.actor.fsdp_config.mixed_precision.param_dtype=${dtype} \
actor_rollout_ref.actor.fsdp_config.dtype=${dtype} \
actor_rollout_ref.ref.fsdp_config.dtype=${dtype} \
algorithm.adv_estimator=${adv_estimator} \
algorithm.use_kl_in_reward=${use_kl_in_reward} \
algorithm.kl_ctrl.kl_coef=${kl_coef} \
actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
actor_rollout_ref.actor.clip_ratio_c=10.0 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
algorithm.filter_groups.enable=${enable_filter_groups} \
algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
algorithm.filter_groups.metric=${filter_groups_metric} \
algorithm.rollout_correction.rollout_is=${rollout_is} \
algorithm.rollout_correction.rollout_is_threshold=${rollout_is_threshold} \
algorithm.filter_groups.max_num_gen_batches=10 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.grad_clip=1.0 \
actor_rollout_ref.nccl_timeout=14400 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
actor_rollout_ref.rollout.enable_chunked_prefill=True \
actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
actor_rollout_ref.rollout.temperature=${temperature} \
actor_rollout_ref.rollout.top_p=${top_p} \
actor_rollout_ref.rollout.top_k="${top_k}" \
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.do_sample=True \
actor_rollout_ref.rollout.val_kwargs.n=32 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
reward_model.reward_manager=dapo \
reward_model.format_reward_cfg.enable=True \
reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
reward_model.overlong_buffer.len=${overlong_buffer_len} \
reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
trainer.logger='["console","swanlab"]' \
trainer.project_name="${project_name}" \
trainer.experiment_name="${experiment_name}" \
trainer.n_gpus_per_node=8 \
trainer.nnodes="${NNODES}" \
trainer.val_before_train=False \
trainer.test_freq=10 \
trainer.save_freq=40 \
trainer.max_actor_ckpt_to_keep=5 \
trainer.log_val_generations=10 \
trainer.total_epochs=10 \
trainer.default_local_dir="${CKPTS_DIR}" \
trainer.rollout_data_dir=checkpoints/RL/data/$experiment_name \
trainer.resume_mode=disable \
$@
Expected behavior
The length penalty (overlong buffer) code should be executed.
This part has been moved to verl/experimental/reward/reward_loop/dapo.py.
Thank you for your response. However, I don't understand why verl moved this part of the code to that specific path. Is the code under the "recipe" path no longer needed?
To correct a potential misunderstanding: the implementation of the DAPO reward manager has always remained under the verl/ directory, i.e., verl/workers/reward_manager/dapo.py.
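One quick way to see which of the two module paths mentioned in this thread exists in your installed verl version is to try importing both. This is just a sanity-check sketch; it assumes nothing about verl beyond the two paths quoted above.

```python
# Check which of the module paths mentioned in this thread are importable
# in the installed verl version.
import importlib

for mod in (
    "verl.workers.reward_manager.dapo",           # path from the comment above
    "verl.experimental.reward.reward_loop.dapo",  # path from the earlier comment
):
    try:
        importlib.import_module(mod)
        print(f"found:   {mod}")
    except ImportError:
        print(f"missing: {mod}")
```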
OK, thanks.