dapo reward_manager
System Info
In the new version of the code, the reward is computed as part of sequence generation, which causes the overlong_buffer_cfg logic (the length penalty) to be skipped.
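For context, the length penalty being skipped is DAPO's soft overlong punishment, which the reward_model.overlong_buffer.* options in the script below are meant to control. The following is a minimal illustrative sketch of that penalty (following the DAPO paper's formulation), not verl's actual implementation; the function name, signature, and the numbers in the example are assumptions.

```python
# Sketch (not verl's code) of the soft overlong penalty gated by
# reward_model.overlong_buffer.enable / .len / .penalty_factor.

def overlong_penalty(response_len: int,
                     max_response_len: int,
                     overlong_buffer_len: int,
                     penalty_factor: float) -> float:
    """Return a non-positive penalty to add to the task reward.

    Responses shorter than (max_response_len - overlong_buffer_len) are not
    penalized; inside the buffer the penalty grows linearly, reaching
    -penalty_factor at max_response_len.
    """
    expected_len = max_response_len - overlong_buffer_len
    exceed_len = response_len - expected_len
    if exceed_len <= 0:
        return 0.0
    return -min(exceed_len / overlong_buffer_len, 1.0) * penalty_factor


# Illustrative numbers: max_response_length=20480, overlong_buffer.len=4096,
# penalty_factor=1.0. A 19456-token response exceeds the expected length
# (16384) by 3072 tokens, so the penalty is -3072 / 4096 = -0.75.
print(overlong_penalty(19456, 20480, 4096, 1.0))
```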
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
ray job submit --runtime-env="${RUNTIME_ENV}" \
--working-dir "${WORKING_DIR}" \
-- python3 -m recipe.dapo.main_dapo \
data.train_files="${TRAIN_FILE}" \
data.val_files="${TEST_FILE}" \
data.prompt_key=prompt \
data.truncation='left' \
data.seed=42 \
data.max_prompt_length=${max_prompt_length} \
data.max_response_length=${max_response_length} \
data.train_batch_size=${train_prompt_bsz} \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.rollout.dtype=${dtype} \
+actor_rollout_ref.actor.fsdp_config.mixed_precision.param_dtype=${dtype} \
actor_rollout_ref.actor.fsdp_config.dtype=${dtype} \
actor_rollout_ref.ref.fsdp_config.dtype=${dtype} \
algorithm.adv_estimator=${adv_estimator} \
algorithm.use_kl_in_reward=${use_kl_in_reward} \
algorithm.kl_ctrl.kl_coef=${kl_coef} \
actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
actor_rollout_ref.actor.clip_ratio_c=10.0 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
algorithm.filter_groups.enable=${enable_filter_groups} \
algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
algorithm.filter_groups.metric=${filter_groups_metric} \
algorithm.rollout_correction.rollout_is=${rollout_is} \
algorithm.rollout_correction.rollout_is_threshold=${rollout_is_threshold} \
algorithm.filter_groups.max_num_gen_batches=10 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.grad_clip=1.0 \
actor_rollout_ref.nccl_timeout=14400 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
actor_rollout_ref.rollout.enable_chunked_prefill=True \
actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
actor_rollout_ref.rollout.temperature=${temperature} \
actor_rollout_ref.rollout.top_p=${top_p} \
actor_rollout_ref.rollout.top_k="${top_k}" \
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.do_sample=True \
actor_rollout_ref.rollout.val_kwargs.n=32 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
reward_model.reward_manager=dapo \
reward_model.format_reward_cfg.enable=True \
reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
reward_model.overlong_buffer.len=${overlong_buffer_len} \
reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
trainer.logger='["console","swanlab"]' \
trainer.project_name="${project_name}" \
trainer.experiment_name="${experiment_name}" \
trainer.n_gpus_per_node=8 \
trainer.nnodes="${NNODES}" \
trainer.val_before_train=False \
trainer.test_freq=10 \
trainer.save_freq=40 \
trainer.max_actor_ckpt_to_keep=5 \
trainer.log_val_generations=10 \
trainer.total_epochs=10 \
trainer.default_local_dir="${CKPTS_DIR}" \
trainer.rollout_data_dir=checkpoints/RL/data/$experiment_name \
trainer.resume_mode=disable \
$@
Expected behavior
The length penalty (overlong buffer) code should be executed.
This part has been moved to verl/experimental/reward/reward_loop/dapo.py.
Thank you for your response. However, I don't understand why verl moved this part of the code to that specific path. Is the code under the "recipe" path no longer needed?
To correct a potential misunderstanding: the implementation of the DAPO reward manager has always remained under the verl/ directory, i.e., verl/workers/reward_manager/dapo.py.
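One quick way to see which of the two module paths mentioned in this thread exists in your installed verl version is to try importing both. This is just a sanity-check sketch; it assumes nothing about verl beyond the two paths quoted above.

```python
# Check which of the module paths mentioned in this thread are importable
# in the installed verl version.
import importlib

for mod in (
    "verl.workers.reward_manager.dapo",           # path from the comment above
    "verl.experimental.reward.reward_loop.dapo",  # path from the earlier comment
):
    try:
        importlib.import_module(mod)
        print(f"found:   {mod}")
    except ImportError:
        print(f"missing: {mod}")
```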
OK, thanks.