verl

Does the current framework support resuming training from a checkpoint? I don't seem to see any options for restarting the training (such as logging data usage, optimizer state, etc.).

Open HCHCXY opened this issue 10 months ago • 7 comments

HCHCXY avatar Feb 14 '25 01:02 HCHCXY

https://github.com/volcengine/verl/blob/62e23aee0b4e2c36f04a5b95fcd9f0a4eb724ee2/verl/trainer/ppo/ray_trainer.py#L751

This seems to implement resuming. However, I'm not sure whether it loads optimizer states, although the related PR does mention them: https://github.com/volcengine/verl/pull/222

TonyLianLong avatar Feb 14 '25 04:02 TonyLianLong

Thanks for the reminder!

HCHCXY avatar Feb 14 '25 04:02 HCHCXY

I ran into an issue when resuming from a path:

ray.exceptions.RayTaskError(AttributeError): ray::main_task() (pid=7233, ip=172.16.17.3)
  File "/home/pj24002027/ku40001342/code/CP-Zero/verl/trainer/main_ppo.py", line 261, in main_task
    trainer.fit()
  File "/home/pj24002027/ku40001342/code/CP-Zero/verl/trainer/ppo/ray_trainer.py", line 719, in fit
    self._load_checkpoint()
  File "/home/pj24002027/ku40001342/code/CP-Zero/verl/trainer/ppo/ray_trainer.py", line 673, in _load_checkpoint
    self.actor_rollout_wg.load_checkpoint(actor_path)
AttributeError: 'RayWorkerGroup' object has no attribute 'load_checkpoint'

I'm not sure whether my training script is correct. I set:

trainer.resume_mode=global_step_2800
trainer.resume_from_path=$CHECKPOINT_PATH

There is no "latest" path, so I just specified the latest step, 2800. The error above occurs when I try to resume from the checkpoint path.
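When no "latest" marker exists, the newest step can be found by scanning the checkpoint directory. The following is a minimal sketch, not part of verl: the helper name `find_latest_step_dir` is hypothetical, and the `global_step_<N>` directory naming is an assumption inferred from the value used in the comment above.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: one subdirectory per saved step, e.g. global_step_2800.
_STEP_DIR = re.compile(r"^global_step_(\d+)$")


def find_latest_step_dir(checkpoint_root: str) -> Optional[Path]:
    """Return the global_step_<N> subdirectory with the largest N, or None."""
    root = Path(checkpoint_root)
    if not root.is_dir():
        return None

    best_step, best_dir = -1, None
    for child in root.iterdir():
        match = _STEP_DIR.match(child.name)
        # Ignore files and directories that do not follow the assumed pattern.
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_dir = int(match.group(1)), child
    return best_dir
```

The returned path could then be passed as the value of `trainer.resume_from_path`. Which values `trainer.resume_mode` accepts should be checked against the checkpoint documentation for the installed version.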

physicsru avatar Feb 15 '25 09:02 physicsru

@physicsru It seems that you didn't fetch the latest main branch? The load_checkpoint func can be found in fsdp_workers.py

PeterSH6 avatar Feb 15 '25 09:02 PeterSH6

> @physicsru It seems that you didn't fetch the latest main branch? The load_checkpoint func can be found in fsdp_workers.py

Thank you so much!

physicsru avatar Feb 15 '25 10:02 physicsru

> @physicsru It seems that you didn't fetch the latest main branch? The load_checkpoint func can be found in fsdp_workers.py

I checked fsdp_checkpoint_manager.py, and the class there does have a load_checkpoint function.

However, the type of self.actor_rollout_wg is <verl.single_controller.ray.base.RayWorkerGroup>.

Could you help me figure this out when you are available?
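One way to test the "stale checkout" explanation is to confirm which file Python actually imports and whether a given class in it defines the method. This is a generic diagnostic sketch using only the standard library; the helper names `module_origin` and `class_has_method` are hypothetical and not part of verl.

```python
import importlib
import importlib.util
from typing import Optional


def module_origin(module_name: str) -> Optional[str]:
    """Return the file a module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the spec is unavailable.
        return None
    return spec.origin if spec is not None else None


def class_has_method(module_name: str, class_name: str, method_name: str) -> bool:
    """Import a module and report whether the named class defines the method."""
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name, None)
    return cls is not None and callable(getattr(cls, method_name, None))
```

For example, calling `module_origin` on the module that contains fsdp_workers.py (its dotted import path depends on your checkout) shows whether the imported file lives in the updated clone or in an older installed copy.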

physicsru avatar Feb 15 '25 10:02 physicsru

Optimizer state is resumed in this PR: https://github.com/volcengine/verl/pull/216/files. @physicsru, what's the issue you're running into?
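To illustrate why restoring optimizer state matters when resuming, here is a toy sketch in plain Python. It is not verl's checkpoint code: it runs SGD with momentum on the loss f(w) = w², saves a checkpoint halfway, and compares resuming with and without the saved momentum buffer. All names and hyperparameters are illustrative assumptions.

```python
import json


def sgd_momentum_step(w, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum step on the loss f(w) = w**2 (gradient 2*w)."""
    grad = 2.0 * w
    velocity = momentum * velocity + grad
    return w - lr * velocity, velocity


def train(w, velocity, steps):
    """Run the given number of steps and return the final (w, velocity)."""
    for _ in range(steps):
        w, velocity = sgd_momentum_step(w, velocity)
    return w, velocity


# Uninterrupted run: 10 steps from w = 1.0 with an empty momentum buffer.
w_full, _ = train(1.0, 0.0, 10)

# Interrupted run: checkpoint after 5 steps, including the optimizer state.
w_mid, v_mid = train(1.0, 0.0, 5)
checkpoint = json.dumps({"step": 5, "w": w_mid, "velocity": v_mid})

# Resume with the optimizer state restored: matches the uninterrupted run.
state = json.loads(checkpoint)
w_resumed, _ = train(state["w"], state["velocity"], 10 - state["step"])

# Resume with weights only (momentum reset to zero): the trajectory diverges.
w_weights_only, _ = train(state["w"], 0.0, 10 - state["step"])

print(w_full == w_resumed, w_full == w_weights_only)
```

The same reasoning applies to optimizers such as Adam, whose moment estimates would otherwise restart from zero after a resume.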

eric-haibin-lin avatar Feb 23 '25 23:02 eric-haibin-lin

https://verl.readthedocs.io/en/latest/advance/checkpoint.html

eric-haibin-lin avatar Apr 06 '25 19:04 eric-haibin-lin