Does the current framework support resuming training from a checkpoint? I don't seem to see any options for restarting the training (such as logging data usage, optimizer state, etc.).
https://github.com/volcengine/verl/blob/62e23aee0b4e2c36f04a5b95fcd9f0a4eb724ee2/verl/trainer/ppo/ray_trainer.py#L751
This seems to implement resuming. However, I'm not sure whether it loads optimizer states. The PR does mention optimizer states: https://github.com/volcengine/verl/pull/222
Thanks for the reminder!
I ran into an issue when resuming from a path:
ray.exceptions.RayTaskError(AttributeError): ray::main_task() (pid=7233, ip=172.16.17.3)
  File "/home/pj24002027/ku40001342/code/CP-Zero/verl/trainer/main_ppo.py", line 261, in main_task
    trainer.fit()
  File "/home/pj24002027/ku40001342/code/CP-Zero/verl/trainer/ppo/ray_trainer.py", line 719, in fit
    self._load_checkpoint()
  File "/home/pj24002027/ku40001342/code/CP-Zero/verl/trainer/ppo/ray_trainer.py", line 673, in _load_checkpoint
    self.actor_rollout_wg.load_checkpoint(actor_path)
AttributeError: 'RayWorkerGroup' object has no attribute 'load_checkpoint'
I'm not sure whether my training script is correct. These are the overrides I pass:
trainer.resume_mode=global_step_2800
trainer.resume_from_path=$CHECKPOINT_PATH
There is no latest path, so I specified the latest step directly: 2800.
The error occurs when I try to resume from the checkpoint path.
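For anyone in the same situation (only step directories, no "latest" marker), here is a minimal sketch of picking the newest step directory. It assumes checkpoints live in directories named global_step_<N>, as the global_step_2800 value above suggests. find_latest_step_dir is a hypothetical helper for illustration, not part of verl.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: one directory per saved step, e.g. global_step_2800.
_STEP_DIR = re.compile(r"^global_step_(\d+)$")


def find_latest_step_dir(checkpoint_root: str) -> Optional[Path]:
    """Return the global_step_<N> directory with the largest N, or None."""
    root = Path(checkpoint_root)
    if not root.is_dir():
        return None

    best_step, best_dir = -1, None
    for child in root.iterdir():
        match = _STEP_DIR.match(child.name)
        # Skip plain files and directories that do not follow the scheme.
        if match and child.is_dir():
            step = int(match.group(1))
            if step > best_step:
                best_step, best_dir = step, child
    return best_dir
```

Comparing the step as an integer matters here: a plain string sort would rank global_step_900 above global_step_2800.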
@physicsru It seems that you didn't fetch the latest main branch? The load_checkpoint func can be found in fsdp_workers.py
Thank you so much!
I checked fsdp_checkpoint_manager.py, and there is a load_checkpoint function in that class.
However, self.actor_rollout_wg is of type <verl.single_controller.ray.base.RayWorkerGroup>.
Could you help me figure this out when you are available?
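One possible way to reconcile the two observations: if the worker group only forwards methods that the worker class itself exposes, then a group built from an older worker class will not have load_checkpoint, even though the group's own type is unchanged. The sketch below illustrates that pattern in plain Python. All names in it (export, WorkerGroupSketch, OldWorker, NewWorker) are hypothetical, and it is an assumption about the general mechanism, not verl's actual implementation.

```python
def export(fn):
    """Mark a worker method as callable through the group (hypothetical marker)."""
    fn._exported = True
    return fn


class WorkerGroupSketch:
    """Hypothetical stand-in for a worker group; not verl's RayWorkerGroup."""

    def __init__(self, worker_cls, num_workers=2):
        self._workers = [worker_cls() for _ in range(num_workers)]
        # Attach one dispatcher per exported worker method. Methods that the
        # worker class does not define never become attributes of the group.
        for name in dir(worker_cls):
            attr = getattr(worker_cls, name)
            if callable(attr) and getattr(attr, "_exported", False):
                setattr(self, name, self._make_dispatcher(name))

    def _make_dispatcher(self, name):
        def dispatch(*args, **kwargs):
            # Call the same method on every worker and collect the results.
            return [getattr(w, name)(*args, **kwargs) for w in self._workers]
        return dispatch


class OldWorker:
    @export
    def save_checkpoint(self, path):
        return f"saved to {path}"


class NewWorker(OldWorker):
    @export
    def load_checkpoint(self, path):
        return f"loaded from {path}"


old_group = WorkerGroupSketch(OldWorker)
new_group = WorkerGroupSketch(NewWorker)
print(hasattr(old_group, "load_checkpoint"))  # False: would raise AttributeError if called
print(hasattr(new_group, "load_checkpoint"))  # True
```

Under that assumption, the AttributeError points at the worker code being out of date rather than at the group class, which would be consistent with the suggestion to fetch the latest main branch.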
Optimizer state is resumed in this PR: https://github.com/volcengine/verl/pull/216/files @physicsru what's the issue you're running into?
https://verl.readthedocs.io/en/latest/advance/checkpoint.html