verl
save_checkpoint OOM using A5880*8(48G), Qwen2.5-7B
During GRPO training, an OOM error occurs when saving the checkpoint, as shown in the following figure.
My understanding is that this is caused by all 8 nodes saving the file simultaneously. Is there a way to make the 8 nodes save sequentially, or to save only the content of one node?
Same problem here.
Try setting 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False'.
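On the question of saving sequentially: below is a minimal, framework-agnostic sketch of the "one rank writes at a time" idea, not verl's actual checkpoint logic. The names `save_one_rank_at_a_time`, `save_fn`, and `simulate` are hypothetical. Threads stand in for the 8 ranks so the sketch runs anywhere; in a real job the `barrier` argument would presumably be `torch.distributed.barrier`. Note that with FSDP, gathering the state dict may itself require all ranks to participate together, so only the file-writing step could be staggered this way.

```python
import threading


def save_one_rank_at_a_time(rank, world_size, save_fn, barrier):
    """Let each rank write its checkpoint in turn instead of all at once.

    save_fn and barrier are hypothetical callables supplied by the caller.
    """
    for turn in range(world_size):
        if turn == rank:
            save_fn(rank)  # only the rank whose turn it is writes
        barrier()  # all ranks wait until the current writer is done


def simulate(world_size=8):
    """Run the scheme with threads as stand-ins for ranks; return write order."""
    barrier = threading.Barrier(world_size)
    order = []

    def fake_save(rank):
        # A real save_fn would write this rank's shard to disk here.
        order.append(rank)

    threads = [
        threading.Thread(
            target=save_one_rank_at_a_time,
            args=(rank, world_size, fake_save, barrier.wait),
        )
        for rank in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    print(simulate(8))
```

The trade-off is that checkpointing takes roughly `world_size` times longer in the write phase, in exchange for only one rank holding its write buffers at any moment.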