save_checkpoint OOM using A5880*8(48G), Qwen2.5-7B

Open zyf8818 opened this issue 4 months ago • 2 comments

During GRPO training, an OOM error occurs when the checkpoint is saved, as shown in the following figure.

My understanding is that this happens because all 8 nodes save the file at the same time. Is there a way to have the 8 nodes save sequentially, or to save the content of only one node?

Aug 01 '25 04:08 zyf8818

Same problem here.

Aug 13 '25 15:08 Lan13

Try setting actor_rollout_ref.actor.fsdp_config.optimizer_offload=False
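
For anyone unsure where that goes: it is a Hydra-style override, so it can be appended to the training launch command. Below is a rough sketch, assuming the job is started through the verl.trainer.main_ppo entry point. The model path and the other overrides are placeholders for illustration, not taken from the original report, so keep whatever your own script already passes and change only the last line.

```shell
# Sketch only: the entry point and every override except the last are assumed;
# a real GRPO run needs more settings (data paths, batch sizes, rollout config).
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    trainer.n_gpus_per_node=8 \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False
```

With offload disabled, the optimizer state stays on the GPU instead of in host memory, so GPU memory use during training may go up. It is worth checking that the 48G cards still have headroom before relying on this.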

Nov 19 '25 09:11 Yumaokk