
Sync between multiple GPUs

Reginald-L opened this issue 7 months ago · 1 comment

Here is my code:

```python
import json

def save_checkpoint(self, global_step: int, epoch: int, loss: float):
    # Build the checkpoint directory path
    checkpoint_dir = self.output_dir / f'checkpoint-e-{epoch}-s-{global_step}'

    # Save the state of every object that went through accelerator.prepare()
    self.accelerator.save_state(checkpoint_dir)

    metadata = {
        "global_step": global_step,
        "epoch": epoch,
        "loss": loss
    }
    with open(checkpoint_dir / "metadata.json", "w") as f:
        json.dump(metadata, f)
```

When using accelerate with DeepSpeed ZeRO-2, an error is raised while saving the model state.

Here is the error message:

[screenshot of the error message]

My setup: 4 × H20 GPUs, Ubuntu, accelerate version 1.7.0.

Reginald-L · Jun 09 '25
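For reference, a common way to structure this kind of checkpoint save under DeepSpeed ZeRO-2 (a minimal sketch, not code from this thread; the standalone function signature and directory layout are assumptions) is to call `save_state` on every rank, since ZeRO-2 shards optimizer state across processes, and to keep plain-file writes such as the metadata JSON on the main process only:

```python
import json
from pathlib import Path

from accelerate import Accelerator


def save_checkpoint(accelerator: Accelerator, output_dir: Path,
                    global_step: int, epoch: int, loss: float):
    checkpoint_dir = output_dir / f"checkpoint-e-{epoch}-s-{global_step}"

    # save_state must run on every process: under ZeRO-2 each rank holds
    # its own shard of the optimizer state, and accelerate/DeepSpeed
    # coordinate the sharded write internally.
    accelerator.save_state(checkpoint_dir)

    # Plain-file writes should be done by one rank only, otherwise all
    # ranks race to write the same metadata.json.
    if accelerator.is_main_process:
        metadata = {"global_step": global_step, "epoch": epoch, "loss": loss}
        with open(checkpoint_dir / "metadata.json", "w") as f:
            json.dump(metadata, f)

    # Block until every rank has finished writing before training resumes.
    accelerator.wait_for_everyone()
```

Whether this matches the failure in the screenshot depends on the actual traceback; calling `save_state` from only one process, or letting ranks diverge around the save, is a frequent cause of hangs or errors in multi-GPU checkpointing, but this sketch is only a point of comparison.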

Cannot reproduce this on my end. Could something be wrong with your cluster config?

S1ro1 · Jun 23 '25
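To check what each rank actually sees on the cluster, a small sanity-check script (the file name is hypothetical; run it with `accelerate launch check_setup.py`) can print the distributed state, which on a 4 × H20 box with DeepSpeed ZeRO-2 should report a DeepSpeed environment and 4 processes:

```python
# check_setup.py
from accelerate import Accelerator

accelerator = Accelerator()

# The AcceleratorState repr includes the distributed backend, the number
# of processes, and the per-process device.
print(accelerator.state)
print(f"rank {accelerator.process_index} of {accelerator.num_processes} "
      f"running on {accelerator.device}")
```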

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Jul 17 '25