Sync between multiple GPUs
Here is my code:

```python
def save_checkpoint(self, global_step: int, epoch: int, loss: float):
    # Build the checkpoint directory path
    checkpoint_dir = self.output_dir / f'checkpoint-e-{epoch}-s-{global_step}'
    # Save the state of every object that went through accelerator.prepare()
    self.accelerator.save_state(checkpoint_dir)
    metadata = {
        "global_step": global_step,
        "epoch": epoch,
        "loss": loss
    }
    with open(checkpoint_dir / "metadata.json", "w") as f:
        json.dump(metadata, f)
```

When using accelerate and DeepSpeed ZeRO-2, an error is reported when saving the model state.
Here is the error message:
My GPUs are 4× H20, the platform is Ubuntu, and the accelerate version is 1.7.0.
Cannot reproduce this on my end. Could something be wrong with your cluster config?
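Without the traceback this is only a guess, but a pattern that often avoids this class of failure under ZeRO-2 is to synchronize all ranks before saving and to restrict plain-file writes to the main process. A minimal sketch, assuming the same `self.accelerator` and `self.output_dir` attributes as in the snippet above:

```python
import json

def save_checkpoint(self, global_step: int, epoch: int, loss: float):
    checkpoint_dir = self.output_dir / f'checkpoint-e-{epoch}-s-{global_step}'
    # Under ZeRO-2 the optimizer state is sharded across ranks, so every
    # rank must call save_state; make sure all ranks reach this point
    # together before any of them touches the filesystem.
    self.accelerator.wait_for_everyone()
    self.accelerator.save_state(checkpoint_dir)
    # Plain-file writes should happen on exactly one rank to avoid races.
    if self.accelerator.is_main_process:
        metadata = {"global_step": global_step, "epoch": epoch, "loss": loss}
        with open(checkpoint_dir / "metadata.json", "w") as f:
            json.dump(metadata, f)
    # Re-synchronize so no rank runs ahead into the next training step
    # while the main process is still writing metadata.
    self.accelerator.wait_for_everyone()
```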
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.