traincheck-team

12 comments by traincheck-team

@loadams Hi Logan, I apologize for the late reply. I’ve reviewed the 9 unit test failures in the recent workflow run: https://github.com/deepspeedai/DeepSpeed/actions/runs/13205140637/job/36866442471. My understanding is that these failures are caused...

I found the following code segment that seems to be linked to the reported behavior. The root cause is the non-atomic snapshot write in `save()` combined with the one-way rename...
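
For readers unfamiliar with the term, here is a minimal sketch of what an atomic snapshot write generally looks like: write to a temporary file, then swap it into place. It assumes a hypothetical `save_snapshot` helper and is not the project's actual `save()`; the rename behavior mentioned in the comment is truncated in the source, so it is not modeled here.

```python
import os
import tempfile


def save_snapshot(path: str, data: bytes) -> None:
    """Hypothetical helper: write data so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the final swap
    # stays on one filesystem (a cross-device rename would not be atomic).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".snapshot-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        # os.replace overwrites the destination; on POSIX the swap is atomic.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the swap completed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this pattern, a crash during the write leaves the previous snapshot intact, because the destination path is only touched by the final `os.replace` call.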