[checkpoint_io] Fix gather_state_dict_fast

Open pbelevich opened this issue 7 months ago • 1 comment

🚨 Issue number

fixed #6248

📝 What does this PR do?

This PR fixes two issues:

  • For reqs = dist.batch_isend_irecv(ops), len(reqs) is not guaranteed to equal len(ops) (see the batch_isend_irecv implementation), so a single for loop over zip(reqs, target_metadata) is incorrect: zip stops at the shorter sequence and can silently skip entries of target_metadata.
  • batch_isend_irecv hangs in some cases; see https://github.com/pytorch/pytorch/issues/116590.
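
The first point can be illustrated without a distributed setup. The sketch below is a pure-Python stand-in, not the code changed in this PR: FakeWork, fake_batch_isend_irecv, process_zipped and process_separately are hypothetical names, and the assumption is that a coalescing backend may return a single work handle for the whole batch. It shows how zipping the handles with the metadata truncates the loop, and how waiting on all handles first and then iterating over the metadata avoids that.

```python
class FakeWork:
    """Hypothetical stand-in for a work handle returned by a batched P2P call."""

    def __init__(self):
        self.waited = False

    def wait(self):
        self.waited = True


def fake_batch_isend_irecv(ops, coalesce):
    # Assumption for illustration: a coalescing backend returns one handle
    # for the whole batch; otherwise there is one handle per op.
    return [FakeWork()] if coalesce else [FakeWork() for _ in ops]


def process_zipped(reqs, target_metadata):
    # Problematic pattern: zip() stops at the shorter sequence.
    processed = []
    for req, meta in zip(reqs, target_metadata):
        req.wait()
        processed.append(meta)
    return processed


def process_separately(reqs, target_metadata):
    # Safer pattern: wait on every handle, then walk the metadata on its own.
    for req in reqs:
        req.wait()
    return list(target_metadata)


ops = ["recv_a", "recv_b", "recv_c"]
target_metadata = ["a", "b", "c"]

reqs = fake_batch_isend_irecv(ops, coalesce=True)
zipped_result = process_zipped(reqs, target_metadata)
separate_result = process_separately(reqs, target_metadata)

print(zipped_result)    # only the first metadata entry is processed
print(separate_result)  # all three entries are processed
```

With one handle per op the two patterns give the same result, which is why the problem only shows up on backends that coalesce the batch.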

⭐️ Do you enjoy contributing to Colossal-AI?

  • [x] 🌝 Yes, I do.
  • [ ] 🌚 No, I don't.

cc @ver217 @kwen2501

pbelevich, May 28 '25 16:05

@pbelevich Thanks, I believe this is a correct fix for bullet 1, but could you explain a bit where the hanging issue happened and why this resolves it?

botbw, Aug 13 '25 07:08