ColossalAI
[checkpoint_io] Fix gather_state_dict_fast
🚨 Issue number
fixed #6248
📝 What does this PR do?
This PR fixed 2 issues:
- For `reqs = dist.batch_isend_irecv(ops)` it's not guaranteed that `len(ops) == len(reqs)` (see the `batch_isend_irecv` implementation), so having a single for loop over `zip(reqs, target_metadata)` is incorrect.
- `batch_isend_irecv` hangs in some cases: https://github.com/pytorch/pytorch/issues/116590
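To illustrate the first point, here is a minimal pure-Python sketch of why zipping request handles with per-op metadata can silently drop entries. It does not call PyTorch: `FakeWork` and `fake_batch` are hypothetical stand-ins, and the `coalesce` flag only simulates the case where fewer handles than ops are returned. It is not the code from this PR.

```python
class FakeWork:
    """Hypothetical stand-in for a request handle; not a PyTorch class."""

    def __init__(self):
        self.done = False

    def wait(self):
        self.done = True


def fake_batch(ops, coalesce):
    """Simulate a batched call: one handle per op, or a single handle
    covering all ops when coalescing is simulated (an assumption)."""
    return [FakeWork()] if coalesce else [FakeWork() for _ in ops]


def process_zipped(reqs, metadata):
    """Pattern described as incorrect: zip stops at the shorter sequence."""
    handled = []
    for req, meta in zip(reqs, metadata):
        req.wait()
        handled.append(meta)
    return handled


def process_separately(reqs, metadata):
    """Safer pattern: wait on every handle, then walk all metadata."""
    for req in reqs:
        req.wait()
    return list(metadata)


ops = ["recv_a", "recv_b", "recv_c"]
metadata = ["a", "b", "c"]

zipped_result = process_zipped(fake_batch(ops, coalesce=True), metadata)
separate_result = process_separately(fake_batch(ops, coalesce=True), metadata)

print(zipped_result)    # only the first metadata entry is handled
print(separate_result)  # every metadata entry is handled
```

Decoupling the wait loop from the metadata loop keeps the result independent of how many handles the batched call happens to return.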
⭐️ Do you enjoy contributing to Colossal-AI?
- [x] 🌝 Yes, I do.
- [ ] 🌚 No, I don't.
cc @ver217 @kwen2501
@pbelevich Thanks, I believe this is a correct fix for bullet 1, but could you explain a bit where the hanging issue happened and why this resolves it?