[QUESTION] About all_reduce order while using CP

Open junjzhang opened this issue 1 year ago • 1 comment

I noticed that the CP gradients are also reduced in DDP, which means gradients are first accumulated across micro batches and then all-reduced across the CP group, minimizing communication cost. However, the reduction order differs from the single-card case: with CP it is split seq -> micro batch -> CP group, whereas on a single card it is split seq -> CP group -> micro batch. Because floating-point addition is not associative, I believe this difference may lead to numerical discrepancies. Is my assessment correct, and if so, is this error acceptable in practice?
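To make the concern concrete, here is a minimal pure-Python sketch of the two summation orders. It does not use Megatron-LM or `torch.distributed`. The function names and the `grads[micro_batch][cp_rank]` layout are hypothetical, and the values are deliberately exaggerated so the effect shows up even in float64. Real gradients would normally differ by a much smaller amount.

```python
import math

# Hypothetical partial gradients for one parameter element,
# indexed as grads[micro_batch][cp_rank]. Values are chosen to
# exaggerate rounding; they are not taken from any real run.
grads = [
    [1e16, 1.0],   # micro batch 0: CP rank 0, CP rank 1
    [-1e16, 1.0],  # micro batch 1: CP rank 0, CP rank 1
]


def reduce_micro_batch_then_cp(g):
    """Accumulate over micro batches on each CP rank, then sum across ranks."""
    per_rank = []
    for cp in range(len(g[0])):
        acc = 0.0
        for mb in range(len(g)):
            acc += g[mb][cp]
        per_rank.append(acc)
    total = 0.0
    for value in per_rank:
        total += value
    return total


def reduce_cp_then_micro_batch(g):
    """Sum across CP ranks within each micro batch, then accumulate micro batches."""
    per_micro_batch = []
    for mb in range(len(g)):
        acc = 0.0
        for cp in range(len(g[mb])):
            acc += g[mb][cp]
        per_micro_batch.append(acc)
    total = 0.0
    for value in per_micro_batch:
        total += value
    return total


# Correctly rounded reference sum of all partial gradients.
exact = math.fsum(v for row in grads for v in row)

print("micro batch -> CP group:", reduce_micro_batch_then_cp(grads))  # 2.0
print("CP group -> micro batch:", reduce_cp_then_micro_batch(grads))  # 0.0
print("math.fsum reference:    ", exact)                              # 2.0
```

Both functions add the same four numbers. Only the grouping changes, which is enough to change the result. Whether a discrepancy of realistic size matters for training is a separate question that this sketch does not answer.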

junjzhang, Sep 27 '24 09:09

@jaredcasper @deepakn94 Could you help me with this one?

junjzhang, Sep 28 '24 04:09

Marking as stale. No activity in 60 days.

github-actions[bot], Nov 27 '24 18:11

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot], Jul 31 '25 02:07