DeepSpeed
Resolve hard dependency on MoE for contiguous_gradients in Stage 1
Motivation for This PR: In engine.py, contiguous_gradients for Stage 1 depends on MoE, which means that even with contiguous_gradients enabled, Stage 1 still defaults to buffered allreduce during reduce_ipg_grads. If MoE is disabled (no experts), then even with contiguous_gradients set, Stage 1 takes the allreduce pathway; this is a hard dependency on MoE. This PR keeps the user setting in effect regardless of MoE: if contiguous_gradients is set, it stays set in DeepSpeed. Without MoE, setting contiguous_gradients=True now lets Stage 1 use the reduce collective instead of allreduce (as it should). With MoE experts enabled it also goes through the reduce path, so existing functionality is not broken. Related issues: #622 #264 #1300
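For context, here is a minimal sketch of the decoupling this PR describes. The flag names (contiguous_gradients, has_moe_layers) follow DeepSpeed's config vocabulary, but the function below is hypothetical and is not the actual engine.py code:

```python
# Hypothetical illustration of the gating change; not the actual engine.py code.

def resolve_contiguous_gradients(zero_stage: int,
                                 contiguous_gradients: bool,
                                 has_moe_layers: bool) -> bool:
    """Decide whether ZeRO Stage 1 keeps contiguous_gradients enabled."""
    # Before: the flag was effectively honored only when MoE layers were
    # present, so a non-MoE Stage 1 run silently fell back to buffered
    # allreduce during reduce_ipg_grads:
    # return contiguous_gradients and has_moe_layers and zero_stage == 1

    # After: the user setting is respected regardless of MoE, so Stage 1
    # can take the reduce collective path whenever the flag is enabled.
    return contiguous_gradients


if __name__ == "__main__":
    # Non-MoE Stage 1 run with contiguous_gradients enabled in the config.
    print(resolve_contiguous_gradients(zero_stage=1,
                                       contiguous_gradients=True,
                                       has_moe_layers=False))  # True
```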
@tjruwase, requesting review.
PS: CLA is signed.
@tjruwase could you please review? contiguous_gradients should not be hard-bound to MoE in the engine, as that would mean enabling MoE layers every time just to get contiguous gradients. This also affects performance, hence seeking your review.
@abhilash1910 -- Thanks for the PR! Have you tested performance and found issues? It would be helpful to add some more details if you can.
@RezaYazdaniAminabadi, @jeffra, and @tjruwase - I think we should accept this PR. Do you guys remember we did something like this but never pushed a public PR? Thoughts?
Thanks @awan-10. I found that decoupling MoE yields the reduce collective; I am checking whether there is any performance impact and will add the log file for the perf characterisations. In any case, MoE should be detachable from contiguous gradients.