Yizhou Wang
@loadams Hi, could you please help me trigger the CI? My CLA was reviewed and passed today. Thank you!
@tjruwase Hi, we found a bug in DeepSpeed: enabling tensor parallelism = 2 for Megatron-DeepSpeed 20B on 4 nodes produces the following error: _RuntimeError: Global rank 0 is not part...
> @YizhouZ, thanks for this PR. Apologies for the delay as we resolve some CI issues. We plan to merge soon. @tjruwase Thanks!
> @YizhouZ, do you know why this is not a problem for zero stage 1 or 2? Hi @tjruwase, only stage 3 triggers this post_init_method; the others would not go...
> @YizhouZ, thanks for confirmation. That makes sense since TP>1 is not very well tested with ZeRO stage 3. This certainly shows a gap in our unit tests. > >...
@tjruwase Could you please help me trigger the CI? My CLA was reviewed and passed today. Thank you!
@tjruwase I fixed the failing CI case. Could you please check it? Thank you!
Hi @tjruwase, it seems the current CI failure is not caused by my changes. The previous check passed but the latest one failed, and the difference...
Thanks for triggering CI. Do you have comments on this PR? @loadams @tjruwase
> Thanks for triggering CI. Do you have comments on this PR? @loadams @tjruwase Hi @loadams @tjruwase, this PR does not seem to be in the merge queue. Could you give us some...