Stas Bekman
@tohtana, thank you for implementing my suggestions. I haven't tested the code but looking at your tests this looks good. I agree that doing this work incrementally is a good...
@tjruwase, would it be possible to implement this? We are ready to start using ZeRO-3/bf16 for the multi-modal training. Thank you very much!
Hi Tunji - you're the owner so it's up to you to decide. The new one is a duplicate of this one, so typically the earliest one stays. And I...
Fantastic news, Tunji. Thank you. And, yes, we would be happy to experiment with your WIP PR.
Please see: https://github.com/microsoft/DeepSpeed/pull/2847
Thank you very much for working on this important feature, Joe!
This is much easier to follow, @jomayeri - thank you for the renames!
@jomayeri, if I may share a useful gh trick with you - you might want to add: `Fixes: https://github.com/microsoft/DeepSpeed/issues/2352` to the OP and it'll automatically link to close...
> @stas00, I am curious what happened to the training in this case, since there is no loss scaling for non-fp16. Did the subsequent iterations continue to report overflows? We...
I don't have a small repro yet; if I get one, I will let you know. My hypothesis is that grad clipping is not numerically stable, since we checked with the backward...