Stas Bekman comments

Results: 664 comments of Stas Bekman

Enable torch.compile with ZeRO (Experimental)

@tohtana, thank you for implementing my suggestions. I haven't tested the code, but looking at your tests, this looks good. I agree that doing this work incrementally is a good...

[REQUEST] BF16 mixed precision => grad accum in fp32

@tjruwase, would it be possible to implement this? We are ready to start using ZeRO-3/bf16 for the multi-modal training. Thank you very much!

[REQUEST] BF16 mixed precision => grad accum in fp32

Hi Tunji - you're the owner so it's up to you to decide. The new one is a duplicate of this one, so typically the earliest one stays. And I...

[REQUEST] BF16 mixed precision => grad accum in fp32

Fantastic news, Tunji. Thank you. And, yes, we would be happy to experiment with your WIP PR.

[REQUEST] BF16 mixed precision => grad accum in fp32

Please see: https://github.com/microsoft/DeepSpeed/pull/2847

ZeRO Gradient Accumulation Dtype.

Thank you very much for working on this important feature, Joe!

ZeRO Gradient Accumulation Dtype.

This is much easier to follow, @jomayeri - thank you for the renames!

ZeRO Gradient Accumulation Dtype.

@jomayeri, if I may share a useful gh trick with you - you might want to add: ``` Fixes: https://github.com/microsoft/DeepSpeed/issues/2352 ``` to the OP and it'll automatically link to close...

[BUG] overflow warning needs to be different for fp16 and non-fp16

> @stas00, I am curious what happened to the training in this case, since there is no loss scaling for non-fp16. Did the subsequent iterations continue to report overflows? We...

[BUG] overflow warning needs to be different for fp16 and non-fp16

Don't have a small repro yet; if I do, I will let you know. My hypothesis is that grad clipping is not numerically stable, since we checked with the backward...