Stas Bekman
So @HugoLaurencon figured it out - the culprit was `communication_data_type==fp16` with dtype bf16 - the wrong numerical range: anything above ~64k becomes Inf in fp16. So large grads were becoming Inf during reduction. To help...
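A minimal sketch of the failure mode, for anyone who wants to see it directly (values and variable names are illustrative): bf16 can represent values far beyond fp16's max (~65504), so casting large bf16 grads to fp16 for the reduction turns them into Inf.

```python
import torch

# bf16 has the same exponent range as fp32, so 70000 is representable
# (rounded to the nearest bf16 value); fp16 tops out at ~65504.
g_bf16 = torch.tensor([70000.0], dtype=torch.bfloat16)
print(g_bf16)                    # tensor([70144.], dtype=torch.bfloat16) - finite
print(g_bf16.to(torch.float16))  # tensor([inf], dtype=torch.float16) - overflow
```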
> @stas00, did we agree that the communication data type for fp16 should remain fp16 for BC's sake? Yes, but also because it works. It is scaled and shouldn't overflow...
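To unpack the "it is scaled" part: under genuine fp16 training, gradients travel through the pipeline in the loss-scaled domain, and the dynamic loss scaler detects Inf/NaN grads, skips the step, and lowers the scale, so an overflow during reduction is caught rather than silently corrupting the weights. DeepSpeed uses its own dynamic loss scaler; here is a hedged sketch of the analogous mechanism in plain PyTorch, `torch.cuda.amp.GradScaler`:

```python
import torch

# Sketch: dynamic loss scaling makes fp16 comms safe in an fp16 setup.
model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 16, device="cuda")
with torch.autocast("cuda", dtype=torch.float16):
    loss = model(x).sum()
scaler.scale(loss).backward()  # grads live in the scaled fp16 domain
scaler.step(optimizer)         # skipped if any grad is Inf/NaN
scaler.update()                # scale is reduced after an overflow
```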
Whoa! This is priceless sharing, @GuanhuaWang - thank you (XieXie)! Can we do a variation for bf16, which is rapidly taking over fp16 for LLMs as we speak? Please note that...
@GuanhuaWang, @jeffra - let's revive this thread and give it a higher priority, if you're willing to support that - the main question I'm asked most often these days...
Also, as I started reproducing this math, there are many more things to take into account here with regard to the 3x multiplier, which in https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491 is 2b+2b+2b (fwd+bwd+grad)...
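To make the 3x multiplier concrete, here is a hedged sketch of the per-step communication volume it implies, assuming the 2b+2b+2b breakdown above maps to an all-gather of the sharded weights for forward, another all-gather for backward, and a reduce-scatter of the gradients, each at 2 bytes/param in bf16 (`zero3_comm_volume_bytes` is just an illustrative name, not a DeepSpeed API):

```python
def zero3_comm_volume_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    fwd_allgather = bytes_per_param * num_params  # gather sharded weights for fwd
    bwd_allgather = bytes_per_param * num_params  # gather them again for bwd
    grad_reduce   = bytes_per_param * num_params  # reduce-scatter the grads
    return fwd_allgather + bwd_allgather + grad_reduce  # the "3x" multiplier

# e.g. a 175B-param model in bf16 (2 bytes/param):
print(zero3_comm_volume_bytes(175e9) / 2**30, "GiB per step")  # ~978 GiB
```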
I was able to confirm with @samyam that dividing by the number of nodes is incorrect in the math of https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491. You can find the correct math here: https://github.com/stas00/ml-engineering/tree/master/model-parallelism#inter-node-speed-requirements-to-use-zero
Yes, that would be another way to do it. I just thought that since this is very rarely going to be used, it wouldn't be worthwhile adding to...
If you're using DeepSpeed via Accelerate, it should work fine there - it has since the summer. Try the approach shown in this reply: https://github.com/huggingface/accelerate/issues/1401#issuecomment-1543257739 I have just tried the code from above and it works...
Of course - the 2 PRs are orthogonal to each other.
- recent PyTorch introduced deprecations and created 2 new APIs, so this current PR switches to using those.
- ...
Oh, great minds think alike - apologies, I didn't know of your PR, Mayank. I totally don't care which version gets merged. Yours is definitely prior art, so if...