Stas Bekman
So @HugoLaurencon figured it out - the culprit was `communication_data_type==fp16` with dtype bf16 - the wrong numerical range: anything above ~64k becomes Inf in fp16. So large grads were becoming Inf during reduction. To help...
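A minimal sketch of the failure mode, for anyone who wants to see it directly (values and variable names are illustrative): bf16 can represent values far beyond fp16's max (~65504), so casting large bf16 grads to fp16 for the reduction turns them into Inf.

```python
import torch

# bf16 has the same exponent range as fp32, so 70000 is representable
# (rounded to the nearest bf16 value); fp16 tops out at ~65504.
g_bf16 = torch.tensor([70000.0], dtype=torch.bfloat16)
print(g_bf16)                    # tensor([70144.], dtype=torch.bfloat16) - finite
print(g_bf16.to(torch.float16))  # tensor([inf], dtype=torch.float16) - overflow
```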
> @stas00, did we agree that the communication data type for fp16 should remain fp16 for BC's sake? Yes, but also because it works. It is scaled and shouldn't overflow...
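To unpack the "it is scaled" part: under genuine fp16 training, gradients travel through the pipeline in the loss-scaled domain, and the dynamic loss scaler detects Inf/NaN grads, skips the step, and lowers the scale, so an overflow during reduction is caught rather than silently corrupting the weights. DeepSpeed uses its own dynamic loss scaler; here is a hedged sketch of the analogous mechanism in plain PyTorch, `torch.cuda.amp.GradScaler`:

```python
import torch

# Sketch: dynamic loss scaling makes fp16 comms safe in an fp16 setup.
model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 16, device="cuda")
with torch.autocast("cuda", dtype=torch.float16):
    loss = model(x).sum()
scaler.scale(loss).backward()  # grads live in the scaled fp16 domain
scaler.step(optimizer)         # skipped if any grad is Inf/NaN
scaler.update()                # scale is reduced after an overflow
```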
Whoa! This is priceless sharing, @GuanhuaWang - thank you (XieXie)! Can we do a variation for bf16, which is rapidly taking over fp16 for LLMs as we speak? Please note that...
@GuanhuaWang, @jeffra - let's revive this thread and give it a higher priority, if you're willing to support that - the main question I'm asked most often these days...
Also, as I started reproducing this math, there are many more things to take into account here with regard to the 3x multiplier, which in https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491 is 2b+2b+2b (fwd+bwd+grad)...
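To make the 3x multiplier concrete, here is a hedged sketch of the per-step communication volume it implies, assuming the 2b+2b+2b breakdown above maps to an all-gather of the sharded weights for forward, another all-gather for backward, and a reduce-scatter of the gradients, each at 2 bytes/param in bf16 (`zero3_comm_volume_bytes` is just an illustrative name, not a DeepSpeed API):

```python
def zero3_comm_volume_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    fwd_allgather = bytes_per_param * num_params  # gather sharded weights for fwd
    bwd_allgather = bytes_per_param * num_params  # gather them again for bwd
    grad_reduce   = bytes_per_param * num_params  # reduce-scatter the grads
    return fwd_allgather + bwd_allgather + grad_reduce  # the "3x" multiplier

# e.g. a 175B-param model in bf16 (2 bytes/param):
print(zero3_comm_volume_bytes(175e9) / 2**30, "GiB per step")  # ~978 GiB
```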
I was able to confirm with @samyam that dividing by the number of nodes is incorrect in the math of https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491. You can find the correct math here: https://github.com/stas00/ml-engineering/tree/master/model-parallelism#inter-node-speed-requirements-to-use-zero
Yes, that would be another way to do it. I just thought that since this is very rarely going to be used, it wouldn't be worthwhile adding to...
If you're using DeepSpeed via Accelerate, it should work fine there - it has since the summer. Try the approach shown in this reply: https://github.com/huggingface/accelerate/issues/1401#issuecomment-1543257739 I have just tried the code from above and it works...
Of course - the 2 PRs are orthogonal to each other.
- recent PyTorch introduced deprecations and created 2 new APIs, so this current PR switches to using those.
- ...
Oh, great minds think alike - apologies, I didn't know of your PR, Mayank. I totally don't care which version gets merged. Yours is definitely prior art, so if...