Stas Bekman
https://github.com/huggingface/blog/pull/538
My tests were run on the JeanZay HPC, so it's possible their servers simply have beefier hardware. It is interesting that you both report the same speed with int8. @RezaYazdaniAminabadi, do...
Unfortunately I no longer have access to JeanZay, so I can't retrieve any more data at the moment.

> Could this be due to slow communication between GPUs?

That's very...
The initial topology conversion was written for BF16Optimizer, but here you use ZeRO stage=1, which I haven't worked with, so I have no experience with this use case. Tagging @tjruwase who...
Honestly I'm not sure as I wasn't part of the data team. I remember they said that most likely the normal tokenizer should work, but it might be safer to...
> Could you please provide more details about the training of 1B7 or 3B or 7B1 models?

I only worked on 176B, so I'm not the right person to ask....
A small correction: that's not Apex, but DeepSpeed's top-level optimizer doing the skipping.
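For context, the "skipping" in question is the usual dynamic loss-scaling behaviour in fp16 training: if an overflow (inf/NaN) shows up in the gradients, the optimizer step is skipped and the loss scale is reduced. A minimal sketch of that general mechanism (illustrative only, not DeepSpeed's actual code; all names here are made up):

```python
import torch

class DynamicLossScaler:
    """Illustrative dynamic loss scaler: skip the step on overflow,
    halve the scale, and grow it back after a run of good steps."""

    def __init__(self, init_scale=2.0**16, growth_interval=1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, optimizer, params):
        # Check the (already unscaled) gradients for inf/NaN.
        overflow = any(
            p.grad is not None and not torch.isfinite(p.grad).all()
            for p in params
        )
        if overflow:
            # Skip this optimizer step and shrink the loss scale.
            optimizer.zero_grad(set_to_none=True)
            self.scale /= 2.0
            self.good_steps = 0
            return False
        optimizer.step()
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2.0  # cautiously grow the scale back
        return True
```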
Please see: https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#checkpoint-reshaping I know it's not generic at the moment; please let me know if you run into any difficulties following those instructions while adapting them to your situation. @tjruwase,...
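In case it helps to see the idea behind those instructions, reshaping tensor-parallel checkpoints essentially means concatenating matching weight shards along the dimension they were split on: column-parallel layers along the output dimension, row-parallel layers along the input dimension. A toy sketch with made-up file names, not a substitute for the linked procedure:

```python
import torch

# Toy sketch of merging tensor-parallel shards (hypothetical file names).
tp_ranks = 4
shards = [torch.load(f"layer_01_shard_{r}.pt") for r in range(tp_ranks)]

# Column-parallel weights were split along the output dimension (dim 0),
# row-parallel weights along the input dimension (dim 1).
merged = {
    "mlp.dense_h_to_4h.weight": torch.cat(
        [s["mlp.dense_h_to_4h.weight"] for s in shards], dim=0),
    "mlp.dense_4h_to_h.weight": torch.cat(
        [s["mlp.dense_4h_to_h.weight"] for s in shards], dim=1),
}
torch.save(merged, "layer_01_merged.pt")
```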
Only 176B was trained on A100s and thus bf16 (z?); everything else was trained on V100s, thus fp16, thus z1.
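To make the fp16/bf16 + ZeRO distinction concrete, here is roughly what the two DeepSpeed config variants look like (illustrative fragments only, not the actual BLOOM training configs):

```python
# Illustrative DeepSpeed config fragments (not the actual BLOOM configs).

# V100 runs: fp16 with dynamic loss scaling + ZeRO stage 1.
ds_config_fp16_z1 = {
    "fp16": {"enabled": True, "loss_scale": 0},  # 0 = dynamic loss scaling
    "zero_optimization": {"stage": 1},
}

# A100 run (176B): bf16, which needs no loss scaling.
ds_config_bf16 = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},  # stage shown here is an assumption
}
```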
I wasn't part of this training. @TevenLeScao, do you by chance know who did the bloom-3b training? And is it possible to update to deepspeed@master in the conda env,...