Stas Bekman
Let's merge this, Tunji?
Can we try again please, Tunji?
I just merged something on transformers a few hours ago - perhaps we broke something there - checking your CI
It looks like someone broke the `zero_to_fp32.py` file - it now starts with:
```
'''Copyright The Microsoft DeepSpeed Team'''
#!/usr/bin/env python
```
whereas it should start with:
```
#!/usr/bin/env python
```
...
This is the source of the breakage: https://github.com/microsoft/DeepSpeed/pull/2889 - @jeffra. If I revert that commit, everything works. If you merge https://github.com/microsoft/DeepSpeed/pull/2909, it should work.
`7*(8+4)=84` - so you should have an 84GB universal checkpoint (8+4 = optim states + weights). Can you check with `du -s` where you get the bloat? e.g.
```
du -ahd1 /path...
```
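To spell out that arithmetic, here is a minimal sketch, assuming a 7B-parameter model with fp32 master weights (4 bytes/param) and fp32 Adam optimizer states (8 bytes/param) - the byte counts are my assumptions, not measurements of your checkpoint:
```python
# Rough checkpoint-size estimate; the per-parameter byte counts are assumptions.
params = 7e9                 # 7B parameters
bytes_per_param_weights = 4  # fp32 master weights
bytes_per_param_optim = 8    # Adam momentum + variance in fp32
total_gb = params * (bytes_per_param_weights + bytes_per_param_optim) / 1e9
print(f"expected universal checkpoint size: ~{total_gb:.0f} GB")  # ~84 GB
```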
Thank you for the details, @LiinXemmon. @tjruwase, is it possible we are hitting the tensor bloat here as well, the same as you're fixing in https://github.com/microsoft/DeepSpeed/pull/3348? @LiinXemmon, could you please confirm...
> The sweep results suggest that zero-infinity can be configured to do offloading at read rate of 3GB/sec and write rate of 2.6GB/sec.

So you want to configure the [asynchronous...
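For context, this is roughly where those measured rates would end up being applied - a sketch of the relevant config sections (the values below are placeholders on my side, not tuned recommendations; the exact key names should be double-checked against the current DeepSpeed docs):
```python
import json

# Sketch only: placeholder values; key names follow the DeepSpeed ZeRO-Infinity
# NVMe offload ("offload_optimizer"/"offload_param") and "aio" config sections.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    "aio": {
        "block_size": 1048576,   # placeholder: 1MB I/O blocks
        "queue_depth": 16,       # placeholder
        "thread_count": 2,       # placeholder
        "single_submit": False,
        "overlap_events": True,
    },
}

print(json.dumps(ds_config, indent=2))
```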
I looked again through your paper - please correct me if I'm wrong, but it looks like we need at least 3GB/s of NVME-to-CPU bandwidth per GPU, so really my NVME...
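To make the aggregate requirement concrete, a back-of-the-envelope sketch (the 8-GPU node size is my assumption, not from the paper):
```python
# Aggregate NVMe bandwidth a node would need so that each GPU gets the
# assumed 3 GB/s of NVMe<->CPU throughput.
per_gpu_bw_gbs = 3   # GB/s per GPU, figure from the discussion above
gpus_per_node = 8    # assumption: a typical 8-GPU node
print(f"required aggregate NVMe bandwidth: {per_gpu_bw_gbs * gpus_per_node} GB/s per node")
```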
This helps a lot, thank you! Can we make `parse_aio_stats.py` take in both read and write reports and generate the recommended config for the user?
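Something along these lines is what I have in mind - purely a sketch, since the report format and the `recommend_aio_config` helper are hypothetical and not the existing `parse_aio_stats.py` API:
```python
import json

def recommend_aio_config(read_report: dict, write_report: dict) -> dict:
    """Hypothetical helper: pick the aio settings whose combined read+write
    throughput (GB/s) was highest across the two sweep reports.

    Each report is assumed to map a settings key, e.g. a
    "block_size/queue_depth/single_submit/overlap_events" string,
    to the measured throughput for that run.
    """
    best_key = max(read_report, key=lambda k: read_report[k] + write_report.get(k, 0.0))
    block_size, queue_depth, single_submit, overlap_events = best_key.split("/")
    return {
        "aio": {
            "block_size": int(block_size),
            "queue_depth": int(queue_depth),
            "single_submit": single_submit == "true",
            "overlap_events": overlap_events == "true",
        }
    }

if __name__ == "__main__":
    # Toy inputs just to show the shape of the idea.
    read = {"1048576/16/false/true": 3.0, "262144/8/false/true": 2.1}
    write = {"1048576/16/false/true": 2.6, "262144/8/false/true": 1.9}
    print(json.dumps(recommend_aio_config(read, write), indent=2))
```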