Shuai Zheng
When I use DeepSpeed for fine-tuning without providing the ZeRO state checkpoints, the FP32 master parameters are not initialized properly. This PR fixes the issue.
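For context, the usual mixed-precision pattern is that the FP32 master copies are cloned from the current FP16 model weights when the optimizer is set up; the sketch below is a generic illustration of that pattern, not the actual DeepSpeed code touched by this PR:

```python
import torch

def init_fp32_master_params(fp16_params):
    """Clone FP32 master copies from the current FP16 weights.

    Generic mixed-precision pattern, for illustration only: the master copy
    must mirror the FP16 weights at construction time, otherwise optimizer
    steps start from stale or uninitialized values.
    """
    master_params = []
    for p in fp16_params:
        master = p.detach().clone().float()  # FP32 copy of the FP16 weight
        master.requires_grad = True
        master_params.append(master)
    return master_params
```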
PyTorch distributed has a feature that allows users to define a data parallel group (https://pytorch.org/docs/stable/distributed.html). This feature is very useful when using model parallelism, and we could possibly use...
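As a rough illustration of how such a group can be built with `torch.distributed.new_group` (the layout where consecutive ranks form a model-parallel group is an assumption for this example, not something stated above):

```python
import torch.distributed as dist

def build_data_parallel_group(model_parallel_size: int):
    """Return the data-parallel group this rank belongs to.

    Assumed layout (illustration only): consecutive ranks of size
    `model_parallel_size` form one model-parallel group, and ranks holding
    the same position within their model-parallel group form one
    data-parallel group.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    data_parallel_group = None
    # Every rank must call new_group() for every group, even groups it
    # does not belong to.
    for i in range(model_parallel_size):
        ranks = list(range(i, world_size, model_parallel_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            data_parallel_group = group
    return data_parallel_group
```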
In optimizer partitioning, the parameters are fused into one big vector and then partitioned across workers, so the number of chunks can be much smaller than the number of...
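A minimal sketch of that flatten-and-partition idea (the function name and padding scheme are illustrative, not the actual DeepSpeed implementation):

```python
import torch

def partition_flat_params(params, world_size, rank):
    """Flatten all parameters into one vector and keep this rank's chunk.

    Sketch of ZeRO-style optimizer-state partitioning: the chunk count
    equals the number of ranks, not the number of parameters.
    """
    flat = torch.cat([p.detach().reshape(-1) for p in params])
    padded_len = -(-flat.numel() // world_size) * world_size  # round up
    flat = torch.nn.functional.pad(flat, (0, padded_len - flat.numel()))
    chunks = flat.chunk(world_size)   # one chunk per rank
    return chunks[rank].clone()       # this rank's partition
```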
**Describe the bug** ZeRO 1 and ZeRO 2 give different losses for the same setting. A similar issue was reported in https://github.com/microsoft/DeepSpeed/issues/966#issuecomment-829516471 **To Reproduce** Run `deepspeed test.py --zero 1` and `deepspeed test.py...
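For reference, switching the ZeRO stage in the DeepSpeed config is the only intended difference between the two runs; `test.py` itself is not shown here, so the sketch below is a hypothetical stand-in for the config it would build:

```python
# Hypothetical sketch of the config difference being compared
# (the actual test.py is not included in this snippet).
def make_ds_config(zero_stage: int) -> dict:
    return {
        "train_micro_batch_size_per_gpu": 4,        # assumed value
        "fp16": {"enabled": True},                  # assumed value
        "zero_optimization": {"stage": zero_stage}, # only this differs
    }

config_stage1 = make_ds_config(1)  # `--zero 1`
config_stage2 = make_ds_config(2)  # `--zero 2`
```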
**Describe the bug** On H100 SXM5, running the Adam optimizer kernel standalone leads to `CUDA error: an illegal memory access was encountered` with certain tensor sizes such as `2359332864`. The...
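A hypothetical standalone reproducer along these lines (the actual script is not included in this snippet; the `FusedAdam` import path and hyperparameters are assumptions):

```python
import torch
from deepspeed.ops.adam import FusedAdam

# Hypothetical reproducer: a single parameter with 2359332864 elements,
# the size regime where the failure was observed (this count exceeds
# INT32_MAX, i.e. 2147483647).
numel = 2359332864
p = torch.zeros(numel, dtype=torch.float32, device="cuda", requires_grad=True)
p.grad = torch.zeros_like(p)

opt = FusedAdam([p], lr=1e-3)
opt.step()                 # runs the fused Adam CUDA kernel
torch.cuda.synchronize()   # surfaces any asynchronous CUDA error
```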
Hi, I am testing NHC with Slurm to automatically drain nodes with uncorrectable ECC errors. The NHC log shows the health check failing on the problematic node, but no...