Shuai Zheng
When I use DeepSpeed for fine-tuning without providing the ZeRO state checkpoints, the FP32 master parameters are not initialized properly. This PR fixes the issue.
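For context, the usual mixed-precision pattern is that the FP32 master copies are cloned from the current FP16 model weights when the optimizer is set up; the sketch below is a generic illustration of that pattern, not the actual DeepSpeed code touched by this PR:

```python
import torch

def init_fp32_master_params(fp16_params):
    """Clone FP32 master copies from the current FP16 weights.

    Generic mixed-precision pattern, for illustration only: the master copy
    must mirror the FP16 weights at construction time, otherwise optimizer
    steps start from stale or uninitialized values.
    """
    master_params = []
    for p in fp16_params:
        master = p.detach().clone().float()  # FP32 copy of the FP16 weight
        master.requires_grad = True
        master_params.append(master)
    return master_params
```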
PyTorch distributed has a feature that allows users to define a data parallel group (https://pytorch.org/docs/stable/distributed.html). This feature is very useful when using model parallelism, and we could possibly use...
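As a rough illustration of how such a group can be built with `torch.distributed.new_group` (the layout where consecutive ranks form a model-parallel group is an assumption for this example, not something stated above):

```python
import torch.distributed as dist

def build_data_parallel_group(model_parallel_size: int):
    """Return the data-parallel group this rank belongs to.

    Assumed layout (illustration only): consecutive ranks of size
    `model_parallel_size` form one model-parallel group, and ranks holding
    the same position within their model-parallel group form one
    data-parallel group.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    data_parallel_group = None
    # Every rank must call new_group() for every group, even groups it
    # does not belong to.
    for i in range(model_parallel_size):
        ranks = list(range(i, world_size, model_parallel_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            data_parallel_group = group
    return data_parallel_group
```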
In optimizer partitioning, the parameters are fused into one big vector and then partitioned across workers, so the number of chunks can be much smaller than the number of...
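A minimal sketch of that flatten-and-partition idea (the function name and padding scheme are illustrative, not the actual DeepSpeed implementation):

```python
import torch

def partition_flat_params(params, world_size, rank):
    """Flatten all parameters into one vector and keep this rank's chunk.

    Sketch of ZeRO-style optimizer-state partitioning: the chunk count
    equals the number of ranks, not the number of parameters.
    """
    flat = torch.cat([p.detach().reshape(-1) for p in params])
    padded_len = -(-flat.numel() // world_size) * world_size  # round up
    flat = torch.nn.functional.pad(flat, (0, padded_len - flat.numel()))
    chunks = flat.chunk(world_size)   # one chunk per rank
    return chunks[rank].clone()       # this rank's partition
```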
**Describe the bug** ZeRO 1 and ZeRO 2 give different losses for the same setting. A similar issue was reported in https://github.com/microsoft/DeepSpeed/issues/966#issuecomment-829516471 **To Reproduce** Run `deepspeed test.py --zero 1` and `deepspeed test.py...
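For reference, switching the ZeRO stage in the DeepSpeed config is the only intended difference between the two runs; `test.py` itself is not shown here, so the sketch below is a hypothetical stand-in for the config it would build:

```python
# Hypothetical sketch of the config difference being compared
# (the actual test.py is not included in this snippet).
def make_ds_config(zero_stage: int) -> dict:
    return {
        "train_micro_batch_size_per_gpu": 4,        # assumed value
        "fp16": {"enabled": True},                  # assumed value
        "zero_optimization": {"stage": zero_stage}, # only this differs
    }

config_stage1 = make_ds_config(1)  # `--zero 1`
config_stage2 = make_ds_config(2)  # `--zero 2`
```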
**Describe the bug** On H100 SXM5, running the Adam optimizer kernel standalone leads to `CUDA error: an illegal memory access was encountered` with certain tensor sizes such as `2359332864`. The...
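A hypothetical standalone reproducer along these lines (the actual script is not included in this snippet; the `FusedAdam` import path and hyperparameters are assumptions):

```python
import torch
from deepspeed.ops.adam import FusedAdam

# Hypothetical reproducer: a single parameter with 2359332864 elements,
# the size regime where the failure was observed (this count exceeds
# INT32_MAX, i.e. 2147483647).
numel = 2359332864
p = torch.zeros(numel, dtype=torch.float32, device="cuda", requires_grad=True)
p.grad = torch.zeros_like(p)

opt = FusedAdam([p], lr=1e-3)
opt.step()                 # runs the fused Adam CUDA kernel
torch.cuda.synchronize()   # surfaces any asynchronous CUDA error
```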
Hi, I am testing NHC with Slurm to automatically drain nodes with uncorrectable ECC errors. The NHC log shows the health check failing on the problematic node, but no...