Guanhua Wang

21 comments of Guanhua Wang

> Hi all, I've got some time to circle back to this. I'm hoping someone on the team can take a look and provide some feedback when they get a...

Hi @FahriBilici, thanks for raising this issue. To reproduce the error, could you also provide the training script you ran and the command line (e.g., whether you used deepspeed...

Hi @deepcharm, thanks for the PR. Just curious, why would allreduce be faster than allgather? Allreduce is basically reduce-scatter + all-gather. Could we just make allgather a coalesced version...
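
For context on that last point, here is a minimal sketch (illustrative only, not DeepSpeed code) of the equivalence being assumed: an allreduce over a flat tensor gives the same result as a reduce-scatter followed by an all-gather. It uses standard `torch.distributed` collectives and assumes the process group is already initialized and the tensor length is divisible by the world size.

```python
# Illustrative sketch: allreduce decomposed as reduce-scatter + all-gather.
# Assumes torch.distributed is initialized and x.numel() % world_size == 0.
import torch
import torch.distributed as dist

def allreduce_via_rs_ag(x: torch.Tensor) -> torch.Tensor:
    """Reproduce the effect of dist.all_reduce(x) in two steps."""
    world_size = dist.get_world_size()
    shard = torch.empty(x.numel() // world_size, dtype=x.dtype, device=x.device)
    # Step 1: each rank receives the reduced (summed) values of its own shard.
    dist.reduce_scatter_tensor(shard, x)
    out = torch.empty_like(x)
    # Step 2: the reduced shards are gathered back so every rank holds the full result.
    dist.all_gather_into_tensor(out, shard)
    return out  # matches what dist.all_reduce(x) would leave in x
```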

> > Hi @deepcharm > > Thx for the PR. Just curious why allreduce could be faster than allgather? allreduce basically is doing reduce-scatter + all-gather. Could we just make...

Hi @Suparjie, thanks for raising this issue. I believe the reduction stream and the default compute stream are not synchronized properly. By adding your `stream.wait_stream(get_accelerator().current_stream())` above, I am wondering...

Sorry, I don't think this is the correct fix. Forcing `stream.synchronize()` means the corresponding NCCL call becomes a blocking call and will not overlap with subsequent compute.
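
The difference between the two approaches can be illustrated with plain PyTorch CUDA streams (a hypothetical sketch, not the actual DeepSpeed reduction-stream code): `wait_stream` only inserts an ordering dependency between streams and keeps the host non-blocking, whereas `synchronize()` stalls the host until the stream's work has drained, which removes the overlap between communication and the next compute kernels.

```python
# Hypothetical sketch of the two synchronization styles discussed above
# (plain PyTorch streams; the real reduction kernel is replaced by a placeholder).
import torch

comm_stream = torch.cuda.Stream()

def launch_reduction_overlapped(grad: torch.Tensor):
    # Non-blocking ordering: comm_stream waits for work already queued on the
    # current compute stream, then the collective is enqueued. The host returns
    # immediately, so later compute kernels can still overlap with the reduction.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        grad.mul_(1.0)  # placeholder for the NCCL reduction call

def launch_reduction_blocking(grad: torch.Tensor):
    # Blocking variant: synchronize() stalls the host thread until the compute
    # stream drains, so any overlap with subsequent compute is lost.
    torch.cuda.current_stream().synchronize()
    with torch.cuda.stream(comm_stream):
        grad.mul_(1.0)  # placeholder for the NCCL reduction call
```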

Hi @xxtars, we noticed this accuracy issue in 14.0 (some of our users also fell back to 12.3) and made several accuracy fixes later on. Could you try...

Hi, currently the ZeRO++ feature does not support bf16 quantization; I suppose that is the root cause of this issue. To fix it, you can **either** use `fp16` as the dtype...
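
As a rough illustration of the `fp16` route, the relevant part of a DeepSpeed config could look like the sketch below. This is only a sketch under my assumptions: the ZeRO++ key names follow the public ZeRO++ tutorial, and the concrete values (stage, partition size, batch size) are placeholders to adapt to your setup.

```python
# Sketch: DeepSpeed config using fp16 (instead of bf16) together with ZeRO++ quantization.
# Key names assumed from the ZeRO++ tutorial; values are illustrative only.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},              # fp16 dtype instead of bf16
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,     # ZeRO++ quantized weights (qwZ)
        "zero_hpz_partition_size": 8,       # ZeRO++ hierarchical partitioning (hpZ)
        "zero_quantized_gradients": True,   # ZeRO++ quantized gradients (qgZ)
    },
}

# model and optimizer are assumed to be defined elsewhere:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, optimizer=optimizer, config=ds_config)
```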

Hi @HeyangQin, I believe we don't support bf16 training, which causes the `RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half`
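
For reference, that class of error is simply PyTorch refusing to multiply matrices whose operands have different dtypes; a minimal standalone reproduction (unrelated to the DeepSpeed internals, and with exact wording depending on device/backend) looks like this:

```python
# Minimal sketch reproducing the dtype-mismatch error class: mixing a float32
# operand with a float16 ("c10::Half") operand in a matmul.
import torch

a = torch.randn(4, 4, dtype=torch.float32)   # "float"
b = torch.randn(4, 4, dtype=torch.float16)   # "c10::Half"

try:
    a @ b
except RuntimeError as e:
    # e.g. "... mat1 and mat2 ... same dtype ..." (exact message varies by backend)
    print(e)
```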