Guanhua Wang

21 comments of Guanhua Wang

> Hi all, I've got some time to circle back to this. I'm hoping someone on the team can take a look and provide some feedback when they get a...

Hi @FahriBilici, thanks for raising this issue. To reproduce the error, could you also provide the training script you ran and the command line (e.g., whether you used deepspeed...

Hi @deepcharm, thanks for the PR. Just curious, why would allreduce be faster than allgather? Allreduce is basically reduce-scatter + all-gather. Could we just make allgather a coalesced version...
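
For context on that last point, here is a minimal sketch (illustrative only, not DeepSpeed code) of the equivalence being assumed: an allreduce over a flat tensor gives the same result as a reduce-scatter followed by an all-gather. It uses standard `torch.distributed` collectives and assumes the process group is already initialized and the tensor length is divisible by the world size.

```python
# Illustrative sketch: allreduce decomposed as reduce-scatter + all-gather.
# Assumes torch.distributed is initialized and x.numel() % world_size == 0.
import torch
import torch.distributed as dist

def allreduce_via_rs_ag(x: torch.Tensor) -> torch.Tensor:
    """Reproduce the effect of dist.all_reduce(x) in two steps."""
    world_size = dist.get_world_size()
    shard = torch.empty(x.numel() // world_size, dtype=x.dtype, device=x.device)
    # Step 1: each rank receives the reduced (summed) values of its own shard.
    dist.reduce_scatter_tensor(shard, x)
    out = torch.empty_like(x)
    # Step 2: the reduced shards are gathered back so every rank holds the full result.
    dist.all_gather_into_tensor(out, shard)
    return out  # matches what dist.all_reduce(x) would leave in x
```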

> > Hi @deepcharm > > Thx for the PR. Just curious why allreduce could be faster than allgather? allreduce basically is doing reduce-scatter + all-gather. Could we just make...

Hi @Suparjie, thanks for raising this issue. I believe the reduction stream and the default compute stream are not synchronized properly. By adding your `stream.wait_stream(get_accelerator().current_stream())` above, I am wondering...

Sorry, I don't think this is the correct fix. Forcing `stream.synchronize()` means the corresponding NCCL call becomes a blocking call and will not overlap with subsequent compute.
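
The difference between the two approaches can be illustrated with plain PyTorch CUDA streams (a hypothetical sketch, not the actual DeepSpeed reduction-stream code): `wait_stream` only inserts an ordering dependency between streams and keeps the host non-blocking, whereas `synchronize()` stalls the host until the stream's work has drained, which removes the overlap between communication and the next compute kernels.

```python
# Hypothetical sketch of the two synchronization styles discussed above
# (plain PyTorch streams; the real reduction kernel is replaced by a placeholder).
import torch

comm_stream = torch.cuda.Stream()

def launch_reduction_overlapped(grad: torch.Tensor):
    # Non-blocking ordering: comm_stream waits for work already queued on the
    # current compute stream, then the collective is enqueued. The host returns
    # immediately, so later compute kernels can still overlap with the reduction.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        grad.mul_(1.0)  # placeholder for the NCCL reduction call

def launch_reduction_blocking(grad: torch.Tensor):
    # Blocking variant: synchronize() stalls the host thread until the compute
    # stream drains, so any overlap with subsequent compute is lost.
    torch.cuda.current_stream().synchronize()
    with torch.cuda.stream(comm_stream):
        grad.mul_(1.0)  # placeholder for the NCCL reduction call
```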

Hi @xxtars, we noticed this accuracy issue in 14.0 (some of our users also fell back to 12.3) and made several accuracy fixes later on. Could you try...

Hi, currently the ZeRO++ feature does not support bf16 quantization; I suppose that is the root cause of this issue. To fix it, you can **either** use `fp16` as the dtype...
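
As a rough illustration of the `fp16` route, the relevant part of a DeepSpeed config could look like the sketch below. This is only a sketch under my assumptions: the ZeRO++ key names follow the public ZeRO++ tutorial, and the concrete values (stage, partition size, batch size) are placeholders to adapt to your setup.

```python
# Sketch: DeepSpeed config using fp16 (instead of bf16) together with ZeRO++ quantization.
# Key names assumed from the ZeRO++ tutorial; values are illustrative only.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},              # fp16 dtype instead of bf16
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,     # ZeRO++ quantized weights (qwZ)
        "zero_hpz_partition_size": 8,       # ZeRO++ hierarchical partitioning (hpZ)
        "zero_quantized_gradients": True,   # ZeRO++ quantized gradients (qgZ)
    },
}

# model and optimizer are assumed to be defined elsewhere:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, optimizer=optimizer, config=ds_config)
```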

Hi @HeyangQin, I believe we don't support bf16 training, which causes the `RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half`
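
For reference, that class of error is simply PyTorch refusing to multiply matrices whose operands have different dtypes; a minimal standalone reproduction (unrelated to the DeepSpeed internals, and with exact wording depending on device/backend) looks like this:

```python
# Minimal sketch reproducing the dtype-mismatch error class: mixing a float32
# operand with a float16 ("c10::Half") operand in a matmul.
import torch

a = torch.randn(4, 4, dtype=torch.float32)   # "float"
b = torch.randn(4, 4, dtype=torch.float16)   # "c10::Half"

try:
    a @ b
except RuntimeError as e:
    # e.g. "... mat1 and mat2 ... same dtype ..." (exact message varies by backend)
    print(e)
```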