Guanhua Wang
> Hi all, I've got some time to circle back to this. I'm hoping someone on the team can take a look and provide some feedback when they get a...
Hi @FahriBilici, thanks for raising this issue. To reproduce the error, could you also provide the training script you ran and the command line (e.g., either use deepspeed...
Closed for now, feel free to reopen if needed.
Hi @deepcharm, thanks for the PR. Just curious why allreduce would be faster than allgather? Allreduce is basically a reduce-scatter followed by an all-gather. Could we just make allgather a coalesced version...
> Hi @deepcharm, thanks for the PR. Just curious why allreduce would be faster than allgather? Allreduce is basically a reduce-scatter followed by an all-gather. Could we just make...
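For context on the question above, here is a minimal sketch (not the PR's code; the script layout and `torchrun` launch are assumptions) showing that an allreduce yields the same result as a reduce-scatter followed by an all-gather, which is why I'd expect their costs to be comparable:

```python
import torch
import torch.distributed as dist

def main():
    # NCCL backend, one GPU per rank; launch with e.g. `torchrun --nproc_per_node=2 demo.py`.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device("cuda", rank)
    torch.cuda.set_device(device)

    shard_len = 4
    # Per-rank input; total length must be divisible by world_size for the scatter.
    x = torch.arange(shard_len * world_size, dtype=torch.float32, device=device) + rank

    # Path 1: a single allreduce.
    allreduced = x.clone()
    dist.all_reduce(allreduced, op=dist.ReduceOp.SUM)

    # Path 2: reduce-scatter the shards, then all-gather them back.
    shard = torch.empty(shard_len, dtype=torch.float32, device=device)
    dist.reduce_scatter_tensor(shard, x.clone(), op=dist.ReduceOp.SUM)
    regathered = torch.empty_like(x)
    dist.all_gather_into_tensor(regathered, shard)

    # Both paths produce the same tensor on every rank.
    assert torch.allclose(allreduced, regathered)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If allreduce really is faster in this case, it is presumably from the fused/coalesced launch rather than from moving less data.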
Hi @Suparjie, thanks for raising this issue. I believe the reduction stream and the default compute stream are not synchronized properly. By adding your `stream.wait_stream(get_accelerator().current_stream())` above, I am wondering...
Sorry, I don't think that is the correct fix. Forcing `stream.synchronize()` means the corresponding NCCL call becomes a blocking call and no longer overlaps with subsequent compute.
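To illustrate the ordering I have in mind, here is a minimal sketch (assumed function and variable names, not DeepSpeed's actual code): the reduction stream waits on the compute stream via `wait_stream`, so the collective is ordered after the producing kernels without a host-side `synchronize()` that would block subsequent compute:

```python
import torch
import torch.distributed as dist

def launch_reduction(grad: torch.Tensor, reduction_stream: torch.cuda.Stream):
    # Order the reduction stream after whatever produced `grad` on the
    # current (compute) stream; this is a device-side dependency only.
    reduction_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(reduction_stream):
        dist.all_reduce(grad)                  # NCCL call enqueued on the side stream
        grad.record_stream(reduction_stream)   # keep `grad` alive for async use
    # Intentionally no reduction_stream.synchronize() here: the default
    # stream keeps launching compute kernels that overlap with the
    # reduction; synchronize only where the reduced values are consumed.
```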
Hi @xxtars, we noticed this accuracy issue in 14.0 (some of our users also fell back to 12.3) and made several accuracy fixes later on. Could you try...
Hi, currently the ZeRO++ feature does not support bf16 quantization; I suppose that is the root cause of this issue. To fix it, you can **either** use `fp16` as the dtype...
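For the fp16 route, a minimal sketch of the relevant config pieces (a plain Python dict passed to `deepspeed.initialize`; the ZeRO++ keys follow the ZeRO++ tutorial, and the partition/batch sizes are placeholders, so please check them against your DeepSpeed version):

```python
# Minimal sketch of a ZeRO++ config that keeps quantization on fp16.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},               # use fp16 instead of bf16 with ZeRO++ quantization
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,      # qwZ: quantized weight all-gather
        "zero_hpz_partition_size": 8,        # hpZ: secondary partition group size
        "zero_quantized_gradients": True,    # qgZ: quantized gradient reduce
    },
}

# model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config, ...)
```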
Hi @HeyangQin, I believe we don't support bf16 training, which causes the `RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half`