yanwj21

3 comments by yanwj21

I also hit an NCCL timeout error when fine-tuning an MoE model.

> I think it's related to multinode fine-tuning. cc [@John-Ge](https://github.com/John-Ge). When I changed my GPU from PPU to A100 (the environment may also have changed), the problem seems solved, but...

> We’ve encountered this issue before in a multinode setup. If it happens again, please feel free to report it—we’ll investigate thoroughly. Forgot to mention that I didn't adopt a...