3 comments by yanwj21
I also hit an NCCL timeout error when fine-tuning MoE.
> I think it's related to multinode fine-tuning. cc [@John-Ge](https://github.com/John-Ge). When I changed my GPU from PPU to A100 (the environment may also have changed), the problem seemed to be solved, but...
> We've encountered this issue before in a multinode setup. If it happens again, please feel free to report it, and we'll investigate thoroughly. I forgot to mention that I didn't adopt a...