Swin-Transformer
About nsys profiling analysis using Swin-MoE
I modified some code to feed fake inputs into Swin-MoE training and exported an nsys profile. What confuses me is why there are 7 allreduce operations in the backward pass per step. Can somebody tell me why? Thanks very much!

I used 8 experts and 8 GPUs on one node for this training run.