Qianshaowei
Hi, thank you for your excellent work! In my experiments, I can't reproduce the results reported in the paper using the provided code. The table shows my experimental results. The...
Hi, I'm using two machines with 8 GPUs each, one expert per worker, and top_k=1:

```python
model = FMoETransformerMLP(num_expert=1, d_model=d_model, d_hidden=d_model,
                           world_size=torch.distributed.get_world_size(), top_k=1)
```

Training pseudocode:

```python
backbone_ddp = fmoe.DistributedGroupedDataParallel(model, device_ids)
....
....
backbone_ddp.allreduce_params()
optm.step()
```

That should give 16*1 experts running in parallel, right? However, I get the following error:

```
File "/usr/local/python3.7.1/lib/python3.7/site-packages/fastmoe-1.1.0-py3.7-linux-x86_64.egg/fmoe/gates/naive_gate.py", line 33, in forward
    gate, k=self.top_k, dim=-1, largest=True, sorted=False
RuntimeError: invalid argument 5:...
```
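For context, here is a minimal, self-contained sketch of the setup described in this issue: 2 nodes × 8 GPUs (world_size = 16), one expert per worker, and top_k = 1, giving 16 experts in total. The model sizes, dummy data, and loss are placeholders, and the argument names follow the pseudocode above; verify them against your installed fastmoe version.

```python
import os
import torch
import torch.distributed as dist
import fmoe
from fmoe import FMoETransformerMLP

# Launched via torchrun / torch.distributed.launch, one process per GPU.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

d_model, d_hidden = 512, 2048                    # hypothetical sizes

model = FMoETransformerMLP(
    num_expert=1,                                # one expert per worker
    d_model=d_model,
    d_hidden=d_hidden,
    world_size=dist.get_world_size(),            # 16 in the reported setup
    top_k=1,
).cuda()

# fastmoe's data-parallel wrapper, as used in the issue's pseudocode.
backbone_ddp = fmoe.DistributedGroupedDataParallel(model)
optm = torch.optim.Adam(backbone_ddp.parameters(), lr=1e-4)

for step in range(10):                           # dummy training loop
    x = torch.randn(8, 128, d_model, device="cuda")
    out = backbone_ddp(x)
    loss = out.float().pow(2).mean()             # placeholder loss
    optm.zero_grad()
    loss.backward()
    backbone_ddp.allreduce_params()              # manual gradient sync, as in the issue
    optm.step()
```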
Hi, I'm unable to run the MoE sample (test_moe_top.py). The error message is as follows: ``` 2024-04-26 15:13:06,594 - __main__ - INFO - Training MoE Examples on HETU libibverbs: Warning:...