
How to set up data-parallel (DP) MoE

zhaozheng09 opened this issue 4 months ago · 1 comment

I want to use 12 experts and select the top 4 per GPU. When I set parallel_type == 1, I see all-to-all (a2a) in the timeline. When I set parallel_type == 0, I see allgather in the timeline.

I only want data-parallel MoE, with experts kept local to each GPU.

        import torch
        from tutel import moe as tutel_moe

        # 12 experts per GPU, top-4 routing, experts kept local (data parallel)
        self.ff_out = tutel_moe.moe_layer(
            gate_type={'type': 'top', 'k': 4},
            model_dim=512,
            experts={
                'num_experts_per_device': 12,
                'type': 'ffn', 'hidden_size_per_expert': 2048,
                'activation_fn': lambda x: torch.nn.functional.relu(x),
            },
            parallel_type='data',
            # tag expert parameters so they can be excluded from gradient all-reduce
            scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
        )
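
For reference, the skip_allreduce tag set by scan_expert_func only takes effect if the training loop honors it. Below is a minimal sketch of that pattern; model, loss, and optimizer are hypothetical names, and a torch.distributed process group is assumed to be initialized.

    import torch
    import torch.distributed as dist

    def allreduce_non_expert_grads(model):
        # Average gradients of shared (non-expert) parameters across ranks;
        # parameters tagged with skip_allreduce (the expert weights) stay
        # local, so each GPU keeps training its own experts.
        world_size = dist.get_world_size()
        for param in model.parameters():
            if param.grad is None or getattr(param, 'skip_allreduce', False):
                continue
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)

    # Typical use in the training loop (hypothetical names):
    #   loss.backward()
    #   allreduce_non_expert_grads(model)
    #   optimizer.step()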

zhaozheng09 · Aug 24 '25 12:08

Hello, DP is parallel_type == 0, which uses all_gather for ZeRO-2. This type is usually slower, especially when the expert parameters are larger than the activation sizes.
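
To make that trade-off concrete, here is a hedged sketch that reuses the constructor arguments from the snippet above. It assumes a distributed process group is already initialized and that 'model' is the string counterpart of parallel_type == 1; treat it as illustration rather than authoritative API documentation.

    import torch
    from tutel import moe as tutel_moe

    common = dict(
        gate_type={'type': 'top', 'k': 4},
        model_dim=512,
        experts={
            'num_experts_per_device': 12,
            'type': 'ffn', 'hidden_size_per_expert': 2048,
            'activation_fn': lambda x: torch.nn.functional.relu(x),
        },
        scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
    )

    # parallel_type == 0 ('data'): experts stay local to each GPU and their
    # parameters are all_gather'ed ZeRO-2 style; usually slower once expert
    # parameters outgrow activation sizes.
    dp_moe = tutel_moe.moe_layer(parallel_type='data', **common)

    # parallel_type == 1 ('model'): tokens are exchanged via all-to-all (a2a)
    # instead, which is typically cheaper when experts are large.
    ep_moe = tutel_moe.moe_layer(parallel_type='model', **common)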

ghostplant · Aug 26 '25 07:08