megablocks Routing

Routing

Open alexliap opened this issue 7 months ago • 1 comments

Does the router implement the noisy top-k routing proposed in the paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer"?

In the router code, you seem to apply the noise to the router's input rather than to the router scores, as the paper above does:

    def forward(self, x):
        # During training, scale the router input elementwise by jitter noise.
        if self.training and self.args.moe_jitter_eps is not None:
            x = x * self.jitter(x)

        # Flatten to (tokens, hidden), then softmax over the expert dimension.
        scores = self.layer(x.view(-1, x.shape[-1])).softmax(dim=-1)
        expert_weights, expert_indices = self._top_k(scores)
        if self.args.moe_normalize_expert_weights:
            # Renormalize the selected top-k weights by their p-norm.
            expert_weights = expert_weights / torch.norm(
                expert_weights, p=self.args.moe_normalize_expert_weights, dim=-1, keepdim=True)

        expert_indices = (
            _uniform_expert_assignment(expert_indices, self.args.moe_num_experts)
            if self.args.uniform_expert_assignment else expert_indices
        )
        return scores, expert_weights, expert_indices
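
A minimal single-token sketch of the flow in the excerpt above, written in plain Python rather than PyTorch so it runs on its own. The `jitter` helper is an assumption: the excerpt does not show what `self.jitter` returns, so the sketch supposes one multiplicative factor per input element, drawn uniformly from [1 - eps, 1 + eps]. The names `route_with_input_jitter`, `w`, and `rng` are hypothetical and exist only for illustration.

```python
import math
import random


def jitter(x, eps, rng):
    # ASSUMPTION: one multiplicative factor per input element,
    # drawn uniformly from [1 - eps, 1 + eps]. Not taken from the excerpt.
    return [rng.uniform(1.0 - eps, 1.0 + eps) for _ in x]


def softmax(z):
    # Subtract the max for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def route_with_input_jitter(x, w, k, eps=None, rng=None):
    """Route one token: optional input jitter, then linear -> softmax -> top-k.

    x: list of d_model floats; w: d_model x num_experts matrix (list of rows).
    """
    if eps is not None:
        noise = jitter(x, eps, rng)
        # The noise enters here, before the linear layer.
        x = [xi * ni for xi, ni in zip(x, noise)]
    num_experts = len(w[0])
    logits = [sum(x[d] * w[d][i] for d in range(len(x))) for i in range(num_experts)]
    # Softmax runs over all experts; no noise is added to the logits or scores.
    scores = softmax(logits)
    top = sorted(range(num_experts), key=lambda i: scores[i], reverse=True)[:k]
    return scores, [scores[i] for i in top], top
```

With `eps=None` the function is deterministic, so any run-to-run variation in the selected experts comes only from the perturbed input.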

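In the aforementioned paper, noisy top-k gating works like this:

    G(x) = Softmax(KeepTopK(H(x), k))
    H(x)_i = (x · W_g)_i + StandardNormal() · Softplus((x · W_noise)_i)
    KeepTopK(v, k)_i = v_i if v_i is among the top k elements of v, otherwise -∞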
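
As a rough single-token sketch of the paper's noisy top-k gating, again in plain Python rather than PyTorch: Gaussian noise, scaled by a learned per-expert Softplus term, is added to the gate logits; only the k largest noisy logits are kept; and the softmax is taken over those k. The function and argument names (`noisy_top_k_gating`, `w_gate`, `w_noise`, `rng`) are hypothetical, and the `noisy` switch is an added convenience, not something the paper prescribes.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_top_k_gating(x, w_gate, w_noise, k, rng, noisy=True):
    """Gate values for one token under noisy top-k gating.

    x: list of d_model floats.
    w_gate, w_noise: d_model x num_experts matrices (lists of rows).
    Returns (gates, top): gates has one entry per expert, zero outside top.
    """
    dims = range(len(x))
    num_experts = len(w_gate[0])
    # Clean logits: (x . W_g)_i
    clean = [sum(x[d] * w_gate[d][i] for d in dims) for i in range(num_experts)]
    if noisy:
        # Per-expert noise scale: Softplus((x . W_noise)_i)
        scale = [softplus(sum(x[d] * w_noise[d][i] for d in dims))
                 for i in range(num_experts)]
        # The noise is added to the logits, not to the input x.
        h = [c + rng.gauss(0.0, 1.0) * s for c, s in zip(clean, scale)]
    else:
        h = clean  # ASSUMPTION: optional noise-free path for comparison
    # KeepTopK: keep the k largest logits; the rest act as -inf (gate of 0).
    top = sorted(range(num_experts), key=lambda i: h[i], reverse=True)[:k]
    m = max(h[i] for i in top)
    exps = {i: math.exp(h[i] - m) for i in top}
    total = sum(exps.values())
    gates = [exps[i] / total if i in exps else 0.0 for i in range(num_experts)]
    return gates, top
```

Here the softmax is taken after the top-k selection, so the k nonzero gates sum to 1 by construction.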

Is this something equivalent? I am not trying to argue that it is wrong; I was just trying to figure out whether it is the same.

Jun 27 '24 19:06 alexliap
