fastmoe
During inference, the output of the noisy gate is NaN.
The training process proceeds smoothly; however, an issue arises during inference as the noise_stddev becomes zero when self.training is False, leading to an error when computing the load. Should we refrain from adding noise in the NoisyGate during inference?
@Sengxian Can you please shed some light on why we are multiplying the noise with self.training here?
I suppose it should be raw_noise * training + eps instead of (raw_noise + eps) * training.
Do I understand your suggestion correctly? That is:
noise_stddev = self.softplus(raw_noise_stddev) * self.training + self.noise_epsilon
Yes, I think that can help fix your NaN issue. But as I am not an algorithm person, I am not sure whether this is how the noisy gate is expected to behave during inference.
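For anyone comparing the two forms discussed above, here is a minimal sketch in plain Python scalars rather than tensors. It is not the FastMoE code: the function names and the epsilon value of 1e-2 are assumptions for illustration. It only shows that the first form collapses to exactly zero when training is False, while the second keeps a floor of eps. Any later step that divides by the standard deviation, which is presumably what the load computation does, would then be dividing by zero in the first case.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def stddev_original(raw: float, training: bool, eps: float = 1e-2) -> float:
    # Form questioned in the thread: (softplus(raw) + eps) * training.
    # With training=False the whole expression, eps included, becomes 0.0.
    return (softplus(raw) + eps) * training


def stddev_proposed(raw: float, training: bool, eps: float = 1e-2) -> float:
    # Form proposed in the thread: softplus(raw) * training + eps.
    # With training=False the learned part is dropped but eps remains.
    return softplus(raw) * training + eps


raw = 0.5  # hypothetical raw (pre-softplus) noise value
print("original, eval :", stddev_original(raw, training=False))
print("proposed, eval :", stddev_proposed(raw, training=False))
print("original, train:", stddev_original(raw, training=True))
print("proposed, train:", stddev_proposed(raw, training=True))
```

In training mode both forms give the same value, so the change only affects inference. Whether a small constant noise scale at inference is the intended behaviour of the gate is a separate question that this sketch does not settle.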
Thank you for your help