nbeatsx
[Question] SELU weights and dropout
Hi,
My name is Pablo Navarro. Your team and I have already exchanged a few emails about the wonderful paper you've written. Thanks again for the contribution.
Now that the code is released, I have a couple of questions about the implementation of the SELU activation function.
Weight init
For SELU, you force lecun_normal, which in turn is just a pass in the init_weights() function:
import torch as t

def init_weights(module, initialization):
if type(module) == t.nn.Linear:
if initialization == 'orthogonal':
t.nn.init.orthogonal_(module.weight)
elif initialization == 'he_uniform':
t.nn.init.kaiming_uniform_(module.weight)
elif initialization == 'he_normal':
t.nn.init.kaiming_normal_(module.weight)
elif initialization == 'glorot_uniform':
t.nn.init.xavier_uniform_(module.weight)
elif initialization == 'glorot_normal':
t.nn.init.xavier_normal_(module.weight)
elif initialization == 'lecun_normal':
pass
else:
assert 1<0, f'Initialization {initialization} not found'
How come the weights end up initialized as lecun_normal when that branch simply passes? On my machine, PyTorch's default initialization for nn.Linear is a (Kaiming) uniform scheme, not a normal one.
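For context, here is a minimal sketch of what an explicit LeCun-normal init could look like; the lecun_normal_ helper below is hypothetical (not code from your repo), it just draws from a zero-mean normal with std = 1/sqrt(fan_in), as recommended for SELU networks:

import math
import torch as t

def lecun_normal_(module):
    # Hypothetical helper: LeCun normal = N(0, 1/fan_in) on the weights,
    # the scaling the SELU paper assumes for self-normalization.
    if isinstance(module, t.nn.Linear):
        fan_in = module.weight.shape[1]
        t.nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(1.0 / fan_in))
        if module.bias is not None:
            t.nn.init.zeros_(module.bias)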
DropOut on SELU
I believe that in order to make SELU useful, you need to use AlphaDropout() instead of regular Dropout() layers (PyTorch docs). I can't find anything wrapping AlphaDropout() in your code. Can you point me in the right direction or explain the rationale behind it?
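For illustration, this is the kind of SELU block I had in mind; the layer sizes and dropout rate here are assumptions for the sketch, not values from your code:

import torch as t

# AlphaDropout keeps the zero-mean / unit-variance property that SELU
# relies on, which regular Dropout would break.
selu_block = t.nn.Sequential(
    t.nn.Linear(64, 64),
    t.nn.SELU(),
    t.nn.AlphaDropout(p=0.1),
)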
Cheers and keep up the good work!
Dropout and AlphaDropout on SELU
Thanks for the comments. As you mention, the paper on scaled exponential linear units (https://arxiv.org/abs/1706.02515, page 6) recommends not using standard dropout, as the extra variance hinders convergence when using normalization. We observed some convergence issues when exploring the hyperparameter space, although with optimal model configurations the training procedure was stable.
One thing to keep in mind is that the two best regularization techniques we found in our experiments are early stopping and, second, ensembling. Since ensembling boosts accuracy through the diversity and variance of models, the interaction of AlphaDropout with the ensemble might be interesting to explore. Still, we will try AlphaDropout regularization to test the SELU paper's recommendation in this regression setting.
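As a toy illustration of the ensembling idea, assuming a simple median combination over forecasts from independently trained models (this is just a sketch, not the exact scheme from the paper):

import torch as t

def median_ensemble(forecasts):
    # forecasts: list of [horizon] tensors, one per independently trained model.
    # The median across models combines their diversity while being robust
    # to individual outlier runs.
    return t.stack(forecasts, dim=0).median(dim=0).values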