Initialization of last layer to zero
Guys, I just remembered a trick that we used to use in Kaldi to help models converge early on, and I tried it on a setup that was not converging well; it had a huge effect. I want to remind you of this (I don't have time to try it on one of our standard setups just now). It's just to set the last layer's parameters to zero:
def __init__(self):
    <snip>
    # 1x1 convolution producing per-frame class scores.
    self.final_conv1d = nn.Conv1d(dim, num_classes, stride=1, kernel_size=1, bias=True)
    self.reset_parameters()

def reset_parameters(self):
    # The trick: start the final layer at exactly zero.
    torch.nn.init.constant_(self.final_conv1d.weight, 0.)
    torch.nn.init.constant_(self.final_conv1d.bias, 0.)
Mm, on the master branch with the transformer, this gives an OOM error. We need some code in LFMmiLoss to conditionally prune the lattices more if they are too large. @csukuangfj can you point me to any code that does this?
@danpovey Please see https://github.com/k2-fsa/snowfall/blob/ed4c74a210e005d8ed9e767a96b70b79271ab002/snowfall/decoding/lm_rescore.py#L262-L281
It is from #147
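For reference, the kind of conditional pruning being discussed looks roughly like the sketch below. This is an illustrative sketch, not the code linked above; the function name prune_if_too_large, the max_arcs limit, and the threshold schedule are made up here, and it assumes k2.prune_on_arc_post is available.

import k2

def prune_if_too_large(lattice: k2.Fsa,
                       max_arcs: int = 10_000_000,
                       thresholds=(1e-10, 1e-9, 1e-8, 1e-7, 1e-6)) -> k2.Fsa:
    # Prune with increasingly aggressive arc-posterior thresholds
    # until the lattice is small enough.
    for th in thresholds:
        if lattice.num_arcs <= max_arcs:
            break
        # Drop arcs whose posterior probability is below `th`.
        lattice = k2.prune_on_arc_post(lattice, th, use_double_scores=True)
    return lattice

The loop only prunes when the lattice is actually over the size limit, so small lattices pass through unchanged.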
That's a cool trick. Why does it work?
Mm, actually in snowfall, now that I test it properly, it's not clear that it's working. It's OK to leave one layer initialized to zero, though: the derivs will still be nonzero.
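To illustrate the point about the derivs (a small check added here, not from the thread): even when the final layer is exactly zero, the gradient of a cross-entropy loss with respect to its weights is nonzero as long as its inputs are nonzero, so the layer can still move away from zero. The shapes and targets below are arbitrary.

import torch
import torch.nn as nn
import torch.nn.functional as F

final = nn.Linear(4, 3)        # stand-in for the zero-initialized last layer
nn.init.zeros_(final.weight)
nn.init.zeros_(final.bias)

x = torch.randn(2, 4)          # pretend these are hidden activations
logits = final(x)              # all zeros -> uniform output distribution
loss = F.cross_entropy(logits, torch.tensor([0, 2]))
loss.backward()

print(final.weight.grad.abs().sum())  # nonzero, so the layer can still learn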