transformer-xl
Sensitivity to initial weights causing NaNs?
Hi, I'm getting NaN values in the first forward pass of the model (in the first layer), generally produced by the first AC calculation. Could this be caused by the model's initial weights? If so, do you have any advice for avoiding it? I have made some changes to the model, so knowing whether this is a known issue would help me determine whether I have introduced a bug. Thanks.
This seldom happens, and with the given hyper-parameters it should not happen at all. However, in my experience it can occur with low probability when div_val > 1, which reduces the word embedding dimensionality by a factor of div_val for infrequent words. If you run into this, try setting div_val = 1, or use smaller initial weights by decreasing init_range or init_std. Hope this helps.
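To check whether the NaNs come from initialization rather than from a code change, it may help to locate the first module that produces a non-finite output and then retry with a smaller initial standard deviation. The following is a minimal sketch, assuming a PyTorch model. The helper names `find_first_nonfinite_module` and `reinit_normal` are hypothetical and are not part of the transformer-xl code, and the re-initialization here only approximates the effect of lowering init_std (it does not cover div_val).

```python
import torch
import torch.nn as nn


def find_first_nonfinite_module(model, *inputs):
    """Run one forward pass and return the name of the first module whose
    output contains NaN or Inf, or None if every output is finite.

    Hypothetical debugging helper, not part of the transformer-xl repo.
    """
    first_bad = []
    handles = []

    def make_hook(name):
        def hook(module, hook_inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outputs:
                if torch.is_tensor(out) and not torch.isfinite(out).all():
                    # Record only the earliest offender; later modules just
                    # propagate the same non-finite values.
                    if not first_bad:
                        first_bad.append(name)
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name or "<root>")))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        # Always detach the hooks so the model is left unchanged.
        for handle in handles:
            handle.remove()
    return first_bad[0] if first_bad else None


def reinit_normal(model, init_std=0.01):
    """Re-draw Linear and Embedding weights from N(0, init_std) and zero
    any biases. Assumption: a plain normal init stands in for whatever
    scheme the training script actually uses."""
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=init_std)
            if getattr(module, "bias", None) is not None:
                nn.init.zeros_(module.bias)
```

If the reported module is always the same one and the problem disappears after re-initializing with a smaller init_std, initialization is the likely cause. If the NaNs persist regardless of the initial scale, the modifications to the model are the more likely source.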