nnue-pytorch
Detect vanishing gradients
It might be desirable to monitor/detect vanishing gradients during training. Of course I mean the stochastic gradient here, as estimated from the training samples used in the current epoch (the current batch size may be too small to excite all king/piece positions, so preferably track the mean or max absolute gradient over a window of multiple epochs).
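Something along these lines could work as a first pass (a minimal sketch, not taken from this repo; the GradientMonitor class, the window size and the threshold are made up for illustration):

```python
# Hypothetical sketch: log per-parameter gradient statistics after backward().
# Assumes a plain PyTorch training loop; `model`, window size and threshold
# are placeholders, not part of the nnue-pytorch codebase.
from collections import defaultdict, deque

import torch

class GradientMonitor:
    """Tracks mean/max |grad| per parameter over a sliding window of steps."""

    def __init__(self, window_size=1000, threshold=1e-7):
        self.threshold = threshold
        self.history = defaultdict(lambda: deque(maxlen=window_size))

    @torch.no_grad()
    def update(self, model):
        # Call right after loss.backward(), before optimizer.step().
        for name, param in model.named_parameters():
            if param.grad is None:
                continue
            g = param.grad.abs()
            self.history[name].append((g.mean().item(), g.max().item()))

    def report(self):
        # Flag parameters whose max |grad| stayed below the threshold
        # for the whole window -- a sign of vanishing/dead gradients.
        for name, stats in self.history.items():
            window_max = max(m for _, m in stats)
            if window_max < self.threshold:
                print(f"possible vanishing gradient in {name}: "
                      f"max |grad| over last {len(stats)} steps = {window_max:.3e}")
```

One would call monitor.update(model) after each backward pass and monitor.report() at the end of every epoch (or of a window of epochs).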
This would have detected the anomalies in the input layer (dead weights for some king positions) in vondele's run84run3, see #53.
Note that with GC (gradient centralization) we cannot resort to simply inspecting the difference of two checkpoints, because the centralized gradient of each weight by definition contains a contribution from the mean gradient of the weight vector it belongs to (see equation (1) of https://arxiv.org/pdf/2004.01461v2), so even a weight whose own raw gradient is zero will still move.
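For reference, equation (1) of the paper, as I read it (here $w_i$ is a single weight vector of a layer, with $M$ components):

$$
\Phi_{GC}(\nabla_{w_i}\mathcal{L}) = \nabla_{w_i}\mathcal{L} - \mu_{\nabla_{w_i}\mathcal{L}},
\qquad
\mu_{\nabla_{w_i}\mathcal{L}} = \frac{1}{M}\sum_{j=1}^{M} \nabla_{w_{i,j}}\mathcal{L}
$$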
As a "work-around", continued training without GC (use_gc=False
in Ranger) on a checkpoint and then comparing/visualizing the difference between a later checkpoint should also do the trick I think.
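For the comparison itself, something like the following would do (a sketch; it assumes Lightning-style .ckpt files with a "state_dict" entry, and the file names in the example are placeholders):

```python
# Hypothetical sketch: compare two checkpoints saved before/after a stretch of
# training with use_gc=False, and flag weights that did not move at all.
import torch

def load_state_dict(path):
    # Lightning checkpoints store the model under "state_dict";
    # a plain torch.save()'d state dict is used as-is.
    ckpt = torch.load(path, map_location="cpu")
    return ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

def diff_checkpoints(path_a, path_b, atol=1e-9):
    a, b = load_state_dict(path_a), load_state_dict(path_b)
    for name, tensor_a in a.items():
        if name not in b or not torch.is_floating_point(tensor_a):
            continue
        delta = (b[name] - tensor_a).abs()
        frozen = (delta <= atol).float().mean().item()
        print(f"{name}: max |delta| = {delta.max().item():.3e}, "
              f"fraction unchanged = {frozen:.1%}")

# Example (paths are placeholders):
# diff_checkpoints("epoch=100.ckpt", "epoch=110.ckpt")
```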
See also https://discuss.pytorch.org/t/how-to-check-for-vanishing-exploding-gradients/9019