
Detect vanishing gradients


It might be desirable to monitor/detect vanishing gradients during training. Note that I of course mean the "stochastic gradient" here, as estimated from the training samples of the current batch/epoch. Since a single batch may be too small to excite all king/piece positions, it is preferable to track the mean or max absolute gradient over a window spanning multiple epochs.
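For reference, a minimal sketch of what such monitoring could look like in PyTorch (the helper names `grad_stats`/`window_max` and the placement in the training loop are only illustrative):

```python
import torch

def grad_stats(model):
    """Per-parameter max |grad| of the current (stochastic) gradient."""
    stats = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            stats[name] = p.grad.detach().abs().max().item()
    return stats

# Running max over a window of batches/epochs, so that rarely excited
# king/piece inputs still show up eventually.
window_max = {}

def update_window(stats):
    for name, v in stats.items():
        window_max[name] = max(window_max.get(name, 0.0), v)

# Call grad_stats(model) after loss.backward() and feed it to update_window().
# Entries of window_max that stay (near) zero over many epochs point to weights
# that never receive a gradient, i.e. candidates for "dead" weights.
```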

This would have detected the anomalies in the input layer (dead weights for some king positions) in vondele's run84run3, see #53.

Note that with GC (gradient centralization) we cannot simply inspect the difference between two checkpoints, because the centralized gradient by definition contains a contribution equal to the mean of the gradient vectors over all neurons of a layer (see equation (1) of https://arxiv.org/pdf/2004.01461v2), so even "dead" weights keep moving between checkpoints.
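A toy illustration of why the checkpoint diff is misleading under GC (the axis of the mean below follows what I believe is the Ranger-style GC step; the exact details may differ):

```python
import torch

# Raw gradient of a small weight matrix; the first column corresponds to
# inputs that were never excited, so its raw gradient is exactly zero.
grad = torch.tensor([[0.0, 0.3, -0.1],
                     [0.0, 0.2,  0.4]])

# GC subtracts a mean term from the gradient (equation (1) of the GC paper).
centralized = grad - grad.mean(dim=1, keepdim=True)

print(grad[:, 0])         # tensor([0., 0.])  -> raw gradient is zero
print(centralized[:, 0])  # nonzero -> the weight still moves between checkpoints
```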

As a work-around, continuing training without GC (use_gc=False in Ranger) from a checkpoint and then comparing/visualizing the difference with a later checkpoint should also do the trick, I think.
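A minimal sketch of such a checkpoint comparison, assuming the files are plain state_dicts (for Lightning .ckpt files, take `ckpt["state_dict"]` first); the file names are placeholders:

```python
import torch

# Load two checkpoints taken some epochs apart (trained with use_gc=False).
old = torch.load("checkpoint_old.pt", map_location="cpu")
new = torch.load("checkpoint_new.pt", map_location="cpu")

for name, w_old in old.items():
    w_new = new[name]
    delta = (w_new - w_old).abs()
    # Fraction of weights that did not move at all: a large fraction in the
    # input layer hints at vanishing/zero gradients for those king positions.
    frozen = (delta < 1e-12).float().mean().item()
    print(f"{name}: max |delta| = {delta.max().item():.3e}, "
          f"frozen fraction = {frozen:.1%}")
```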

See also https://discuss.pytorch.org/t/how-to-check-for-vanishing-exploding-gradients/9019

ddobbelaere commented on Jan 31 '21