
PyTorch automatic inf/nan detection, collecting statistics

Open · albertz opened this issue 2 years ago · 1 comment

autograd.detect_anomaly detects inf/nan in the backward pass.

I want to have the same in the forward pass, with the possibility to whitelist a few special operations, modules, or code blocks, e.g. masking attention energies to -inf.

Maybe via a post forward hook for every module?
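For reference, a minimal sketch of what such a forward post-hook approach could look like (the whitelist and function names here are made up for illustration, not an existing API):

```python
import torch

# Hypothetical whitelist: module types for which inf/nan in the output is
# expected, e.g. attention modules that mask energies to -inf.
WHITELISTED_MODULE_TYPES = ()  # e.g. (MyMaskedAttention,)


def _check_output(module, args, output):
    # Forward post-hook: inspect all tensor outputs of the module.
    if isinstance(module, WHITELISTED_MODULE_TYPES):
        return
    tensors = output if isinstance(output, (tuple, list)) else (output,)
    for out in tensors:
        if isinstance(out, torch.Tensor) and out.is_floating_point():
            if not torch.isfinite(out).all():
                raise RuntimeError(
                    f"inf/nan detected in output of {type(module).__name__}")


def install_infnan_hooks(model: torch.nn.Module):
    # Register the check on every submodule; keep the handles to remove later.
    return [m.register_forward_hook(_check_output) for m in model.modules()]
```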

Or using sys.settrace to install some function which would inspect the locals?
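A rough sketch of the sys.settrace variant, assuming we only look at tensor locals when a frame returns (this is very slow and only meant for debugging):

```python
import sys
import torch


def _trace(frame, event, arg):
    # Local trace function: on every function return, scan the frame's locals
    # for floating-point tensors containing inf/nan.
    if event == "return":
        for name, value in frame.f_locals.items():
            if isinstance(value, torch.Tensor) and value.is_floating_point():
                if not torch.isfinite(value).all():
                    print(f"inf/nan in local {name!r} at "
                          f"{frame.f_code.co_filename}:{frame.f_lineno}")
    return _trace


# Usage:
# sys.settrace(_trace)
# loss = model(inputs)
# sys.settrace(None)
```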

Also see https://discuss.pytorch.org/t/detect-inf-nan-in-forward-pass/190514.

E.g. I have a model which gets a NaN loss (so in the forward pass), directly in the first step, and I want to know where it happens. (This is with AMP float16, so maybe https://github.com/pytorch/pytorch/issues/40497 is related, but that issue is only about AMP, not so much about adding such detection in general, which is the topic here.)

The same mechanism can then also be used to collect statistics on activations (mean, min, max, std, var, median, L2 norm, etc.).
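A minimal sketch of collecting such activation statistics with the same hook mechanism (the container and function names are illustrative):

```python
import torch

stats = {}  # module name -> list of per-step statistics dicts


def make_stats_hook(name):
    def hook(module, args, output):
        if isinstance(output, torch.Tensor) and output.is_floating_point():
            out = output.detach().float()
            stats.setdefault(name, []).append({
                "mean": out.mean().item(),
                "min": out.min().item(),
                "max": out.max().item(),
                "std": out.std().item(),
                "l2": out.norm().item(),
            })
    return hook


def install_stats_hooks(model: torch.nn.Module):
    return [m.register_forward_hook(make_stats_hook(n))
            for n, m in model.named_modules()]
```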

Note that the same for parameters is much easier, as we can simply iterate over them.
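E.g. a simple check over parameters (and their gradients) could look like this:

```python
import torch


def check_parameters(model: torch.nn.Module):
    # Parameters can simply be iterated over, no hooks needed.
    for name, param in model.named_parameters():
        for label, tensor in [("param", param), ("grad", param.grad)]:
            if tensor is not None and not torch.isfinite(tensor).all():
                print(f"inf/nan in {label} {name}")
```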

albertz · Oct 25 '23

Also see #1487 about collecting statistics in general.

albertz · Jan 03 '24