
Gradient Vanishing or Explosion when integrating TCFormer into HRNet codebase.

Open pengzhansun opened this issue 1 year ago • 1 comments

Dear Authors,

Thanks for making this solid work publicly available.

I am trying to integrate this method into the popular HRNet codebase, but I ran into what looks like a gradient vanishing or explosion problem during training: the loss becomes NaN and the message "NaN or Inf found in input tensor." is printed. It happens at around epoch 49:

Epoch: [49][600/4682] Time 0.274s (0.406s) Speed 116.8 samples/s Data 0.000s (0.012s) Loss 0.00058 (0.00060) Accuracy 0.638 (0.668)
Epoch: [49][700/4682] Time 0.409s (0.402s) Speed 78.2 samples/s Data 0.000s (0.012s) Loss 0.00052 (0.00060) Accuracy 0.672 (0.667)
Epoch: [49][800/4682] Time 0.522s (0.403s) Speed 61.3 samples/s Data 0.000s (0.011s) Loss nan (nan) Accuracy 0.000 (0.628)
NaN or Inf found in input tensor.
Epoch: [49][900/4682] Time 0.534s (0.397s) Speed 59.9 samples/s Data 0.000s (0.011s) Loss nan (nan) Accuracy 0.000 (0.558)
NaN or Inf found in input tensor.

Have you encountered similar problems? Or is TCFormer less stable than other methods such as HRNet and SimpleBaseline?

pengzhansun avatar Jul 07 '23 13:07 pengzhansun

We haven't encountered similar problems. The message reads "NaN or Inf found in input tensor." Could you check the input again?

jin-s13 avatar Jul 21 '23 06:07 jin-s13
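
One way to act on the suggestion to check the input is to scan each batch for non-finite values before it reaches the model. The sketch below is only an illustration and is not taken from the TCFormer or HRNet code: the helper names `count_non_finite` and `check_batch` are hypothetical, and it assumes the batch has already been converted to nested Python lists.

```python
import math


def count_non_finite(values):
    """Recursively count NaN/Inf entries in a nested list (or tuple) of numbers."""
    if isinstance(values, (list, tuple)):
        return sum(count_non_finite(v) for v in values)
    # Leaf element: math.isfinite is False for NaN, +Inf and -Inf.
    return 0 if math.isfinite(values) else 1


def check_batch(name, values):
    """Raise ValueError if the batch contains any NaN/Inf value (hypothetical helper)."""
    bad = count_non_finite(values)
    if bad:
        raise ValueError(f"{name}: {bad} non-finite value(s) found")


# Example: a clean batch passes silently, a corrupted one is reported.
check_batch("images", [[0.1, 0.2], [0.3, 0.4]])
print(count_non_finite([[1.0, float("nan")], [float("inf"), 2.0]]))
```

For a PyTorch tensor `x`, `x.tolist()` produces the nested lists used here, and `torch.isfinite(x).all()` is the vectorized equivalent. If the inputs and targets turn out to be clean, the NaN likely originates inside the forward or backward pass; `torch.autograd.set_detect_anomaly(True)` may help locate the operation, at the cost of slower training.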