
The loss tends to be NaN

Open txiaoyun opened this issue 6 years ago • 8 comments

When I run the code, the training loss is NaN. Could you give me some advice? Thank you.

txiaoyun avatar Jan 19 '19 08:01 txiaoyun

@txiaoyun @d-acharya I applied the patch suggested in the tensorflow_patch.txt file, but the training loss is still NaN. Could you give me some advice? Thank you.

xiaomingdaren123 avatar Mar 08 '19 03:03 xiaomingdaren123

@xiaomingdaren123 The loss going to NaN is caused by the eigenvalue decomposition.

  1. You can reduce the learning rate.
  2. You can clip the eigenvalues in covpoolnet.py, as follows:

```python
def _cal_log_cov(features):
    # Eigendecomposition of the symmetric covariance matrices.
    [s_f, v_f] = tf.self_adjoint_eig(features)
    # Clip eigenvalues away from zero so the log (and its gradient) stays finite.
    s_f = tf.clip_by_value(s_f, 0.0001, 10000)
    s_f = tf.log(s_f)
    s_f = tf.matrix_diag(s_f)
    # Reassemble the matrix logarithm: V * log(S) * V^T.
    features_t = tf.matmul(tf.matmul(v_f, s_f), tf.transpose(v_f, [0, 2, 1]))
    return features_t
```

But the loss will repeatedly grow and then fall again, and I did not reproduce the author's results.
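For reference, here is a minimal usage sketch of the clipped _cal_log_cov above on a batch of covariance matrices. The shapes, the dummy feature tensor, and the covariance computation are assumptions for illustration, not the repository's own pooling code, and the API is TensorFlow 1.x:

```python
import tensorflow as tf

# Assumes the clipped _cal_log_cov from the snippet above is already defined.
feats = tf.random_normal([8, 64, 16])                # hypothetical [batch, spatial, channels]
mean = tf.reduce_mean(feats, axis=1, keep_dims=True)
centered = feats - mean
cov = tf.matmul(centered, centered, transpose_a=True) / 64.0   # [8, 16, 16] covariance
log_cov = _cal_log_cov(cov)                          # matrix log with clipped eigenvalues

with tf.Session() as sess:
    print(sess.run(log_cov).shape)                   # (8, 16, 16)
```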

txiaoyun avatar Mar 13 '19 04:03 txiaoyun

The gradient computation of TensorFlow's eigendecomposition is most likely producing the NaNs. The patch proposed in tensorflow_patch.txt worked previously on a different system (with occasional failures). Recently I tried it on another system and it consistently produced NaNs too (on TensorFlow 1.13 it produces NaN after a few epochs, whereas on TensorFlow 1.2 it produces NaNs after around 600 epochs). I will check whether changing the regularization and learning rate avoids this, and update here. Clipping is an alternative solution and was actually used to train model4 and model2 mentioned in the paper. However, training again, I myself am unable to get the exact same numbers.
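As a sketch of the regularization idea (my own illustration, not the code used for the paper), one option is to add a small multiple of the identity to the covariance matrices before the eigendecomposition so the eigenvalues stay strictly positive before the log. Note this does not address NaNs coming from repeated eigenvalues in the eig gradient, and epsilon here is an arbitrary choice:

```python
import tensorflow as tf

def _cal_log_cov_regularized(features, epsilon=1e-4):
    # features: [batch, d, d] symmetric covariance matrices (assumed shape).
    d = tf.shape(features)[-1]
    # Shift the spectrum by epsilon so every eigenvalue is at least epsilon.
    features = features + epsilon * tf.eye(d, batch_shape=tf.shape(features)[:1])
    [s_f, v_f] = tf.self_adjoint_eig(features)
    s_f = tf.log(s_f)
    s_f = tf.matrix_diag(s_f)
    return tf.matmul(tf.matmul(v_f, s_f), tf.transpose(v_f, [0, 2, 1]))
```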

d-acharya avatar May 11 '19 08:05 d-acharya

However, if you cannot get the numbers in the paper using the pretrained models, I would try the following data: https://drive.google.com/open?id=1eh93I0ndg6X-liUJDYpWveIShLd0ao_x and make sure the following versions are used: scikit-learn==0.18.1, tensorflow==1.2.0, numpy==1.14.4, Pillow==4.3.0, Python 2.7.

A different version of pickle or of the classifier was found to affect the reported numbers.
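If it helps, here is a small environment check (my own snippet, not part of the repository) to confirm the versions above before evaluating the classifier; note that Pillow 4.x exposes its version as PIL.PILLOW_VERSION:

```python
from __future__ import print_function  # keeps the prints identical on Python 2.7
import sys
import numpy
import sklearn
import tensorflow as tf
import PIL

print("python       ", sys.version.split()[0])  # expect 2.7.x
print("numpy        ", numpy.__version__)       # expect 1.14.4
print("scikit-learn ", sklearn.__version__)     # expect 0.18.1
print("tensorflow   ", tf.__version__)          # expect 1.2.0
print("Pillow       ", PIL.PILLOW_VERSION)      # expect 4.3.0
```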

d-acharya avatar May 11 '19 08:05 d-acharya

@d-acharya @txiaoyun @xiaomingdaren123

I didn't apply the patch suggested in tensorflow_patch.txt, and I am using Python 3.5.

  1. The “Loss” value has been floating in a small range, neither increasing nor decreasing.
  2. The “RegLoss” value has remained constant at 0.

Could you give me some advice? Thank you.

YileYile avatar Sep 09 '19 03:09 YileYile

@txiaoyun @xiaomingdaren123 @d-acharya @YileYile I am facing the same problem. The loss becomes NaN after 10 epochs. Did anyone find a solution? Thanks.

fredlll avatar Sep 20 '19 06:09 fredlll

@txiaoyun @xiaomingdaren123 @YileYile @fredlll You can change the versions of TensorFlow and the related libraries to the ones the author mentioned. Then you will not get NaN.

PR1706 avatar Sep 29 '19 01:09 PR1706

Hi, how do you solve the problem about 'without dlpcnn'?

dyt0414 avatar Nov 23 '20 13:11 dyt0414