CovPoolFER
The loss tends to be NaN
When I run the code, the training loss is NaN. Could you give me some advice? Thank you.
@txiaoyun @d-acharya I applied the patch suggested in the tensorflow_patch.txt file, but the train_loss is still NaN. Could you give me some advice? Thank you.
@xiaomingdaren123 The loss tending to NaN is caused by the eigenvalue decomposition.
- you can reduce the learning rate;
- you can add a clip in covpoolnet.py, as follows:
```python
def _cal_log_cov(features):
    # Eigen-decompose the batched covariance features.
    [s_f, v_f] = tf.self_adjoint_eig(features)
    # Clip the eigenvalues away from zero so the log stays finite.
    s_f = tf.clip_by_value(s_f, 0.0001, 10000)
    s_f = tf.log(s_f)
    s_f = tf.matrix_diag(s_f)
    # Reassemble log(C) = V * diag(log(s)) * V^T.
    features_t = tf.matmul(tf.matmul(v_f, s_f), tf.transpose(v_f, [0, 2, 1]))
    return features_t
```
But the loss will repeatedly grow and fall again, and I also did not reproduce the author's results.
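For what it's worth, here is a minimal, self-contained sketch (assuming TensorFlow 1.x, as used by this repo; the toy matrix is made up) of why the clip matters: a rank-deficient covariance matrix has a near-zero eigenvalue, so tf.log of the raw eigenvalues gives -inf or a huge negative number, while the clipped version stays bounded.

```python
import tensorflow as tf  # assumes TensorFlow 1.x APIs (tf.self_adjoint_eig, tf.log, tf.Session)

# A rank-deficient "covariance" matrix (batch of one): its smallest eigenvalue
# is ~0, which is where tf.log and its gradient blow up during training.
cov = tf.constant([[[1.0, 1.0],
                    [1.0, 1.0]]], dtype=tf.float32)

s, v = tf.self_adjoint_eig(cov)
log_raw = tf.log(s)                                       # -inf / huge negative at the ~0 eigenvalue
log_clipped = tf.log(tf.clip_by_value(s, 0.0001, 10000))  # bounded below by log(1e-4)

with tf.Session() as sess:
    print(sess.run([s, log_raw, log_clipped]))
```

Because tf.clip_by_value has zero gradient outside the clip range, the backward pass through the near-zero eigenvalues is also cut off, which is most likely what otherwise drives the loss to NaN.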
The gradient computation of TensorFlow's eigen decomposition is most likely producing NaNs. The technique proposed in tensorflow_patch.txt worked previously on a different system (with occasional failures). Recently I tried it on another system and it consistently produced NaNs too (on TensorFlow 1.13 it produces NaN after a few epochs, whereas on TensorFlow 1.2 it produces NaNs after around 600 epochs). I will check whether changing the regularization and learning rate avoids this and update here. Clipping is an alternative solution and was actually used to train model4 and model2 mentioned in the paper. However, training again, I myself am unable to get the same exact numbers.
However, if you cannot get the numbers in the paper by using the pretrained models, I would try the following data: https://drive.google.com/open?id=1eh93I0ndg6X-liUJDYpWveIShLd0ao_x and make sure the following versions are used: scikit-learn==0.18.1, tensorflow==1.2.0, numpy==1.14.4, Pillow==4.3.0, Python 2.7.
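For example, one way to pin these (a sketch; only the version numbers come from the list above, the requirements file itself is hypothetical) is to install from a requirements file inside a Python 2.7 environment:

```
# requirements.txt (install inside a Python 2.7 environment)
scikit-learn==0.18.1
tensorflow==1.2.0
numpy==1.14.4
Pillow==4.3.0
```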
A different version of pickle or of the classifier was found to affect the reported numbers.
@d-acharya @txiaoyun @xiaomingdaren123
I didn't apply the patch suggested in tensorflow_patch.txt, and I am using Python 3.5.
- The “Loss” value fluctuates within a small range, neither increasing nor decreasing.
- The “RegLoss” value stays constant at 0.
Could you give me some advice? Thank you.
@txiaoyun @xiaomingdaren123 @d-acharya @YileYile I am facing the same problem. The loss becomes NaN after 10 epochs. Did anyone find a solution? Thanks.
@txiaoyun @xiaomingdaren123 @YileYile @fredlll You can switch to the versions of TensorFlow and the related libraries that the author listed. Then you will not get NaN.
Hi, how do you solve the problem with 'without dlpcnn'?