
NaN values in Predictions

nhafez opened this issue 6 years ago · 3 comments

After following the instructions in the latest commit and then running train_and_test_deepim_all.sh, I got the following error:

Traceback (most recent call last):
  File "experiments/deepim/deepim_train_test.py", line 20, in <module>
    train.main()
  File "experiments/deepim/../../deepim/train.py", line 287, in main
    config.TRAIN.begin_epoch, config.TRAIN.end_epoch, config.TRAIN.lr, config.TRAIN.lr_step)
  File "experiments/deepim/../../deepim/train.py", line 280, in train_net
    prefix=prefix)
  File "experiments/deepim/../../deepim/core/module.py", line 1026, in fit
    data_batch = interBatchUpdater.forward(data_batch, preds, config)
  File "experiments/deepim/../../lib/pair_matching/batch_updater_py_multi.py", line 231, in forward
    rot_type='QUAT')
  File "experiments/deepim/../../lib/pair_matching/RT_transform.py", line 34, in calc_RT_delta
    r = mat2quat(Rm_delta)
  File "experiments/deepim/../../lib/pair_matching/RT_transform.py", line 459, in mat2quat
    vals, vecs = np.linalg.eigh(K)
  File "/home/saadhana/.local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 1410, in eigh
    w, vt = gufunc(a, signature=signature, extobj=extobj)
  File "/home/saadhana/.local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 95, in _raise_linalgerror_eigenvalues_nonconvergence
    raise LinAlgError("Eigenvalues did not converge")
numpy.linalg.linalg.LinAlgError: Eigenvalues did not converge
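
For reference, and not part of the original report: the LinAlgError is raised because np.linalg.eigh is handed a matrix full of NaNs, so a finiteness check on the delta rotation before the quaternion conversion surfaces the real cause instead of crashing inside NumPy. The names below are illustrative, a minimal sketch rather than the repo's actual code:

```python
import numpy as np

def quat_from_mat_checked(rot_mat, mat2quat):
    """Convert a rotation matrix to a quaternion, failing early on NaN input.

    `mat2quat` stands in for an existing conversion routine (such as the one
    in lib/pair_matching/RT_transform.py); `rot_mat` is the delta rotation.
    """
    if not np.all(np.isfinite(rot_mat)):
        # np.linalg.eigh cannot converge on NaN/Inf input, which is what
        # produces the LinAlgError above; report the real cause instead.
        raise ValueError("non-finite rotation matrix, the network "
                         "predictions are probably NaN:\n%r" % (rot_mat,))
    return mat2quat(rot_mat)
```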

It looks like the predicted poses are all NaN. I printed the predicted rotation and translation:

[array([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]], dtype=float32)]
[array([[nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan]], dtype=float32)]
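
A quick way to see which outputs go bad, and at which iteration, is to scan the forward outputs right after each forward pass. This is only a sketch, assuming preds is the list of NDArrays that fit() hands to the batch updater:

```python
import numpy as np

def report_nan_outputs(preds, nbatch):
    """Report which network outputs contain NaN/Inf at iteration `nbatch`.

    Assumes `preds` is a list of mxnet NDArray outputs, as passed to the
    batch updater's forward() after each forward pass.
    """
    for i, pred in enumerate(preds):
        arr = pred.asnumpy()
        n_bad = np.size(arr) - np.isfinite(arr).sum()
        if n_bad > 0:
            print("iter %d: output %d has %d non-finite values out of %d"
                  % (nbatch, i, n_bad, np.size(arr)))
```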

Has anybody successfully trained the network on the LINEMOD or OCCLUSION datasets?

nhafez · Oct 11 '18 14:10

I ran train_and_test_deepim_all.sh and did not encounter this error. Can you provide more information, such as the context in which the error occurs? Does it fail in the first iteration or after a few iterations?

liyi14 · Oct 12 '18 02:10

The first batch completes one iteration, and on the next one this error happens because the predictions are NaN.

nhafez · Oct 12 '18 09:10

Can you change frequent under default in experiments/deepim/cfg/*_any/all.yaml (abbreviated as config below) to 1, then try the following modifications separately and tell me the result of each:

  1. rerun train_and_test_deepim_all.sh
  2. run with train_test_deepim_ape.yaml
  3. change train_iter_size in config->network to 1 and rerun any config that reported the error before
  4. replace dataset: LM6D_REFINE+LM6D_REFINE_SYN with dataset: LM6D_REFINE, and image_set: train_+train_ with image_set: train_
  5. change config->TRAIN->warmup_lr to 0.0

Please tell me what happens after applying each of these modifications; the yaml sketch after this list shows roughly where these keys live. Thank you.
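
Not from the thread itself, but to make the checklist above concrete, the keys it refers to live roughly as sketched below in the yaml config. Treat this as an illustration of the key paths, not a verbatim copy of the repo's config file:

```yaml
default:
  frequent: 1           # log every iteration
network:
  train_iter_size: 1    # item 3
dataset:
  dataset: LM6D_REFINE  # item 4 (instead of LM6D_REFINE+LM6D_REFINE_SYN)
  image_set: train_     # item 4 (instead of train_+train_)
TRAIN:
  warmup_lr: 0.0        # item 5
```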

liyi14 · Oct 13 '18 00:10