
training is unstable for correspondences experiment

Open zjhthu opened this issue 6 years ago • 8 comments

I find that training is unstable when using n3net in the correspondences experiment: the training loss increases suddenly and the validation accuracy drops at the same time. Training falls into a bad local minimum.

So, has anyone encountered this problem? I use the default config for training.

Figures: training loss and validation accuracy curves

zjhthu avatar Dec 05 '18 03:12 zjhthu

@tobiasploetz

zjhthu avatar Dec 05 '18 11:12 zjhthu

We also observed training instability (even without the N3 block) in the later stages of training on the St. Peters dataset.

Since accuracy on the validation set peaked early anyway (<150k iterations), we did not bother investigating this issue too deeply.

tobiasploetz avatar Dec 07 '18 13:12 tobiasploetz

Hi @tobiasploetz, we conducted more experiments; the results are below. We observed that training failed after some number of iterations on both datasets (~150k for St. Peters, ~50k for Brown). On the brown_bm_3_05 dataset, we got a good result, acc_qt_auc20_ours_ransac = 0.5111 (0.5100 in the paper). However, on the St. Peters dataset we got acc_qt_auc20_ours_ransac = 0.5263 (0.5740 in the paper). We used the default training configuration from https://github.com/vcg-uvic/learned-correspondence-release. Could you provide more details about your training? Do we need to run training more than once and select the best run on St. Peters, or is there anything else we need to modify?

Figure 1: training loss on the St. Peters dataset
Figure 2: val_acc and test_acc on the St. Peters dataset; on the test set, acc_qt_auc20_ours_ransac = 0.5263
Figure 3: training loss on the brown_bm_3_05 dataset
Figure 4: val_acc and test_acc on the brown_bm_3_05 dataset; on the test set, acc_qt_auc20_ours_ransac = 0.5111

sundw2014 avatar Dec 21 '18 09:12 sundw2014

Hi @sundw2014,

I will look into this shortly. For the time being, here are the training curves that we got on StPeters:

Fig. 1: training loss on the St. Peters dataset

Fig. 2: val_acc and test_acc on the St. Peters dataset

For us, training broke down roughly at iteration 250k.

Bests, Tobias

tobiasploetz avatar Jan 08 '19 09:01 tobiasploetz

Hi @sundw2014,

just a quick update on this issue. I ran some experiments and here is what I found:

  1. Running the code on CUDA 9 + a GTX 1080 or Titan X works most of the time (I observed one training run that crashed after ~130k iterations; the other runs went fine and reached comparable numbers).
  2. Running the code on CUDA 10 + an RTX 2080 always failed after a varying number of epochs :(

So it seems to be an issue with the CUDA/GPU/cuDNN versions used. Can you provide specifics about your system?

Bests, Tobias

tobiasploetz avatar Feb 05 '19 08:02 tobiasploetz

Hi @tobiasploetz ,

I am sorry for getting back to you so late. We ran the code with CUDA 9.2 on a Tesla M40 24GB, using Python 3.5.4 (from Anaconda).

Best Regards

sundw2014 avatar Feb 21 '19 11:02 sundw2014

Hi @sundw2014,

I think I found the culprit that causes the unstable training. The original implementation of the classification loss contains this line.

classif_losses = -tf.log(tf.nn.sigmoid(c * self.logits))

This produces infs when the argument to the sigmoid is strongly negative: the sigmoid underflows to zero and the log returns -inf. Changing the above line to the numerically stable log_sigmoid function solves the numerical problems.

classif_losses = -tf.log_sigmoid(c * self.logits)
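
For illustration only (not part of the repository), here is a tiny TF1-style check of the difference; the tensor x below is a hypothetical stand-in for c * self.logits:

import tensorflow as tf

# Hypothetical stand-in for c * self.logits; -1000 makes the sigmoid underflow to 0.
x = tf.constant([-1000.0, 0.0, 1000.0])

unstable = -tf.log(tf.nn.sigmoid(x))  # -log(0) -> inf at x = -1000
stable = -tf.log_sigmoid(x)           # computed like softplus(-x), stays finite

with tf.Session() as sess:
    print(sess.run(unstable))  # [inf, 0.6931472, 0.]
    print(sess.run(stable))    # [1000., 0.6931472, 0.]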

However, things are more complex than that, because training now diverges fairly quickly. My current guess is that the numerically unstable implementation was implicitly filtering out some outlier predictions (since training batches with non-finite parameter gradients are skipped). With the numerically stable log-sigmoid implementation, these training batches are no longer skipped, and hence the outliers affect the parameter updates.
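
Not the authors' actual training code, but a minimal TF1-style sketch of how such filtering could be made explicit (logits, labels and c are hypothetical placeholders standing in for the real model): compute the gradients, check that they are all finite, and skip the update otherwise.

import numpy as np
import tensorflow as tf

# Hypothetical minimal setup standing in for the real model; only the
# gradient-skipping pattern is the point here.
logits = tf.get_variable("logits", shape=[8], initializer=tf.zeros_initializer())
labels = tf.placeholder(tf.float32, shape=[8])
c = 2.0 * labels - 1.0                              # map {0, 1} labels to {-1, +1}
loss = tf.reduce_mean(-tf.log_sigmoid(c * logits))

optimizer = tf.train.AdamOptimizer(1e-4)
grads_and_vars = optimizer.compute_gradients(loss)
grads_finite = tf.reduce_all([tf.reduce_all(tf.is_finite(g))
                              for g, _ in grads_and_vars if g is not None])

# Apply the update only when every gradient is finite; otherwise skip the batch,
# mimicking what the inf-producing loss did implicitly.
train_op = tf.cond(grads_finite,
                   lambda: optimizer.apply_gradients(grads_and_vars),
                   lambda: tf.no_op())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_labels = np.random.randint(0, 2, size=8).astype(np.float32)
    sess.run(train_op, feed_dict={labels: batch_labels})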

I am currently investigating this further :)

Bests, Tobias

tobiasploetz avatar Feb 22 '19 12:02 tobiasploetz

Cool! Thanks for your help. I look forward to hearing from you again.

sundw2014 avatar Feb 22 '19 12:02 sundw2014