
training is unstable for correspondences experiment

Open zjhthu opened this issue 6 years ago • 8 comments

I find that training is unstable when using n3net in the correspondences experiment: the training loss increases suddenly and the validation accuracy drops at the same time. Training falls into a bad local minimum.

So, has anyone encountered this problem? I use the default config for training.

Figures: training loss and validation accuracy curves

zjhthu avatar Dec 05 '18 03:12 zjhthu

@tobiasploetz

zjhthu avatar Dec 05 '18 11:12 zjhthu

We also observed training instability (even without the N3 block) in the later stages of training on the St. Peters dataset.

Since accuracy on the validation set peaked early anyway (<150k iterations), we did not bother investigating this issue too deeply.

tobiasploetz avatar Dec 07 '18 13:12 tobiasploetz

Hi @tobiasploetz, we conducted more experiments; the results are below. We observed that training failed after some number of iterations on both datasets (~150k for St. Peters, ~50k for Brown). On the brown_bm_3_05 dataset, we got a good result, acc_qt_auc20_ours_ransac = 0.5111 (0.5100 in the paper). However, on the St. Peters dataset we got acc_qt_auc20_ours_ransac = 0.5263 (0.5740 in the paper). We used the default training configuration from https://github.com/vcg-uvic/learned-correspondence-release. Could you provide more details about your training? Do we need to run training more than once and select the best run on St. Peters, or is there anything else we need to modify?

Figure 1: training loss on the St. Peters dataset
Figure 2: val_acc and test_acc on the St. Peters dataset; on the test set, acc_qt_auc20_ours_ransac = 0.5263
Figure 3: training loss on the brown_bm_3_05 dataset
Figure 4: val_acc and test_acc on the brown_bm_3_05 dataset; on the test set, acc_qt_auc20_ours_ransac = 0.5111

sundw2014 avatar Dec 21 '18 09:12 sundw2014

Hi @sundw2014,

I will look into this shortly. For the time being, here are the training curves that we got on StPeters:

Fig. 1: training loss on the St. Peters dataset

Fig. 2: val_acc and test_acc on the St. Peters dataset

For us, training broke down roughly at iteration 250k.

Bests, Tobias

tobiasploetz avatar Jan 08 '19 09:01 tobiasploetz

Hi @sundw2014,

just a quick update on this issue. I ran some experiments and here is what I found:

  1. Running the code on CUDA 9 + a GTX 1080 or Titan X works most of the time (I observed one training run that crashed after ~130k iterations; the other runs went fine and reached comparable numbers).
  2. Running the code on CUDA 10 + an RTX 2080 always failed after a varying number of epochs :(

So it seems to be an issue with the CUDA/GPU/cuDNN versions used. Can you provide specifics about your system?

Bests, Tobias

tobiasploetz avatar Feb 05 '19 08:02 tobiasploetz

Hi @tobiasploetz ,

I am sorry for getting back to you so late. We ran the code with CUDA 9.2 on a Tesla M40 24GB, using Python 3.5.4 (from Anaconda).

Best Regards

sundw2014 avatar Feb 21 '19 11:02 sundw2014

Hi @sundw2014,

I think I found the culprit that causes the unstable training. The original implementation of the classification loss contains this line.

classif_losses = -tf.log(tf.nn.sigmoid(c * self.logits))

This produces infs when the argument to the sigmoid is strongly negative: the sigmoid underflows to zero and the log returns -inf. Changing the above line to the numerically stable log_sigmoid function solves the numerical problems.

classif_losses = -tf.log_sigmoid(c * self.logits)
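
For illustration only (not part of the repository), here is a tiny TF1-style check of the difference; the tensor x below is a hypothetical stand-in for c * self.logits:

import tensorflow as tf

# Hypothetical stand-in for c * self.logits; -1000 makes the sigmoid underflow to 0.
x = tf.constant([-1000.0, 0.0, 1000.0])

unstable = -tf.log(tf.nn.sigmoid(x))  # -log(0) -> inf at x = -1000
stable = -tf.log_sigmoid(x)           # computed like softplus(-x), stays finite

with tf.Session() as sess:
    print(sess.run(unstable))  # [inf, 0.6931472, 0.]
    print(sess.run(stable))    # [1000., 0.6931472, 0.]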

However, things are more complex than that, because training now diverges fairly quickly. My current guess is that the numerically unstable implementation was implicitly filtering out some outlier predictions (since training batches with non-finite parameter gradients are skipped). With the numerically stable log-sigmoid implementation, these training batches are no longer skipped, and hence the outliers affect the parameter updates.
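
Not the authors' actual training code, but a minimal TF1-style sketch of how such filtering could be made explicit (logits, labels and c are hypothetical placeholders standing in for the real model): compute the gradients, check that they are all finite, and skip the update otherwise.

import numpy as np
import tensorflow as tf

# Hypothetical minimal setup standing in for the real model; only the
# gradient-skipping pattern is the point here.
logits = tf.get_variable("logits", shape=[8], initializer=tf.zeros_initializer())
labels = tf.placeholder(tf.float32, shape=[8])
c = 2.0 * labels - 1.0                              # map {0, 1} labels to {-1, +1}
loss = tf.reduce_mean(-tf.log_sigmoid(c * logits))

optimizer = tf.train.AdamOptimizer(1e-4)
grads_and_vars = optimizer.compute_gradients(loss)
grads_finite = tf.reduce_all([tf.reduce_all(tf.is_finite(g))
                              for g, _ in grads_and_vars if g is not None])

# Apply the update only when every gradient is finite; otherwise skip the batch,
# mimicking what the inf-producing loss did implicitly.
train_op = tf.cond(grads_finite,
                   lambda: optimizer.apply_gradients(grads_and_vars),
                   lambda: tf.no_op())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_labels = np.random.randint(0, 2, size=8).astype(np.float32)
    sess.run(train_op, feed_dict={labels: batch_labels})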

I am currently investigating this further :)

Bests, Tobias

tobiasploetz avatar Feb 22 '19 12:02 tobiasploetz

Cool! Thanks for your help. I look forward to hearing from you again.

sundw2014 avatar Feb 22 '19 12:02 sundw2014