MultiResUNet
Validation / training scores mismatch
Hi,
I have run your network, based on the notebook, in a project of mine. However, I pondered quite a bit over my validation Jaccard scores outperforming the training scores by a large margin. I suspect the answer lies in the rounding of yp that you perform in evaluateModel(). From what I can tell, this rounding is not done in the function that is used during training. After removing the rounding, the scores matched as expected.
Please let me know if I'm missing the point somewhere, or if you agree with the observation.
Thanks for a superb piece of work!
Arild
I also noticed that. In which line of the code did you make the change?
Removed yp = np.round(yp,0) in evaluateModel()
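For reference, here is a minimal numpy sketch of why that rounding inflates the score; the toy masks and the flattened intersection-over-union form are illustrative only, not the notebook's exact code:

```python
import numpy as np

def jaccard_np(y_true, y_pred):
    # Flatten the masks and compute intersection over union on (possibly soft) values.
    yt, yp = y_true.ravel(), y_pred.ravel()
    intersection = np.sum(yt * yp)
    return intersection / (np.sum(yt) + np.sum(yp) - intersection)

y_true = np.array([1.0, 1.0, 0.0, 0.0])
y_pred = np.array([0.8, 0.7, 0.1, 0.2])        # confident sigmoid outputs

print(jaccard_np(y_true, y_pred))              # soft score, ~0.65
print(jaccard_np(y_true, np.round(y_pred, 0))) # rounded score, 1.0
```

With reasonably confident predictions the rounded score is systematically higher, which would explain the validation (rounded) scores appearing to beat the training (unrounded) scores.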
Ah, okay, I think I meant something else. Your change would only affect what I would call the test scores, whereas I was concerned with the validation scores, which are computed alongside the training scores: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7 Strangely enough, even these validation values are always higher than the training values in my case, although logically they should not be; see below. But the test results after leave-one-out cross-validation are actually also significantly higher.

Can one simply do without the rounding? Most of these scores are only defined for binary decisions, aren't they?
It is in fact the same thing. Notice that evaluateModel() is run after every epoch, and that the best model is determined based on the Jaccard score in evaluateModel(). Hence, the "test" data in the notebook does function as a validation set because it is indirectly used to pick a "best" model.
With that being said, the problem, so to speak, is that the jacard() function used during training does not round the prediction.

Thus, if you score your test data after training and still round your prediction before scoring it, you will get a score which is not completely comparable to that calculated during training.
To address your last question: since the output layer of the model contains sigmoids, it does make sense not to round the prediction during training. However, I don't know how the model would behave if one changed jacard() to apply rounding.
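As a rough sketch (not the repository's exact code, but the general shape of a Keras soft Jaccard), the training-time metric and a rounded counterpart could look like this; whether training still behaves sensibly if jacard() itself rounds is exactly the open question:

```python
from tensorflow.keras import backend as K

def jacard_soft(y_true, y_pred):
    # Soft Jaccard: operates directly on the sigmoid outputs, no rounding.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    # K.epsilon() guards against division by zero on empty masks.
    return intersection / (K.sum(y_true_f) + K.sum(y_pred_f) - intersection + K.epsilon())

def jacard_rounded(y_true, y_pred):
    # Hard Jaccard: thresholds the prediction first, comparable to a rounded evaluation.
    return jacard_soft(y_true, K.round(y_pred))
```

Passing both to model.compile(..., metrics=[jacard_soft, jacard_rounded]) would let the training log report the two variants side by side.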
Notice that evaluateModel() is run after every epoch, and that the best model is determined based on the Jaccard score in evaluateModel(). Hence, the "test" data in the notebook does function as a validation set because it is indirectly used to pick a "best" model.
I forgot that I had changed that in my copy: I am using a trainStep() with a maximum number of epochs, early stopping and a predefined validation data set. Afterwards I use evaluateModel() only on the separate test set.
You are right. I am also interested in the effect of rounding versus not rounding.
Hi, I think the issue can also be solved by removing the batch normalization from the output layer.
Your paper states: "All the convolutional layers in this network, except for the output layer, are activated by the ReLU (Rectified Linear Unit) activation function (LeCun et al., 2015), and are batch-normalized (Ioffe & Szegedy, 2015). "
That is, replace the batch-normalized output layer in line 119, conv10 = conv2d_bn(mresblock9, n_labels, 1, 1, activation=self.activation), with a more standard output layer.
suggestion: conv10 = Conv2D(n_labels, (1, 1), activation=self.activation)(mresblock9)
(also applicable to the 3D net)
This resolved similar issues for me.
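To make the suggestion concrete, here is a hedged sketch: the conv2d_bn helper below is only an approximation of the repository's helper, the input shape is an arbitrary stand-in for mresblock9, and the sigmoid activation stands in for self.activation:

```python
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation

def conv2d_bn(x, filters, num_row, num_col, activation='relu', padding='same'):
    # Approximation of the repository's helper: Conv2D -> BatchNormalization -> activation.
    x = Conv2D(filters, (num_row, num_col), padding=padding, use_bias=False)(x)
    x = BatchNormalization()(x)
    if activation is not None:
        x = Activation(activation)(x)
    return x

features = Input((128, 128, 32))  # stand-in for mresblock9
n_labels = 1

# Batch-normalized output layer (the current line 119):
out_bn = conv2d_bn(features, n_labels, 1, 1, activation='sigmoid')

# Suggested plain output layer, matching the paper's description:
out_plain = Conv2D(n_labels, (1, 1), activation='sigmoid')(features)
```

Dropping the batch normalization right before the sigmoid keeps the output layer consistent with what the paper describes.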
btw: I tested rounding during training and the results now look much more like those shown in Figure 6 of the paper. I would be interested in a learning curve including training and validation scores, like in my example.
Thank you @arilmad for your interest in our project, and thanks to @saskra and @Jderuijter for keeping the conversation running. Apologies for my late response; I was occupied with some other things for the last few days.
First things first: since the Dice coefficient and the Jaccard index are defined for binary values, we should round the values to compute them.
Honestly speaking, I didn't use the metrics computed during the training procedure in my notebook, so the fact that the values were not rounded was something I overlooked. As has been pointed out in this thread, I used the evaluateModel() function for that purpose instead.
If you wish to compute the Dice or Jaccard values during training, it would be proper to round the values.
Another thing worth noting is why I didn't include the rounding when computing those metrics in the first place. I actually used those functions to compute Dice- or Jaccard-based loss functions, i.e. jaccard loss = - jaccard index. When we compute them as metrics, we must round the values to obtain the actual score, by definition. But when we are treating them as loss functions, we should not round them; rather, we should keep them as floating-point numbers, as that helps the model improve. For example, suppose in one epoch a certain value was 0.67 and in the next epoch it becomes 0.78. If we don't round, the improvement is reflected in the loss value, but if we round, the improvement gets lost, since round(0.67) = round(0.78) = 1. Since I actually used those functions to experiment with Dice- or Jaccard-based loss functions, I didn't do the rounding there.
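For concreteness, a minimal sketch of that distinction, assuming a Keras-backend implementation along the lines sketched earlier (not the exact functions used in the experiments):

```python
from tensorflow.keras import backend as K

def soft_jaccard(y_true, y_pred):
    # Unrounded Jaccard index computed on the raw sigmoid outputs.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return intersection / (K.sum(y_true_f) + K.sum(y_pred_f) - intersection + K.epsilon())

def jaccard_loss(y_true, y_pred):
    # Loss: keep the value unrounded so an improvement from 0.67 to 0.78 still moves the loss;
    # rounding would collapse both to 1 and also stop gradients from flowing.
    return -soft_jaccard(y_true, y_pred)

def jaccard_metric(y_true, y_pred):
    # Metric: round the prediction first so the reported value matches the binary definition.
    return soft_jaccard(y_true, K.round(y_pred))

# e.g. model.compile(optimizer='adam', loss=jaccard_loss, metrics=[jaccard_metric])
```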
To add one more aspect: I once observed how the Jaccard score on the validation data set behaves during training, recording both the relative (unrounded) values, as in the original source code, and the rounded values during the same run. Interestingly, on this dataset it looks like the relative values continue to increase for a while after the first few epochs, while the rounded values decrease again. I repeated that >100 times as part of a leave-one-out cross-validation and observed the same pattern every time. (btw: The y label should be "Jaccard" and not "Loss".)
Relative values: (plot of the validation Jaccard score per epoch)
Rounded values: (plot of the validation Jaccard score per epoch)