
[v2 BUG]: MCC loss can give Nan

Open KnathanM opened this issue 1 year ago • 2 comments

**Describe the bug**
`BinaryMCCLoss` can give a loss of `nan`.

**Example(s)**
The confusion matrix elements (TP, FP, TN, FN) are calculated as `targets * predictions`. At the start of training, the model's raw predictions can happen to be negative, in which case some of the matrix elements are negative too. This can cause `((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)).sqrt()` to take the square root of a negative number and return `nan`.
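A minimal pure-Python sketch of this failure mode (not chemprop's actual implementation; `soft_binary_mcc` is a hypothetical name, and `math.sqrt` is guarded to mimic `torch.sqrt` returning `nan` for negative inputs):

```python
import math

def soft_binary_mcc(preds, targets):
    # Soft confusion-matrix entries: sums of per-sample products,
    # mirroring the targets * predictions formulation above.
    TP = sum(p * t for p, t in zip(preds, targets))
    FP = sum(p * (1 - t) for p, t in zip(preds, targets))
    FN = sum((1 - p) * t for p, t in zip(preds, targets))
    TN = sum((1 - p) * (1 - t) for p, t in zip(preds, targets))
    denom = (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
    # torch.sqrt returns nan for negative inputs; math.sqrt raises,
    # so mimic the torch behaviour here.
    root = math.sqrt(denom) if denom >= 0 else math.nan
    return (TP * TN - FP * FN) / root

raw = [-1.0, 0.5, -0.8, 0.2]   # raw model outputs early in training
targets = [0, 1, 0, 1]
print(soft_binary_mcc(raw, targets))   # nan

probs = [1 / (1 + math.exp(-x)) for x in raw]  # sigmoid keeps values in (0, 1)
print(soft_binary_mcc(probs, targets))         # finite
```

Since `TP + FP` reduces to `sum(preds)`, any batch whose raw outputs sum to a negative number drives the product under the square root negative; squashing through a sigmoid first rules that out.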

**Expected behavior**
I'm not sure what the expected behavior should be. Perhaps the raw predictions should be forced to be positive (via a sigmoid?), or the loss set to zero whenever it is `nan`.

**Additional context**
This was also a problem in v1 (#380). Our current implementation seems to be taken directly from v1.

KnathanM avatar Apr 29 '24 19:04 KnathanM

Is it possible for the prediction to be negative? Should the prediction be the output of the sigmoid function? In v1, PR #309 made the multiclass MCC loss return 0 in cases where the loss was infinite; however, the binary MCC was not fixed at that time. For very imbalanced datasets, a batch may contain only positive or only negative examples, in which case the MCC is not meaningful. Using class balance might be a workaround in that case.

shihchengli avatar Apr 29 '24 20:04 shihchengli

I believe the error is here:

https://github.com/chemprop/chemprop/blob/cc5b3c11e0aedff3abadff43b9c0860fa07a4370/chemprop/nn/loss.py#L195-L197

This assumes the `preds` input to `*MCCLoss` will be a tensor of shape $(b, t, k)$, where $b$ is the batch size, $t$ is the number of tasks, and $k$ is the number of classes to predict, containing either logits or predictions. Of course, this doesn't work in the binary case, where the tensor has shape $(b, t)$ but still needs to be normalized. The easiest fix is to split the binary and multiclass losses into their own classes; the mixin isn't really doing much here (it only avoids three lines of repeated code).
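To illustrate the shape mismatch, here is a hypothetical pure-Python sketch (not chemprop's API; `normalize_binary` and `normalize_multiclass` are invented names) of the two normalizations that split classes would apply:

```python
import math

def normalize_binary(logits):
    # shape (b, t): one logit per task -> elementwise sigmoid
    return [[1 / (1 + math.exp(-x)) for x in row] for row in logits]

def normalize_multiclass(logits):
    # shape (b, t, k): k logits per task -> softmax over the class axis
    out = []
    for sample in logits:
        tasks = []
        for class_logits in sample:
            m = max(class_logits)                 # subtract max for stability
            exps = [math.exp(x - m) for x in class_logits]
            s = sum(exps)
            tasks.append([e / s for e in exps])
        out.append(tasks)
    return out

print(normalize_binary([[-2.0, 0.0]]))            # every value now in (0, 1)
print(normalize_multiclass([[[1.0, 2.0, 3.0]]]))  # each class vector sums to 1
```

A single shared normalization can't handle both layouts, which is the argument for two separate loss classes rather than the mixin.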

I also agree with #380 in that we should add a small $\epsilon \leq 10^{-8}$ value for numerical stability when it's possible we may calculate $\frac 0 0$. I can't really see that happening here if we're calculating the soft MCC (i.e., using probabilities rather than predictions), but it never hurts.
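A sketch of the $\epsilon$ fix (hypothetical helper `mcc_from_counts`, not chemprop code): with an all-positive batch, TN, FP, and FN are all zero, so both numerator and denominator vanish and the plain formula yields $\frac 0 0$.

```python
import math

EPS = 1e-8  # small stabilizer, as suggested in v1's #380

def mcc_from_counts(TP, FP, TN, FN, eps=EPS):
    num = TP * TN - FP * FN
    denom = (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
    # clamp negative products to zero and add eps so 0/0 becomes 0/eps = 0
    return num / (math.sqrt(max(denom, 0.0)) + eps)

# A batch containing only positive targets: TN = FP = FN = 0, so both the
# numerator and the denominator vanish; without eps this would be 0/0 -> nan.
print(mcc_from_counts(TP=3.0, FP=0.0, TN=0.0, FN=0.0))  # 0.0
```

For well-behaved inputs the `eps` term barely perturbs the result (e.g. a perfect classifier still scores essentially 1), so it only matters in the degenerate case.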

davidegraff avatar Apr 30 '24 14:04 davidegraff