
How to define a model and a loss

Open · karalets opened this issue 4 years ago · 29 comments

https://github.com/choderalab/pinot/blob/e1d326043f2157f8572518a609c981cb0f444941/pinot/app/gaussian_variational.py#L16

Here, you define a loss by handwriting some math. This is doable, but I have some problems with it.

First, the loss is basically the negative log-likelihood of the output noise model. As such, it is a property of the model and should live inside the model.

This leads me to the second point: we should probably make the models classes here, i.e.

class Molecule_graph_nn(nn.Module):...

Third, while I very much appreciate you writing out the math directly here, I do recommend using some tooling that is less error-prone.

In particular, my team wrote the distributions library for PyTorch: https://pytorch.org/docs/stable/distributions.html

Here, you can do the following quite nicely (see https://pytorch.org/docs/stable/distributions.html#normal)

import torch
from torch.distributions import Normal
from molecule_models import Molecule_graph_nn  # hypothetical module path; import the model class from wherever it lives

graph_nn_molecules = Molecule_graph_nn(args, ...)

# consider X being a molecule input
mu_network, log_sigma_network = graph_nn_molecules.forward(X)

mu = mu_network
sigma = torch.exp(log_sigma_network)  # exp keeps sigma strictly positive
dist = Normal(mu, sigma)

loss = -dist.log_prob(X)

# now you can use loss inside a regular training loop

Even better would be having this fully modularized.

i.e.

class Molecule_graph_nn(nn.Module):
    ...

    def condition(self, X):
        # map the input X to the parameters of a predictive Normal distribution
        theta = self.forward(X)
        mu = theta[0]
        sigma = torch.exp(theta[1])  # exp keeps sigma strictly positive
        dist = Normal(mu, sigma)
        return dist

    def sample(self, X):
        dist = self.condition(X)
        m_sample = dist.sample()
        return m_sample

    def loss(self, X):
        dist = self.condition(X)
        loss = -dist.log_prob(X)
        return loss

    def get_loss(self, X):
        loss_val = self.loss(X)
        return loss_val

Then you can access loss from your model:


graph_nn_molecules = Molecule_graph_nn(args,...)
# consider X being a molecule input

loss = graph_nn_molecules.get_loss(X)

# now you can use loss inside a regular training loop
...
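For concreteness, a minimal training-loop sketch; the Adam optimizer, learning rate, epoch count, and single-batch setup here are assumptions for illustration, not a prescription:

import torch

optimizer = torch.optim.Adam(graph_nn_molecules.parameters(), lr=1e-3)  # assumed optimizer and learning rate

for epoch in range(100):
    optimizer.zero_grad()
    loss = graph_nn_molecules.get_loss(X).sum()  # sum in case log_prob returns one value per data point
    loss.backward()
    optimizer.step()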

Edit: I just noticed that I omitted explicitly accounting for the measurements m in the loss etc. here. Let me know if I should edit to clarify, or whether it is self-explanatory.
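In short, a minimal sketch of how the measurement would enter, assuming m is the tensor of measured values paired with X (the names are placeholders):

dist = graph_nn_molecules.condition(X)  # predictive distribution over the measurement
loss = -dist.log_prob(m).sum()          # score the measured values m, not the input X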

karalets · Mar 27 '20

@karalets

Sorry for being out of the loop for a few days.

I left a few experiment results here:

https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-05-173959786741/report.md

It seems that the model did not converge.

I did some sanity checks, and there seem to be no issues when it comes to minimizing MSE alone.

Also, for this formulation, what would prevent the model from predicting a sigma close to zero and yielding a very high likelihood, regardless of the mean point estimate?

yuanqing-wang · Apr 09 '20

Hey,

Thanks for adding this. Can we unpack these results and contrast them with MSE? Also, if you ignore the variance from the neural network output and use the log-likelihood with a fixed variance of 1.0, you should exactly recover the MSE loss.

i.e. dist = Normal(loc=theta_mean, scale=torch.ones(...))

Could you please run this as a quick sanity check?

And to clarify why this is the case: look at the functional form of log Gaussian(mu, 1); it reduces exactly to the MSE term sitting in the exponent of the Gaussian, since the sigma of 1 drops out.
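Spelling out the algebra for an observation y:

$$-\log \mathcal{N}(y \mid \mu, \sigma) = \frac{(y - \mu)^2}{2\sigma^2} + \log \sigma + \frac{1}{2}\log(2\pi)$$

With sigma fixed to 1, the last two terms are constants, so minimizing this negative log-likelihood is exactly minimizing the squared error, up to a constant scale and offset.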

karalets · Apr 09 '20

https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-10-151826363179/report.md

Yes, fixing sigma to 1 makes it train just fine.

yuanqing-wang · Apr 10 '20

Great, that means we have numerical instabilities with learning sigma.

One thing immediately comes to mind: sigma has to be positive, so typically we learn a transformed version of it with NNs that can only map to positive values, i.e. bla = nn(x); sigma = exp(bla) -> N(mu, sigma) is well defined because sigma is guaranteed to be positive. It turns out that bla = log(sigma) implicitly in that parametrization (but one never writes it that way).

Is that how you ensure positivity of sigma? Can you point me to the code?

karalets · Apr 10 '20

Now I've constrained it to be positive by adding a parameter transformation function.

https://github.com/choderalab/pinot/blob/2f2996311fd418f24b3f1a9c94aa654f39605d95/pinot/net.py#L23

yuanqing-wang · Apr 10 '20

But numerical stability aside, I still don't understand why the model wouldn't predict a super small sigma and thus get a huge likelihood value for at least some of the data points.

yuanqing-wang · Apr 10 '20

You misunderstand the purpose of the noise model.

It can only do this if there is no measurement error whatsoever and the mean prediction can always be essentially perfect. If, however, there is any error, it needs to learn a noise shape.

This entire discussion is about the shape of the observation noise. A model would never learn an infinitely small sigma unless it has noiseless outputs which it can predict perfectly, in which case an infinitely small sigma is perfectly appropriate.

In words: the sigma here represents the model's belief about the noisiness of the outputs. This is not the same as model uncertainty; it is irreducible uncertainty that comes from noise, not from uncertainty about the model.

karalets · Apr 10 '20

I think I'm going to try other means to constrain sigma to be positive, namely exp rather than abs.

yuanqing-wang · Apr 10 '20

Again, please point me to the code; there are best practices here that I would like to encourage. This is almost certainly the source of the behavior you are encountering.

karalets · Apr 10 '20

Also, if you look at my example above in this issue, I showed exactly how to do this in code, and I never used abs.

reminder:

def condition(self, X):
    theta = self.forward(X)
    mu = theta[0]
    sigma = torch.exp(theta[1])
    dist = Normal(mu, sigma)
    return dist

karalets · Apr 10 '20

Yes. The only difference is abs rather than exp. I'll change it to exp now.

https://github.com/choderalab/pinot/blob/master/pinot/net.py#L23

yuanqing-wang · Apr 10 '20

Now I've constrained it to be positive by adding a parameter transformation function.

https://github.com/choderalab/pinot/blob/2f2996311fd418f24b3f1a9c94aa654f39605d95/pinot/net.py#L23

Yeah, abs is no good. Use exp (you could also do softplus).

abs has problems with gradients (it is non-smooth at zero), and it still allows sigma to be exactly zero.
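For reference, a side-by-side sketch of the three parametrizations; the tensor raw stands in for the network's unconstrained sigma head, and the names are mine:

import torch
import torch.nn.functional as F

raw = torch.randn(8, requires_grad=True)  # stand-in for the network's unconstrained output

sigma_abs = torch.abs(raw)      # discouraged: non-smooth at 0 and can output exactly 0
sigma_exp = torch.exp(raw)      # strictly positive; raw is implicitly log(sigma)
sigma_sp = F.softplus(raw)      # strictly positive, grows only linearly for large raw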

karalets · Apr 10 '20

Yes. The only difference is abs rather than exp. I'll change it to exp now.

https://github.com/choderalab/pinot/blob/master/pinot/net.py#L23

Please retry training after this and compare to the likelihoods and errors you get when setting sigma to 1.

karalets · Apr 10 '20

Free sigma: https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-10-165531617559/report.md

sigma=1: https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-10-151826363179/report.md

yuanqing-wang · Apr 10 '20

Both are converging though, judging from the loss trajectories:

sigma=1: https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-10-151826363179/loss.png

free sigma: https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-10-165531617559/loss.png

Let me just run this for a few more epochs.

yuanqing-wang · Apr 10 '20

That makes no sense... maybe play with the hyperparameters and plot some loss curves to inspect what is actually happening?

MSE is clearly the wrong model; it can't be that a modestly better model does so much worse, especially given that the mean is still the same and all this does is rescale the error residuals.

karalets · Apr 10 '20

It seems that for the Gaussian model the results are not consistent across multiple experiments. Looking into that.

yuanqing-wang · Apr 11 '20

Ah, adding a small epsilon to sigma seems to fix it.

yuanqing-wang · Apr 13 '20

OK, then there is a numerical issue somewhere; that object should not even need the epsilon... especially since the epsilon changes the noise model to be 'at least epsilon'.

You could also try lower-bounding the noise, i.e. by clamping sigma in PyTorch to be at least some reasonable value.
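i.e., something along these lines (a sketch; the 1e-3 bound is an arbitrary placeholder, not a recommendation):

import torch

log_sigma = torch.randn(8)                     # stand-in for the network's log-sigma output
sigma = torch.exp(log_sigma)

sigma_eps = sigma + 1e-3                       # additive epsilon: the noise model becomes 'at least epsilon'
sigma_clamped = torch.clamp(sigma, min=1e-3)   # hard lower bound via clamping instead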

karalets · Apr 13 '20

Question: how are we going to preserve this sigma once we move on to Bayesian methods, namely sampling using Monte Carlo?

yuanqing-wang · Apr 15 '20

(image: plots from repeated runs of the experiment)

Going back to this issue a bit. It seems that the behavior is not consistent across multiple experiments.

yuanqing-wang · Apr 30 '20

@karalets @maxentile

The scripts for this experiment are here: https://colab.research.google.com/drive/1B_-M0Dn9TekYfEFkJD4QppkpJmx4h4iE

yuanqing-wang · May 01 '20

Sorry for the slow response!

A couple of questions:

  • Do you have a prior defined on sigma (or log_sigma)?
  • Could you comment on what exactly is different between the multiple experiments plotted above? (different random seed only, or different data subsample, or ...)
  • For the experiments where you plot RMSE as a function of iteration, could you also plot the log_probability and the sigma (or log_sigma) parameter? These may be diagnostic.

maxentile · May 07 '20

All this should not be a problem once the pipeline is cleaned up enough that we can quickly iterate on some things.


karalets · May 07 '20

Also: we had a chat about this offline last week with YQ. I would advocate first trying simple things like using a globally learned sigma, and, in both the homoscedastic and heteroscedastic sigma cases, clipping sigma to 'simulate' a constraint. If it trains better with some clipping to min-max values, this will help.

We will put priors on everything in a future iteration, and priors are effectively a softer version of such clipping, but just identifying the issue is most easily done with simple things like hard constraints.

Once this is clarified, we won't need the additive epsilon that YQ introduced.
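For illustration, a sketch of what a globally learned (homoscedastic) sigma with clipping could look like; this is not pinot's API, and the clamp bounds are placeholders:

import torch
import torch.nn as nn
from torch.distributions import Normal

class HomoscedasticHead(nn.Module):
    # one log_sigma shared across all inputs, clipped to a min-max range while debugging
    def __init__(self):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, mu):
        log_sigma = self.log_sigma.clamp(min=-5.0, max=5.0)  # 'simulate' a constraint via clipping
        return Normal(mu, torch.exp(log_sigma))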


karalets · May 07 '20

Tried all the options here: https://colab.research.google.com/drive/1B_-M0Dn9TekYfEFkJD4QppkpJmx4h4iE?usp=sharing

yuanqing-wang · May 07 '20

  • [x] track values of sigma
  • [x] report likelihood too
  • [x] clipping at a larger value
  • [ ] enable priors to be encoded in pinot.Net
  • [ ] priors
  • [ ] infrastructure to make giant tables

yuanqing-wang · May 07 '20

Also tagging @dnguyen1196 here to be aware of these discussions.

karalets · May 13 '20

Although I'm not sure whether this is representative, the heteroscedastic model in this case did indeed maximize the likelihood of the training data, but it does not generalize well?

Results: https://htmlpreview.github.io/?https://github.com/choderalab/pinot/blob/infrastructure/scripts/heteroscedastic_vs_homoscedastic/results.html

These are trained on the entire ESOL dataset for 2000 epochs.

(Pardon the typo in the last section: sigma is trainable for the third experiment.)

Scripts: https://github.com/choderalab/pinot/blob/infrastructure/scripts/heteroscedastic_vs_homoscedastic/run.py

yuanqing-wang · May 13 '20