
Subnetwork and Marginal Likelihood for Prior Precision estimation

Open georgezefko opened this issue 2 years ago • 13 comments

Hello,

Thanks again for this library :)

I am using the library to perform Laplace on a specific module of my network. In the paper on subnetwork inference, I read that, for finding the optimal prior precision, it is better to use grid-search cross-validation (CV) than the marginal likelihood. I used both in my case (just to check), and I get approximately the same prior precision from each.

I understand that using marglik on a subset of weights might not give a result representative of the entire model, but can it be outright wrong? Is there a strong reason why CV would make more sense than the marginal likelihood for estimating the prior precision?

Lastly, do you think I might be getting similar results by accident, or through an incorrect implementation of CV?
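
For context, here is a minimal sketch of my setup, following the laplace-torch README (the module name, the data loaders, and the exact `optimize_prior_precision` options are placeholders / my assumptions):

```python
import torch
from laplace import Laplace
from laplace.utils import ModuleNameSubnetMask

# model, train_loader, val_loader are assumed to exist;
# 'layer.1' is a placeholder for the module of interest.
subnetwork_mask = ModuleNameSubnetMask(model, module_names=['layer.1'])
subnetwork_mask.select()
subnetwork_indices = subnetwork_mask.indices

la = Laplace(model, 'classification',
             subset_of_weights='subnetwork',
             hessian_structure='full',
             subnetwork_indices=subnetwork_indices)
la.fit(train_loader)

# Option 1: tune the prior precision via the marginal likelihood.
la.optimize_prior_precision(method='marglik')

# Option 2: tune it via grid-search CV on a validation set.
la.optimize_prior_precision(method='CV', val_loader=val_loader)
```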

georgezefko avatar Jul 16 '22 19:07 georgezefko

Hi @georgezefko,

Thanks a lot for your interest in our library and the subnetwork LA in particular, I really appreciate it — sorry also for the late response!

Indeed, using the marginal likelihood (ML) for prior precision tuning might not yield reasonable results; I cannot say much more than that I’m afraid, as we don’t yet have a good understanding of how to properly use the ML in this context, which remains an open research question.

If you get similar results with both the ML and CV, then this might indeed be coincidence.

CV should make more sense as it looks directly at validation performance, while the ML depends on the particular shape of the weight posterior (where again, it’s not well understood how this behaves with posteriors over just a subset of weights).

Therefore, using CV is currently recommended for prior precision tuning with subnetwork LA.
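
Conceptually, the CV option simply grid-searches the prior precision against validation performance, roughly as follows (a hand-rolled sketch for classification, not the library's exact implementation):

```python
import torch

# Hand-rolled grid search over the prior precision (sketch only).
# `la` is a fitted Laplace object; `val_loader` yields (X, y) batches.
grid = torch.logspace(-4, 4, steps=100)
best_prec, best_nll = None, float('inf')

for prec in grid:
    la.prior_precision = prec
    nll, n = 0.0, 0
    for X, y in val_loader:
        probs = la(X)  # posterior predictive probabilities
        nll += torch.nn.functional.nll_loss(
            torch.log(probs + 1e-12), y, reduction='sum').item()
        n += len(y)
    if nll / n < best_nll:
        best_prec, best_nll = prec, nll / n

la.prior_precision = best_prec  # keep the best value found
```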

Hope that helps, otherwise please feel free to ask further questions!

edaxberger avatar Jul 26 '22 13:07 edaxberger

Thank you for your reply @edaxberger, it definitely clears things up.

georgezefko avatar Jul 28 '22 15:07 georgezefko

@edaxberger One more question: I have seen in your examples that you use the marginal likelihood when you perform Laplace on the last layer. Is there a specific reason why it works in this case? Since the last layer is a form of subnetwork, shouldn't the ML be problematic there as well?

georgezefko avatar Aug 01 '22 19:08 georgezefko

Thanks for the great follow-up question! The last-layer approximation is a special case: there, the model can be interpreted as a fixed feature extractor (comprising all layers except the last one) with just a Bayesian linear / logistic regression model on top (the last layer). Since we then effectively have a linear model, we should be able to perform model selection with the ML. In the case of more general subnetworks, however, we don't have such a simple linear-model correspondence (at least not one where the outputs of the penultimate NN layer serve as the features), so there it becomes trickier. Hope that makes sense!
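
In code, that special case is just the standard last-layer usage of the library (a sketch; the loaders are placeholders):

```python
from laplace import Laplace

# Last-layer Laplace: all layers except the last act as a fixed
# feature extractor, so the posterior is over a linear model only.
la = Laplace(model, 'classification',
             subset_of_weights='last_layer',
             hessian_structure='kron')
la.fit(train_loader)

# Tuning the prior precision via the marginal likelihood is
# well-justified here, since ML-based model selection is well
# understood for (generalised) linear models.
la.optimize_prior_precision(method='marglik')
```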

edaxberger avatar Aug 02 '22 09:08 edaxberger

@edaxberger Thank you for the quick response. Following up on that, I have another question related to the specific case I am working on.

So I have a Spatial Transformer Network (https://arxiv.org/abs/1506.02025). The model uses a localisation network (the network of interest), which applies a spatial transformation to the input by learning some parameters $\theta$.

The output of that network then feeds into another convolutional neural network, which performs the classification task in this case.

I have applied subnetwork Laplace to the last layer of the localisation network, as sketched below. This is where I get the same prior precision from ML and CV.
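
A sketch of the selection (the module name `loc_net.fc_loc.2` is hypothetical and depends on how the STN is defined):

```python
from laplace import Laplace
from laplace.utils import ModuleNameSubnetMask

# Select only the final (regression) layer of the localisation
# network; 'loc_net.fc_loc.2' is a hypothetical module name.
mask = ModuleNameSubnetMask(model, module_names=['loc_net.fc_loc.2'])
mask.select()

la = Laplace(model, 'classification',
             subset_of_weights='subnetwork',
             hessian_structure='full',
             subnetwork_indices=mask.indices)
la.fit(train_loader)
la.optimize_prior_precision(method='marglik')  # the call in question
```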

My question is this: does the last layer of the localisation network (since we are talking about a CNN) exhibit the same properties as in last-layer Laplace (restricted to the localisation network), thereby making the use of the ML reasonable?

It might be a very specific question, but I would like to hear your thoughts.

georgezefko avatar Aug 02 '22 11:08 georgezefko

Thanks for the follow-up question! That's an interesting use case. I'm not too familiar with Spatial Transformer Networks, but what you're describing sounds somewhat reasonable.

What is the reason for using Last Layer Laplace on the localisation network instead of the classification network? Do you only want to perform model selection for the former but not the latter?

In any case, if you use Last Layer Laplace specifically, @wiseodd is an expert and might be able to give you more insights. Also, @aleximmer is an expert on ML model selection, so might also have some thoughts on this specific setting.

edaxberger avatar Aug 03 '22 07:08 edaxberger

@edaxberger thanks for the quick response.

The idea is that I want to turn the localisation network into a Bayesian one, so that I can sample several transformations from it. That way, I can perform a kind of data augmentation.
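
Concretely, the plan looks something like this (a sketch; it assumes `la.sample` returns full parameter vectors with the non-subnetwork entries fixed at their MAP values, and `model.transform` is a hypothetical method):

```python
import torch
from torch.nn.utils import vector_to_parameters

# Draw several weight samples for the Bayesian localisation layer;
# each sample yields a different spatial transformation of the input,
# acting as a learned data augmentation.
augmented = []
for w in la.sample(n_samples=10):
    vector_to_parameters(w, model.parameters())
    with torch.no_grad():
        # model.transform (hypothetical) returns the spatially
        # transformed input under the current weight sample.
        augmented.append(model.transform(x))
```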

georgezefko avatar Aug 03 '22 07:08 georgezefko

I see, the idea sounds sensible.

In this case, I have another question: what exactly is the output of the localisation network? As I understand it, the localisation network maps from image space to image space (i.e. it applies some learned transformation to the image)? In that case, I am wondering which likelihood you are using, as our Laplace package currently only works with regression and cross-entropy likelihoods. Are you using a pixel-wise (independent) cross-entropy likelihood over the full output image? That seems quite expensive for large images, so I'm wondering how exactly you implemented this.

edaxberger avatar Aug 03 '22 14:08 edaxberger

Yes, that sounds right. More formally, the task of the localisation network is to identify the parameters $\theta$ of the inverse transformation $T_\theta(G)$ that translates the input feature map into a canonical pose, making recognition in subsequent layers easier. It has a regression layer at the end to produce those parameters.
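
For reference, such a network looks roughly like this (a sketch in the spirit of the standard PyTorch STN tutorial; the layer sizes are placeholders for 28x28 single-channel inputs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocNet(nn.Module):
    """Localisation network: regresses the 6 parameters of a 2D
    affine transformation theta and applies it to the input."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        # Final regression layer: outputs theta as a 2x3 affine matrix.
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 3 * 3, 32), nn.ReLU(), nn.Linear(32, 6),
        )

    def forward(self, x):
        xs = self.features(x).flatten(1)
        theta = self.fc_loc(xs).view(-1, 2, 3)
        # Apply the (inverse) transformation T_theta(G) to the input.
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```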

In my case, I have used the MNIST dataset as well as a more complicated one, the Mapillary Traffic Sign dataset.

I am not sure what you mean by "which" likelihood. I have used the option "marglik" on the selected module (the last layer) of the localisation network.

Does that make sense?

georgezefko avatar Aug 03 '22 17:08 georgezefko

Ah I see, so it’s effectively a regression model — in that case it does make sense to me to use a Last Layer LA.

Curious to see how well this works, please feel free to keep us updated on your progress. And of course don’t hesitate to ask further questions!

edaxberger avatar Aug 04 '22 07:08 edaxberger

Hi @edaxberger, hi @georgezefko, very interesting discussion!

I'd like to add that the regression task (predicting $\theta$) is essentially unsupervised. You are computing $p(y | x) = \int p(y | x, \theta) p(\theta | x) d\theta$, where $p(\theta | x)$ is parametrised by the Laplace approximation (because $\theta = f_w(x)$ for some weights $w$, the posterior over which we obtain via Laplace). This is the model from https://arxiv.org/pdf/2004.03637.pdf, using Laplace on the weights instead of VI on the outputs directly.

The question is: does the last-layer argument hold in this case, where there is no ground truth on the regression task, but rather a downstream task? In other words, does it make sense to use the marginal likelihood for a subnetwork (the last layer of a certain module in the neural net) given that there are more weights downstream which are fixed? My intuition is that, since the downstream weights are fixed, it is actually OK (by the same argument as for the last layer: the downstream weights are just a deterministic function mapping the outputs to other outputs).
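
In code, a Monte Carlo version of that integral would look something like this (a sketch; `loc_net`, `apply_transform`, and `classifier` are assumed attributes of the model, and `la.sample` is assumed to return full parameter vectors):

```python
import torch
from torch.nn.utils import vector_to_parameters

def predictive(x, la, model, n_samples=30):
    """MC estimate of p(y | x) = E_{w ~ Laplace posterior}[p(y | x, f_w(x))],
    i.e. the integral over theta = f_w(x) above."""
    probs = []
    for w in la.sample(n_samples=n_samples):
        vector_to_parameters(w, model.parameters())
        theta = model.loc_net(x)               # theta = f_w(x)
        x_t = model.apply_transform(x, theta)  # apply T_theta to x
        # Downstream weights are fixed: just a deterministic map.
        probs.append(model.classifier(x_t).softmax(-1))
    return torch.stack(probs).mean(0)          # average over samples
```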

I'd be curious to hear your thoughts! Thanks!

polaschwoebel avatar Aug 11 '22 08:08 polaschwoebel

Thanks a lot for joining the discussion, @polaschwoebel, and sorry for the late response!

What you describe sounds sensible to me as well.

Perhaps @wiseodd or @aleximmer, who are experts on the last-layer Laplace approximation and the Laplace marginal likelihood, respectively, have further thoughts?

edaxberger avatar Aug 22 '22 10:08 edaxberger

I agree with @polaschwoebel's comment that it is somewhat sensible once the other weights are fixed, and this is also consistent with what we see in the literature:

1. Deep kernel learning, which is trained using the marginal likelihood while ignoring all but the last layer, tends to overfit because the feature-extraction layers are not fixed (e.g. https://proceedings.mlr.press/v161/ober21a/ober21a.pdf).
2. A similar approach with online Laplace model selection only on the last layer also fails.
3. Using the marginal likelihood on the last layer after fixing the feature extractor works fine.

The last two observations are from the Laplace redux paper and from a work where we use the marginal likelihood only on the last layer to compare natural language representations (https://arxiv.org/pdf/2110.08388.pdf).

aleximmer avatar Sep 10 '22 15:09 aleximmer