
Questions about prediction of SNGP


Hi @jereliu ,

I have a few questions about the inference stage of SNGP:

  1. According to Eq. (9) and Algorithm 1 in the paper, shouldn't there be K precision matrices, one per output dimension, where K is the number of classes? Each one would have shape [batch_size, batch_size], so the full set would be [K, batch_size, batch_size]. Am I misunderstanding something? In the code I can only find a single covariance matrix of shape [batch_size, batch_size].
  2. After searching the code for a while, I couldn't find the sampling step, i.e. step 5 of Algorithm 2 (a sketch of what I mean follows below). Without it, the prediction is essentially a MAP prediction, apart from the differences during training. This sampling step should be essential to the method, right?
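
To make step 5 concrete, this is roughly what I expected to find. It is a minimal NumPy sketch of my own, not code from this repo; the names logit_mean (shape [batch, K]) and logit_cov (shape [batch, batch]) are assumed to come from the GP output layer:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mc_softmax_predict(logit_mean, logit_cov, num_samples=100, seed=0):
    """Monte-Carlo estimate of E[softmax(g)] for g ~ N(logit_mean, logit_cov)."""
    rng = np.random.default_rng(seed)
    batch, k = logit_mean.shape
    # Cholesky factor of the [batch, batch] logit covariance (jitter for stability).
    chol = np.linalg.cholesky(logit_cov + 1e-6 * np.eye(batch))
    probs = np.zeros((batch, k))
    for _ in range(num_samples):
        eps = rng.standard_normal((batch, k))
        # Draw correlated logit samples and average the softmax outputs.
        probs += softmax(logit_mean + chol @ eps)
    return probs / num_samples
```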

I would appreciate it if you could explain this in more detail.

Best, Jianxiang

JianxiangFENG avatar Feb 02 '21 21:02 JianxiangFENG

Hi Jianxiang,

Thanks for getting in touch! Sorry for the confusion about the mismatch between the paper and this implementation. Yes, we made two changes for computational-feasibility and performance reasons:

  1. After some experimentation, we replaced the Laplace-approximated posterior variance with the posterior variance under a Gaussian likelihood, so that a single matrix is shared across all classes. The two reasons for this change are (1) computational feasibility (especially for ImageNet-scale tasks) and (2) empirically better OOD performance.

  2. We replaced the Monte-Carlo approximation with the mean-field approximation, also for computational feasibility (e.g., here; this is mentioned in Appendix A). A sketch of both changes follows below.
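
To make both changes concrete, here is a minimal NumPy sketch, illustrative only rather than the code in this repo; the function names and the mean-field factor of pi/8 (the standard probit-style constant) are assumptions of the sketch. The key point is that neither step depends on a per-class covariance:

```python
import numpy as np

MEAN_FIELD_FACTOR = np.pi / 8.0  # standard probit-style mean-field constant

def update_precision(precision, phi):
    """Accumulate the posterior precision under a Gaussian likelihood.

    phi: random features for a minibatch, shape [batch, D].
    Under a Gaussian likelihood the Hessian contribution is phi.T @ phi
    regardless of the class label, so a single [D, D] precision matrix is
    shared by all K classes. Initialize with the prior, np.eye(D).
    """
    return precision + phi.T @ phi

def mean_field_predict(logit_mean, phi, covariance):
    """Mean-field replacement for Monte-Carlo averaging (cf. Appendix A).

    logit_mean: [batch, K] posterior-mean logits.
    covariance: [D, D] inverse of the accumulated precision matrix.
    """
    # Per-example predictive logit variance: diag(phi @ covariance @ phi.T).
    var = np.einsum('bd,de,be->b', phi, covariance, phi)
    # Scale the logits down where the predictive variance is large.
    adjusted = logit_mean / np.sqrt(1.0 + MEAN_FIELD_FACTOR * var)[:, None]
    z = adjusted - adjusted.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)  # softmax of adjusted logits
```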

jereliu avatar Feb 03 '21 00:02 jereliu

Thank you for the quick reply!

  1. After some experimentation, we replaced the Laplace-approximated posterior variance with the posterior variance under a Gaussian likelihood, so that a single matrix is shared across all classes. The two reasons for this change are (1) computational feasibility (especially for ImageNet-scale tasks) and (2) empirically better OOD performance.

OK, it's more computationally efficient. However, I don't get the intuition for why one variance shared across all classes can lead to better performance; it doesn't seem to make much sense. It's like temperature scaling with a single temperature hyperparameter instead of modelling the uncertainty of each class separately. Maybe in other scenarios different variances for different classes are needed. But thanks for letting me know about this.

  1. We replaced the Monte-Carlo approximation with the mean-field approximation, also for computational feasibility (e.g., here; this is mentioned in Appendix A).

This is a neat and simple approximation. I am wondering how large the difference is between sampling and the mean-field approximation; I assume you have run experiments on this. Any systematic comparisons or take-home messages about it? (A sketch of the kind of comparison I have in mind is below.) Thank you in advance!
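
For reference, this is the kind of side-by-side check I mean, a toy sketch of my own on synthetic logits, reusing softmax and mc_softmax_predict from the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, k = 8, 3
logit_mean = rng.standard_normal((batch, k))
var = rng.uniform(0.5, 2.0, size=batch)   # per-example logit variance
cov = np.diag(var)                        # diagonal [batch, batch] covariance

# Monte-Carlo estimate vs. the mean-field approximation of E[softmax].
mc = mc_softmax_predict(logit_mean, cov, num_samples=5000)
mf = softmax(logit_mean / np.sqrt(1.0 + (np.pi / 8.0) * var)[:, None])

print("max abs difference:", np.abs(mc - mf).max())
```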

JianxiangFENG avatar Feb 05 '21 14:02 JianxiangFENG

Hi, just throwing out a possible explanation for 1: maybe one covariance matrix for all classes is better because it reduces overfitting. On large datasets we might see the opposite (more intuitive) effect, i.e. better performance with a covariance matrix per class, because there would be enough data to approximate a covariance matrix for each class well.

mdabbah avatar Mar 01 '21 23:03 mdabbah

  This is a neat and simple approximation. I am wondering how large the difference is between sampling and the mean-field approximation. Any systematic comparisons or take-home messages about it?

@JianxiangFENG Did you get or figure out an answer to your last question? I am wondering this myself :)

Jordy-VL avatar Jun 03 '21 21:06 Jordy-VL

@Jordy-VL Hey, I did not follow up on it in the end. But the relevant paper (https://arxiv.org/abs/2006.0758) is worth reading.

JianxiangFENG avatar Jun 05 '21 10:06 JianxiangFENG