My-implementation-of-What-Uncertainties-Do-We-Need-in-Bayesian-Deep-Learning-for-Computer-Vision
the sigma in aleatoric uncertainty
Hi, I notice that you get mu and sigma as in https://github.com/ShellingFord221/My-implementation-of-What-Uncertainties-Do-We-Need-in-Bayesian-Deep-Learning-for-Computer-Vision/blob/e6ed204cd25ac995eb8ec8da701117dcd5aabb1d/classification_aleatoric.py#L81.
As far as I know, sigma should be larger than zero, so how can the raw value from logit.split satisfy this condition?
In "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?":
In practice, we train the network to predict the log variance
I think 'sigma' in the code is actually 'log_sigma', and the real sigma is exp(log_sigma). But @ShellingFord221 seems to be missing this step.
Sorry for the late reply. For @whisney 's question: in a regression task, the loss is (Eq. 5 in the original paper)

L = (1/N) * sum_i [ ||y_i - f(x_i)||^2 / (2 * sigma(x_i)^2) + (1/2) * log(sigma(x_i)^2) ]

Since sigma is in the denominator, the gradient can sometimes explode at the beginning of training. To avoid this, we predict alpha = log(sigma^2) in practice. But in a classification task, the formula becomes (Eq. 12 in the original paper)

x_hat_{i,t} = f_i + sigma_i * eps_t, with eps_t ~ N(0, I)
L = sum_i log( (1/T) * sum_t exp( x_hat_{i,t,c} - log sum_{c'} exp(x_hat_{i,t,c'}) ) )

Now sigma is not in a denominator, so there is no need to predict alpha = log(sigma^2); we directly predict sigma instead.

For @ConanCui 's question: you can use an absolute value layer to prevent sigma from being negative. But in my experiments, there seems to be no obvious difference between using this layer or not. It may depend on the task, I can't say for sure.
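To make this concrete, here is a minimal PyTorch-style sketch of predicting sigma directly, with the optional absolute value layer mentioned above (the class and variable names are just illustrative, not the exact code in classification_aleatoric.py):

```python
import torch.nn as nn

class MuSigmaHead(nn.Module):
    """Illustrative head: one linear layer whose output is split into class
    logits (mu) and a per-class sigma, mirroring the logit.split pattern."""
    def __init__(self, in_features, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes * 2)
        self.num_classes = num_classes

    def forward(self, x):
        out = self.fc(x)
        mu, sigma = out.split(self.num_classes, dim=-1)
        # Optional absolute-value "layer": the raw linear output can be
        # negative, taking abs keeps sigma >= 0.
        sigma = sigma.abs()
        return mu, sigma
```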
Thank you for your reply.
https://github.com/tanyanair/segmentation_uncertainty/blob/master/bunet/utils/tf_metrics.py#L22
This is part of the official code of a 2018 MICCAI paper about aleatoric uncertainty (called 'Prediction Variance' in that paper); it is a segmentation task. In that code, the author predicts log_sigma and applies exp later. So I'm not sure which of you is right.
I think the easiest way to settle this question is to observe whether the training process is stable. If the gradient explodes at the beginning, you should predict alpha = log(sigma^2) rather than sigma. If not, I think there is no need to predict sigma in another form.
In my opinion, in terms of network structure (the output comes directly from a conv layer with no activation), we cannot guarantee that the network output is greater than 0, but sigma^2 must be >= 0. So we can only expect the network to predict log(sigma^2) and apply exp to make it > 0.
But the network has strong learning ability. This means that the network will learn to output sigma^2 even if we do not apply exp (although the network structure gives no guarantee that the output is positive, the network will tend to output a value >= 0). Conversely, if we apply exp, the network will learn to output log(sigma^2), which is more stable in theory.
I don't know if this description is correct, thank you.
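As a concrete sketch of what I mean by "apply exp" (just my own illustration, not the code in this repo):

```python
import torch
import torch.nn as nn

class LogVarianceHead(nn.Module):
    """Illustrative head: predicts mu and log(sigma^2); exponentiating
    guarantees sigma^2 > 0 regardless of what the linear layer outputs."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features * 2)
        self.out_features = out_features

    def forward(self, x):
        out = self.fc(x)
        mu, log_var = out.split(self.out_features, dim=-1)
        sigma = torch.exp(0.5 * log_var)  # sigma = exp(log(sigma^2) / 2) > 0 by construction
        return mu, sigma
```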
I think it should be right. It has been a long time; are you still studying the uncertainty estimation problem? Assuming that the noise term is a multivariate Normal distribution, how do we construct a full covariance matrix to represent the latent distribution, and how is that reflected in the code?
Hi, when assuming the output of the network is a multivariate Gaussian distribution, we also assume that the features are independent of each other. Therefore, the covariance matrix of our multivariate Gaussian distribution is a diagonal matrix, with one variance element per feature on the diagonal. In the code, we implement it as sigma in:
https://github.com/ShellingFord221/My-implementation-of-What-Uncertainties-Do-We-Need-in-Bayesian-Deep-Learning-for-Computer-Vision/blob/61f14395b189264f276d683972dd9c5786c0d55a/classification_aleatoric.py#L102
Then we use sigma as well as mu to draw samples from this multivariate Gaussian distribution and generate multiple predictions for the input (i.e. Eq. 12 in the original paper):
https://github.com/ShellingFord221/My-implementation-of-What-Uncertainties-Do-We-Need-in-Bayesian-Deep-Learning-for-Computer-Vision/blob/61f14395b189264f276d683972dd9c5786c0d55a/classification_aleatoric.py#L108
You can also see the discussion in Issue #1 . Hope that helps.
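In case it helps, here is a rough, self-contained sketch of the sampling step those lines implement (Eq. 12 with a diagonal covariance). The function and variable names are mine, not the ones used in the repo:

```python
import torch

def aleatoric_classification_loss(mu, sigma, target, num_samples=100):
    """Corrupt the logits with Gaussian noise, average the per-sample softmax
    likelihood of the true class over T samples, then take the negative log."""
    # mu, sigma: (batch, num_classes); target: (batch,) class indices
    eps = torch.randn(num_samples, *mu.shape)                    # eps_t ~ N(0, I)
    logit_samples = mu.unsqueeze(0) + sigma.unsqueeze(0) * eps   # x_hat_t = mu + sigma * eps_t
    log_probs = torch.log_softmax(logit_samples, dim=-1)
    idx = target.view(1, -1, 1).expand(num_samples, -1, 1)
    log_p_true = log_probs.gather(-1, idx).squeeze(-1)           # (T, batch)
    # log((1/T) * sum_t p_t) = logsumexp_t(log p_t) - log(T)
    log_avg = torch.logsumexp(log_p_true, dim=0) - torch.log(torch.tensor(float(num_samples)))
    return -log_avg.mean()
```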
the covariance matrix of our multi-variate Gaussian distribution is a diagonal matrix
Assuming instead that the features are dependent on each other, the covariance matrix of our multivariate Gaussian distribution is a full matrix; how do we then get the final logit?
Normally we do not assume the features are dependent on each other; they are orthogonal. If they are correlated, I think there are two ways to tackle the problem. One is to reduce the dimension of the vectors to make sure the remaining features are independent. The other is to learn the relationship between dependent features and then make predictions according to the full covariance matrix. This can be done with a Gaussian Process, which sets kernels to accommodate feature dependence and then optimizes the hyper-parameters of those kernels. This may be beyond the scope of deep learning, but it may provide a solution to your problem.
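To make the full-covariance case concrete, here is a rough sketch of how logits could be sampled once you have a full covariance matrix (my own illustration, not code from this repo): factor the covariance with a Cholesky decomposition and push a standard-normal noise vector through it.

```python
import torch

def sample_logits_full_cov(mu, cov, num_samples=20):
    """Draw logit samples from N(mu, cov), where cov is a full, positive-definite
    (num_classes, num_classes) covariance matrix and mu has shape (num_classes,)."""
    scale = torch.linalg.cholesky(cov)           # cov = scale @ scale.T
    eps = torch.randn(num_samples, mu.shape[0])  # eps_t ~ N(0, I)
    # x_t = mu + scale @ eps_t has covariance scale @ scale.T = cov
    return mu + eps @ scale.T
```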
I don't know if you have read the papers "Correlated Input-Dependent Label Noise in Large-Scale Image Classification" and "Stochastic Segmentation Networks: Modelling Spatially Correlated Aleatoric Uncertainty", which assume that the features are dependent and make a low-rank approximation. I just don't understand the logit generation, but thank you very much for your answer.
In addition, have you tested the effectiveness of this method on other datasets? I tried other datasets, such as face data, and found that this method does not bring performance improvements. Does it have a lot to do with the choice of backbone network? I use ResNet-18 and add a dropout layer after each layer.