robust_loss_pytorch
why use the log function to regularize the scale?
Hi, I have a question about the implementation. In the Distribution().nllfun method, why do you use the log function to regularize the scale so that it decreases? I would have expected an L2 or L1 penalty, which is more common.
https://github.com/jonbarron/robust_loss_pytorch/blob/9831f1db8006105fe7a383312fba0e8bd975e7f6/robust_loss_pytorch/distribution.py#L208
Log(scale) shouldn't be thought of as a regularizer; it's the log of the partition function of a probability distribution. Basically, this is not a "design decision" like L2 or L1 weight decay --- it ensures that the PDF implied by the loss function, viewed as a negative log-likelihood, integrates to 1, and it's the only thing you can minimize here that does that.
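A minimal sketch of this point (not the repo's code), using the Gaussian special case alpha = 2 where the partition function has a closed form: the NLL is the loss rho(x, 2, c) = 0.5*(x/c)^2 plus log(c) plus log(sqrt(2*pi)), and that log(scale) term is exactly what makes exp(-NLL) a normalized density.

```python
import math
import torch

def gaussian_nll(x, scale):
    # Special case alpha = 2 of the general robust loss: rho(x, 2, c) = 0.5 * (x / c)^2.
    # The NLL adds log(scale) plus log(sqrt(2*pi)) -- the log partition function --
    # which is what makes exp(-nll) integrate to 1 over x.
    rho = 0.5 * (x / scale) ** 2
    log_partition = torch.log(scale) + 0.5 * math.log(2.0 * math.pi)
    return rho + log_partition

x = torch.tensor([0.3, -1.2, 2.5])
scale = torch.tensor(0.7)
nll = gaussian_nll(x, scale)
# Matches the exact Gaussian NLL, confirming log(scale) is the normalizer,
# not a regularization choice.
print(torch.allclose(nll, -torch.distributions.Normal(0.0, scale).log_prob(x)))  # True
```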
Ok, I see, thank you very much. Another question: I see that the adaptiveness can be realized through the negative log-likelihood in Equation (16). However, why is that reasonable? I note that you have a qualitative analysis on the first page and in Figure 2, but what is the fundamental theory behind the idea?
This is a good idea if 1) you believe in maximum likelihood estimation (https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) and 2) you want to maximize the likelihood of the observed data you're training on.
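A toy sketch of that principle (again using the Gaussian alpha = 2 special case rather than the paper's general distribution): minimizing the mean NLL over the scale by gradient descent recovers the maximum-likelihood scale, which for this case is just the RMS of the residuals. The same mechanism, applied to the general NLL, is what lets the shape and scale adapt to the data.

```python
import math
import torch

torch.manual_seed(0)
residuals = torch.randn(1000) * 0.5  # synthetic residuals with true scale ~0.5

log_scale = torch.zeros(1, requires_grad=True)  # optimize log(scale) so scale stays positive
opt = torch.optim.Adam([log_scale], lr=0.05)
for _ in range(500):
    scale = log_scale.exp()
    # Mean NLL of the Gaussian special case: rho + log(scale) + log(sqrt(2*pi)).
    nll = (0.5 * (residuals / scale) ** 2 + scale.log() + 0.5 * math.log(2.0 * math.pi)).mean()
    opt.zero_grad()
    nll.backward()
    opt.step()

# The minimizer of the mean NLL is the maximum-likelihood scale,
# i.e. the RMS of the residuals in the Gaussian case.
print(log_scale.exp().item(), residuals.pow(2).mean().sqrt().item())
```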
Wonderful! Thank you very much.