
Question about the multi-task loss?

Open huangxGo opened this issue 6 years ago • 7 comments

Hello! Thanks for this nice work! I think there is a small difference between the code for the multi-task loss and the formula in the original paper. In the paper, the weight of each task should be w = torch.exp(-s^2). However, this line of code uses w = torch.exp(-s): https://github.com/lukasliebel/MultiDepth/blob/7478d355d8b7c5da7866fc335597a43073a712c9/train.py#L267 Could you tell me why? Best regards!

huangxGo avatar Dec 26 '19 02:12 huangxGo

> Hello! Thanks for this nice work! I think there is a small difference between the code for the multi-task loss and the formula in the original paper. In the paper, the weight of each task should be w = torch.exp(-s^2). However, this line of code uses w = torch.exp(-s): https://github.com/lukasliebel/MultiDepth/blob/7478d355d8b7c5da7866fc335597a43073a712c9/train.py#L267 Could you tell me why? Best regards!

Oh, I checked the paper "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics" and I understand the reason now!

huangxGo avatar Dec 31 '19 06:12 huangxGo

Hi, could you please tell me how this difference can be explained? Thank you.

Best regards.

vkress avatar Feb 17 '21 12:02 vkress

Hey there,

thanks for your interest in our work! :)

Actually, there is no difference as far as I can see. Equation 10 in Alex's paper tells us to use the joint loss loss_mt = 1/(2*sigma_reg^2) * loss_reg + log(sigma_reg) + 1/sigma_cls^2 * loss_cls + log(sigma_cls). We implement the single-task weighting terms as 0.5 * exp(-log(sigma_reg^2)) = 1/(2*sigma_reg^2) and exp(-log(sigma_cls^2)) = 1/sigma_cls^2, as given in our paper.

Keep in mind that, as noted here: https://github.com/lukasliebel/MultiDepth/blob/7478d355d8b7c5da7866fc335597a43073a712c9/train.py#L264, we optimize for s := log(sigma^2) rather than for sigma as proposed by Alex and mentioned in our paper.
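To make the reparameterization concrete, here is a minimal sketch of the joint loss with s := log(sigma^2) as the learned variable. The function name and signature are hypothetical, not the repo's actual code; only the algebra (0.5*exp(-s) = 1/(2*sigma^2), exp(-s) = 1/sigma^2, log(sigma) = 0.5*s) is taken from the discussion above.

```python
import torch

def multitask_loss(loss_reg, loss_cls, s_reg, s_cls):
    """Hypothetical sketch of Eq. 10 reparameterized with s := log(sigma^2).

    1/(2*sigma_reg^2) = 0.5 * exp(-s_reg)
    1/sigma_cls^2     = exp(-s_cls)
    log(sigma)        = 0.5 * s
    """
    w_reg = 0.5 * torch.exp(-s_reg)  # = 1 / (2 * sigma_reg^2)
    w_cls = torch.exp(-s_cls)        # = 1 / sigma_cls^2
    return w_reg * loss_reg + 0.5 * s_reg + w_cls * loss_cls + 0.5 * s_cls
```

With s_reg = s_cls = 0 (i.e. sigma = 1), this reduces to 0.5 * loss_reg + loss_cls, as expected.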

I hope this helps. Cheers, Lukas

lukasliebel avatar Feb 17 '21 14:02 lukasliebel

Sorry, I did not make myself quite clear. I understand that there is no difference between the implementation and the original paper. But in your own paper you introduce the weighting parameters w_reg = 0.5*exp(-s_reg^2) and w_cls = exp(-s_cls^2). I wonder where the power of 2 comes from. In fact, I have used the definition with the power of 2, and it works well. Without the power, the loss does not work for my problem, because one of my loss functions tends to 0 quite quickly, and in that case the corresponding weighting factor explodes.

Thanks for your help.

vkress avatar Feb 17 '21 14:02 vkress

Oh, alright. My bad. It seems I missed the actual question all along... I'll look into it! I'm not quite sure whether this may be an unfortunate typo in the paper.

But to sort out your issue with the whole thing first: following my paper, you implemented exp(-(log(sigma^2))^2), right? This should cap at w = 1 if I'm not mistaken, which would of course keep your weights from exploding. What if you tackle your problem by increasing the influence of the regularization term instead? In my experiments (not only for this project) I never came across the problem of exploding weighting terms. Are your single-task losses approximately of the same order of magnitude to begin with? This may also be a problem. If not, just apply a constant multiplier to one of them to bring them to a similar level. This has helped me in various settings in the past.
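To see numerically why exp(-s) can explode while exp(-s^2) stays bounded, here is a quick plain-Python comparison (illustrative values only, with s := log(sigma^2)):

```python
import math

# As one task's loss collapses, the optimizer pushes s toward -inf;
# compare the two weighting variants discussed in this thread.
for s in [2.0, 0.0, -2.0, -5.0]:
    w_plain = math.exp(-s)       # 1/sigma^2: grows without bound as s -> -inf
    w_sq = math.exp(-(s ** 2))   # peaks at 1 (at s = 0), decays on both sides
    print(f"s={s:5.1f}  exp(-s)={w_plain:10.3f}  exp(-s^2)={w_sq:.6f}")
```

At s = -5 the plain variant already weighs the task by ~148, while the squared variant is essentially 0.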

What sort of single-task losses are you dealing with?

Thanks for bringing this potential issue up (again), after I apparently missed it the first time ;)

lukasliebel avatar Feb 17 '21 15:02 lukasliebel

You are right, the weight is capped at w = 1, but it then drops back down to 0. I think the reason for this is the regularization term 0.5*s, which continues to decrease. My single-task losses have very different magnitudes. I will follow your advice and adjust the scales.

In fact, my problem is not really a multi-task problem. I am trying to apply the technique to constrain my actual loss function. Specifically, I am using cross-entropy for a classification task. In addition, I want to explicitly penalize predictions of certain classes that could be excluded in advance. For this purpose, my second loss function contains the sum of the predicted probabilities for the discarded classes. Accordingly, the value of the second loss function is much smaller and much easier to learn.
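A minimal sketch of the setup described above, assuming a standard softmax classifier; the function name, signature, and shapes are hypothetical, not taken from either repository:

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, targets, discarded):
    """Cross-entropy plus a penalty on the total probability mass the
    model assigns to classes known to be excluded in advance.

    logits:    (N, C) raw scores, targets: (N,) class indices,
    discarded: list of class indices to penalize.
    """
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=1)
    # Sum of predicted probabilities for the discarded classes,
    # averaged over the batch; typically much smaller than ce.
    penalty = probs[:, discarded].sum(dim=1).mean()
    return ce, penalty
```

Since the penalty is bounded by 1 while the cross-entropy is not, the two terms naturally live on different scales, which matches the issue described here.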

vkress avatar Feb 17 '21 16:02 vkress

Haha, nice to hear that. In fact, I'm doing almost exactly the same thing right now: multi-class classification (where one of the classes is background) + foreground/background separation. This was one of the cases where it helped to bring the losses to a similar order of magnitude. I haven't employed the uncertainty weighting yet, though. Please keep me updated ;) Feel free to message me directly if you prefer; my mail address is in my profile! I would love to exchange some thoughts on that.

lukasliebel avatar Feb 17 '21 16:02 lukasliebel