DINO
Focal cost in matcher.py
The class cost in matcher.py is computed as follows:
# Compute the classification cost.
neg_cost_class = (1 - alpha) * (out_prob ** gamma) * (-(1 - out_prob + 1e-8).log()) # line 79
pos_cost_class = alpha * ((1 - out_prob) ** gamma) * (-(out_prob + 1e-8).log()) # line 80
cost_class = pos_cost_class[:, tgt_ids] - neg_cost_class[:, tgt_ids] # line 81
The definition of focal loss is as follows: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability for the ground-truth class.
My question is: why are both neg_cost_class and pos_cost_class computed and combined? The cost should always be computed on a target class and the corresponding predicted probability for that same class, which means y is always 1. The way I understand it, your code basically always includes both branches. Did I misinterpret something?
Thank you for the great work! It helped me a lot.
We follow previous works in using the focal cost, which contains both positive and negative parts. We include the negative part because we do not only expect a prediction to have a high probability for the positive class; we also expect it to have low probabilities for the negative classes. Note that we use sigmoid to output probabilities, so each class probability is independent of the others, and we therefore need to explicitly push down the negative ones.
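For intuition, here is a minimal standalone sketch of that cost computation on toy values (the shapes, probabilities, and target ids below are made up for illustration; alpha and gamma are the usual focal defaults):

import torch

# Toy setup (illustrative only): 3 queries, 4 classes, 2 GT boxes.
# out_prob holds per-class sigmoid probabilities, as in matcher.py.
alpha, gamma = 0.25, 2.0
out_prob = torch.tensor([[0.9, 0.1, 0.2, 0.1],
                         [0.2, 0.8, 0.1, 0.1],
                         [0.3, 0.3, 0.3, 0.3]])
tgt_ids = torch.tensor([0, 1])  # classes of the two target boxes

# Same formulas as lines 79-81: both terms are computed for every class,
# then both are indexed at the target classes, so each (query, target) entry
# is pos_cost - neg_cost evaluated at the same target-class probability.
neg_cost_class = (1 - alpha) * (out_prob ** gamma) * (-(1 - out_prob + 1e-8).log())
pos_cost_class = alpha * ((1 - out_prob) ** gamma) * (-(out_prob + 1e-8).log())
cost_class = pos_cost_class[:, tgt_ids] - neg_cost_class[:, tgt_ids]

print(cost_class.shape)  # torch.Size([3, 2]): one cost per (query, target) pair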
I understand that this is the reason we generally use the negative branch in the focal loss. What I don't understand is how you do that in your implementation, since in line 81 you, rightfully, filter out the negative samples, yet you still include the negative branch. For comparison, this is the original DETR implementation (w/o focal loss):
# Compute the classification cost. Contrary to the loss, we don't use the NLL,
# but approximate it in 1 - proba[target class].
# The 1 is a constant that doesn't change the matching, it can be omitted.
cost_class = -out_prob[:, tgt_ids]
They use only the positive branch of the NLL (actually an approximation of the NLL), and that makes sense to me.
Any thoughts?
@berceanbogdan We do not filter out negative examples. cost_class is a matrix that contains both positive and negative costs.
If only the positive cost term is used, then when the target positive probability is close to 1, its gradient is close to zero and it has almost no effect on class matching. If the negated positive probability is used instead, its gradient magnitude is always 1.
The good thing about the focal cost is that it is very sensitive when p is close to 0 or 1, which means classes are matched better in the later stages of training.
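To make that concrete, here is a rough numeric check (toy probability values; alpha and gamma as in the snippet above) comparing the gradient of the positive focal term alone, the full focal cost (positive minus negative), and the plain DETR-style -p cost:

import torch

# Rough numeric check (illustrative values only).
alpha, gamma, eps = 0.25, 2.0, 1e-8

def pos_only(p):
    # Positive focal term alone.
    return alpha * ((1 - p) ** gamma) * (-(p + eps).log())

def full_focal(p):
    # Full focal cost as in line 81: positive term minus negative term.
    return pos_only(p) - (1 - alpha) * (p ** gamma) * (-(1 - p + eps).log())

def plain(p):
    # DETR-style cost: just the negated probability of the target class.
    return -p

def grad_of(cost_fn, p_val):
    # d(cost)/dp at a single probability value.
    p = torch.tensor(p_val, requires_grad=True)
    cost_fn(p).backward()
    return p.grad.item()

for p_val in (0.01, 0.5, 0.99):
    print(f"p={p_val}: "
          f"pos-only grad={grad_of(pos_only, p_val):+.3f}, "
          f"full-focal grad={grad_of(full_focal, p_val):+.3f}, "
          f"-p grad={grad_of(plain, p_val):+.3f}")
# Expected pattern: the positive-only term flattens out as p -> 1,
# -p is constant at -1, and the full focal cost stays steep near p = 0 and p = 1.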