DLR Loss and Zero Gradient
Hi, I am trying to understand how DLR loss works, and in particular, I am running some tests with Batch Size = 1.
The DLR loss defined in https://github.com/fra31/auto-attack/blob/master/autoattack/autopgd_base.py is as follows:
```python
def dlr_loss(self, x, y):
    x_sorted, ind_sorted = x.sort(dim=1)
    ind = (ind_sorted[:, -1] == y).float()
    u = torch.arange(x.shape[0])
    return -(x[u, y] - x_sorted[:, -2] * ind - x_sorted[:, -1] * (1. - ind)) / (x_sorted[:, -1] - x_sorted[:, -3] + 1e-12)
```
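If I read the AutoAttack paper correctly, this implements

$$\mathrm{DLR}(x, y) = -\frac{z_y - \max_{i \neq y} z_i}{z_{\pi_1} - z_{\pi_3}},$$

where $z$ are the logits and $z_{\pi_1} \geq z_{\pi_2} \geq \dots$ are the logits sorted in decreasing order.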
When using batch size 1, x contains the logits of the sample (shape [1, N] for N classes), while y is the ground-truth label of shape [1].
Now, let's suppose the logits (x) are something like the following (N=4 classes):
[[3, 2, 5, 1]]
and that GT label (y) = 1
then:
x_sorted = [1, 2, 3, 5]
ind_sorted = [3, 1, 0, 2]
ind = 0. (since the index of the top logit is 2, which is not the GT label y = 1)
which leads to:
x[u, y] = 2
x_sorted[:,-2] * ind = 0
x_sorted[:,-1] * (1 - ind) = 5
x_sorted[:, -1] = 5
x_sorted[:, -3] = 2
This causes the loss to be:
-(2 - 0 - 5) / (5 - 2 + 1e-12) = 3 / 3 = 1.
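A small standalone script that reproduces this (re-implementing dlr_loss as a free function, i.e. dropping `self`) confirms that the loss is 1 and the gradient with respect to the logits vanishes:

```python
import torch

# Standalone re-implementation of dlr_loss to reproduce the example above
# and inspect the gradient.
def dlr_loss(x, y):
    x_sorted, ind_sorted = x.sort(dim=1)
    ind = (ind_sorted[:, -1] == y).float()
    u = torch.arange(x.shape[0])
    return -(x[u, y] - x_sorted[:, -2] * ind - x_sorted[:, -1] * (1. - ind)) / (
        x_sorted[:, -1] - x_sorted[:, -3] + 1e-12)

x = torch.tensor([[3., 2., 5., 1.]], requires_grad=True)  # logits, batch size 1
y = torch.tensor([1])                                      # GT label = third-highest logit

loss = dlr_loss(x, y)
loss.sum().backward()

print(loss)    # tensor([1.0000], grad_fn=...)
print(x.grad)  # ~zero for every logit, e.g. tensor([[0., 0., 0., 0.]])
```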
This, in turn, causes the gradient to be zero, which may lead to unreliable behavior. However, check_zero_gradients is only called at lines 285-287, not in the main loop (e.g. at line 380).
In short: if the ground-truth label is the third-largest predicted logit, the loss becomes one and the gradient vanishes. Is my understanding correct? Is this a known issue? To prevent it, one option might be to replace x_sorted[:, -3] with x_sorted[:, -4] in the case where the GT label is the third-highest logit (see the sketch below). Does this make sense?
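Concretely, the change I have in mind looks roughly like the sketch below (untested; it assumes at least 4 classes so that x_sorted[:, -4] exists, and otherwise keeps the original formula):

```python
import torch

def dlr_loss_modified(x, y):
    x_sorted, ind_sorted = x.sort(dim=1)
    u = torch.arange(x.shape[0])
    ind = (ind_sorted[:, -1] == y).float()        # 1 if the GT label is the top logit
    ind_third = (ind_sorted[:, -3] == y).float()  # 1 if the GT label is the third-highest logit
    # Fall back to the fourth-highest logit in the denominator whenever the GT label
    # is the third-highest one, so the denominator can no longer cancel the numerator.
    denom = (x_sorted[:, -1]
             - x_sorted[:, -3] * (1. - ind_third)
             - x_sorted[:, -4] * ind_third
             + 1e-12)
    return -(x[u, y] - x_sorted[:, -2] * ind - x_sorted[:, -1] * (1. - ind)) / denom
```

For the example above this would give -(2 - 5) / (5 - 1) = 0.75 instead of 1, and the gradient would no longer be zero.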