RepDistiller
questions about ContrastMemory
Hi, according to Eq. 19 in the paper, the linear transforms gT and gS are applied to the teacher and student features, respectively, i.e., gT(t) and gS(s).
But in your code, the teacher transform gT is applied to the student feature, gT(s), and the student transform gS is applied to the teacher feature, gS(t), like:
```python
out_v2 = torch.bmm(weight_v1, v2.view(batchSize, inputSize, 1))  # v2 scored against v1's bank
out_v2 = torch.exp(torch.div(out_v2, T))                         # exp(. / T), T = temperature
out_v1 = torch.bmm(weight_v2, v1.view(batchSize, inputSize, 1))  # v1 scored against v2's bank
out_v1 = torch.exp(torch.div(out_v1, T))
```
and thus your contrastive loss becomes the sum ContrastLoss(out_v1) + ContrastLoss(out_v2).
I wonder why you did this, instead of computing a single output gT(t)*gS(s)/τ (with τ the temperature) as in Eq. 19 and a single ContrastLoss(out).
Thanks.
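For anyone comparing the two formulations, here is a minimal sketch of the difference. All names (g_t, g_s, memory_v1, memory_v2, tau) are illustrative stand-ins, and full memory banks replace the K+1 sampled entries the repo actually uses:

```python
import torch

batch, feat_dim, embed_dim, n_data = 4, 128, 64, 100
tau = 0.07  # temperature; illustrative value

# illustrative transforms g^T, g^S and raw features t, s (not repo names)
g_t = torch.nn.Linear(feat_dim, embed_dim)   # teacher transform g^T
g_s = torch.nn.Linear(feat_dim, embed_dim)   # student transform g^S
t = torch.randn(batch, feat_dim)             # teacher features
s = torch.randn(batch, feat_dim)             # student features

# Eq. 19 style: a single score per pair, exp(gT(t) . gS(s) / tau)
out = torch.exp((g_t(t) * g_s(s)).sum(dim=1) / tau)

# ContrastMemory style: each embedding is scored against a memory bank
# of the other view's embeddings (the banks stand in for the sampled
# weight_v1 / weight_v2 in the repo's bmm calls)
memory_v1 = torch.randn(n_data, embed_dim)      # bank of student embeddings
memory_v2 = torch.randn(n_data, embed_dim)      # bank of teacher embeddings
out_v2 = torch.exp(memory_v1 @ g_t(t).T / tau)  # teacher emb. vs. student bank
out_v1 = torch.exp(memory_v2 @ g_s(s).T / tau)  # student emb. vs. teacher bank
```

So the code optimizes two NCE-style objectives, one anchored on each view, instead of the single objective written in Eq. 19.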
I had the same question. As I understand it, ContrastLoss(out_v2) will not produce any gradients, given that the teacher is not being trained.
The last fc layer on the teacher side is being trained.
Yes, you are right, thanks.
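To make the resolution concrete, here is a quick gradient check (a sketch; embed_t, t_feat, and memory_v1 are illustrative names, assuming a trainable linear projection on top of the frozen teacher, as in the repo). It shows that a loss on out_v2 does update the teacher-side fc layer:

```python
import torch

feat_dim, embed_dim, tau = 128, 64, 0.07

# the teacher backbone is frozen, so its features carry no gradient,
# but the fc projection on top of it (the g^T of Eq. 19) is trainable
embed_t = torch.nn.Linear(feat_dim, embed_dim)  # teacher-side fc layer
t_feat = torch.randn(2, feat_dim)               # frozen backbone output
v2 = embed_t(t_feat)                            # teacher embedding

# stand-in for bmm(weight_v1, v2): score against a fixed student bank
memory_v1 = torch.randn(10, embed_dim)
out_v2 = torch.exp(memory_v1 @ v2.T / tau)

# any scalar loss on out_v2 back-propagates into embed_t
out_v2.mean().backward()
print(embed_t.weight.grad is not None)  # True: the teacher fc layer does train
```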