knowledge-distillation-pytorch
An issue on loss function
I suggest that both training loss functions, with and without KD, should apply a softmax, because the model outputs are raw logits with no softmax applied. Like this:
KD_loss = nn.KLDivLoss()(F.log_softmax(outputs/T, dim=1), F.softmax(teacher_outputs/T, dim=1)) * (alpha * T * T) + \
          F.cross_entropy(F.softmax(outputs, dim=1), labels) * (1. - alpha)
return nn.CrossEntropyLoss()(F.softmax(outputs, dim=1), labels)
Also, why is the first part of the KD loss function multiplied by 2?
One more thing: multiplying by T*T is not necessary if we distill using only soft targets.
Reference: Distilling the Knowledge in a Neural Network
I was wondering whether multiplying by T squared is really helpful. If T=20, the soft loss will dominate the total loss. Also, there is no need to add an extra softmax for the hard target, since it is already built into nn.functional.cross_entropy. @lhyfst
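To make the point about the extra softmax concrete, here is a minimal framework-free sketch. The helper names (`log_softmax`, `softmax`, `cross_entropy`) and the example logits are my own, written to mirror the single-sample behaviour of nn.functional.cross_entropy, which expects raw logits and applies log-softmax internally. It is an illustration under those assumptions, not the repo's code.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def softmax(logits):
    return [math.exp(v) for v in log_softmax(logits)]

def cross_entropy(logits, label):
    # Hypothetical stand-in for F.cross_entropy on one sample:
    # it takes raw logits and applies log-softmax itself.
    return -log_softmax(logits)[label]

logits = [10.0, 0.0, 0.0]  # confident, correct prediction for class 0
label = 0

loss_correct = cross_entropy(logits, label)          # logits passed directly
loss_double = cross_entropy(softmax(logits), label)  # softmax applied twice

print(f"logits only:    {loss_correct:.6f}")
print(f"double softmax: {loss_double:.6f}")
```

With the extra softmax, the inputs are squeezed into [0, 1] before the internal log-softmax, so even a confident and correct prediction keeps a sizeable loss: for K classes it cannot drop below log(1 + (K-1)/e). Passing the logits directly lets the loss approach zero.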
As @erichhhhho pointed out, there is indeed no need to manually add an extra softmax. From the reference paper, it looks like T^2 is only required when using BOTH hard and soft targets.
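On the T^2 question, the paper's argument is that the gradients produced by the soft targets scale as 1/T^2, so the factor keeps the soft term's contribution comparable to the hard term when T changes. The sketch below checks this numerically using the analytic gradient of KL(p || q) with respect to the student logits, which is (q - p) / T. The logits and helper names are made up for illustration.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of floats.
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_grad(student_logits, teacher_logits, T):
    # Gradient of KL(p || q) w.r.t. the student logits, where
    # q = softmax(student / T) and p = softmax(teacher / T).
    q = softmax(student_logits, T)
    p = softmax(teacher_logits, T)
    return [(qj - pj) / T for qj, pj in zip(q, p)]

def l1_norm(vec):
    return sum(abs(v) for v in vec)

student = [2.0, 1.0, 0.1]   # hypothetical student logits
teacher = [3.0, 0.5, -1.0]  # hypothetical teacher logits

for T in (1.0, 5.0, 20.0, 40.0):
    raw = l1_norm(soft_grad(student, teacher, T))
    print(f"T={T:5.1f}  raw={raw:.6f}  raw*T^2={raw * T * T:.4f}")
```

At large T the raw gradient shrinks by roughly a factor of four each time T doubles, while the T^2-scaled gradient stays nearly constant. So at T=20 the factor mostly cancels the shrinkage rather than letting the soft loss dominate. When only soft targets are used there is no second term to balance against, and the factor acts like a change of learning rate.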
Thank you, everybody! So, why is the first part of the KD loss function multiplied by 2?
As per distiller, KD_Loss is effectively the following equation:
α * kl_divergence + β * cross_entropy
Hinton et al. 2015 originally used a weighted average, i.e. α = 1 - β, but this is not strictly necessary: α and β can be arbitrary and don't need to sum to 1. In this particular MNIST example, the relationship is α = 2 * (1 - β); maybe they were experimenting with a stronger reliance on kl_div.
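For anyone who wants to compare the two weightings, here is a rough single-sample sketch of the α/β form in plain Python. The function name `kd_loss`, the choice to fold T^2 into the soft term, and the example logits are my assumptions, not the repo's or distiller's actual implementation.

```python
import math

def log_softmax(logits, T=1.0):
    # Numerically stable, temperature-scaled log-softmax.
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]

def kd_loss(student_logits, teacher_logits, label, T, alpha, beta):
    # Soft term: KL(teacher || student) at temperature T, scaled by T^2
    # (an assumption made here to match the code earlier in this thread).
    log_q = log_softmax(student_logits, T)
    log_p = log_softmax(teacher_logits, T)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    soft = T * T * kl
    # Hard term: cross-entropy on the raw logits, no extra softmax.
    ce = -log_softmax(student_logits)[label]
    return alpha * soft + beta * ce

student = [2.0, 1.0, 0.1]   # hypothetical logits
teacher = [3.0, 0.5, -1.0]
beta = 0.1

averaged = kd_loss(student, teacher, 0, T=4.0, alpha=1 - beta, beta=beta)
doubled = kd_loss(student, teacher, 0, T=4.0, alpha=2 * (1 - beta), beta=beta)
print(f"alpha = 1 - beta:       {averaged:.4f}")
print(f"alpha = 2 * (1 - beta): {doubled:.4f}")
```

The only difference between the two calls is that the soft term counts twice in the second one, which is the "stronger reliance on kl_div" mentioned above.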