Nasty-Teacher
The asymmetry of KL divergence.
Hi, I noticed that in 'train_nasty.py', when the KL divergence is computed, the normal teacher's output (output_stu) is treated as the input and the nasty teacher's output (output_tch) as the target. However, in conventional KD, the fixed model (the teacher) is usually the target, while the model being updated is the input.
I wonder why you adopt the opposite order in the KL loss function. Is there a particular reason? Thanks!
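For reference, this is the ordering I would expect in a standard KD loss in PyTorch (a minimal sketch with placeholder tensors and temperature, not code from this repo):

```python
import torch
import torch.nn.functional as F

T = 4.0  # example distillation temperature (placeholder value)
student_logits = torch.randn(8, 10)   # model being updated
teacher_logits = torch.randn(8, 10)   # fixed teacher model

# F.kl_div(input, target) computes KL(target || input), where `input` holds
# log-probabilities and `target` holds probabilities, so in standard KD the
# trainable student is the input and the frozen teacher is the target:
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),
    F.softmax(teacher_logits.detach() / T, dim=1),
    reduction="batchmean",
) * (T * T)
```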
Thanks for asking. I am sorry for the ambiguous variable names. Nevertheless, the variable names do not affect the results in our paper.
Since we aim to build a nasty teacher model, I named the output of the nasty teacher (the model we want to update) output_tch (https://github.com/VITA-Group/Nasty-Teacher/blob/main/train_nasty.py#L56). The output of the fixed model is named "output_stu" simply because, at the very beginning of this project, I tried to use a student network here and co-train the two networks together; that idea did not work out, but I kept the same variable names when trying my other ideas.
Maybe I should rename the output of the fixed model (output_stu in https://github.com/VITA-Group/Nasty-Teacher/blob/main/train_nasty.py#L64) to output_adv to make things clearer.
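With that renaming, the adversarial term might look roughly like the sketch below (the temperature T and weight omega are placeholder values, and this is not the exact code from train_nasty.py):

```python
import torch
import torch.nn.functional as F

def nasty_teacher_loss(output_tch, output_adv, labels, T=4.0, omega=0.04):
    """Sketch of the nasty-teacher objective with the clarified names.

    output_tch: logits of the nasty teacher (the model being updated).
    output_adv: logits of the fixed, pretrained adversarial network.
    T, omega:   assumed temperature and adversarial weight, placeholders only.
    """
    # Ordinary cross-entropy keeps the nasty teacher accurate on the labels.
    ce = F.cross_entropy(output_tch, labels)

    # F.kl_div(input, target) computes KL(target || input).  Passing the fixed
    # network as `input` and the trainable nasty teacher as `target` gives the
    # ordering discussed above.
    kl = F.kl_div(
        F.log_softmax(output_adv.detach() / T, dim=1),
        F.softmax(output_tch / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # The KL term is subtracted, i.e. maximized, pushing the nasty teacher's
    # soft output away from the fixed network's while preserving accuracy.
    return ce - omega * kl
```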