zhazhaming
This ImageNet model was downloaded from the most-starred MobileNetV2 repo on GitHub. I haven't had time to check it carefully because I'm busy with work at my company...
If alpha=0.95 and t=6, then alpha*t*t (34.2) would be far larger than 1-alpha (0.05). I think the student is essentially just mimicking the teacher's output, since the loss depends almost entirely on the KL divergence...
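A quick sanity check on those numbers. This is only a minimal sketch, assuming the loss has the form alpha * t^2 * KL + (1 - alpha) * CE, which is how the comment above reads. The function name `kd_loss_weights` is made up for illustration and is not from the repo.

```python
def kd_loss_weights(alpha: float, temperature: float) -> tuple[float, float]:
    """Return (soft_weight, hard_weight) for a loss assumed to be
    alpha * T^2 * KL(student, teacher) + (1 - alpha) * CE(student, labels)."""
    soft_weight = alpha * temperature * temperature  # weight on the KL term
    hard_weight = 1.0 - alpha                        # weight on the label term
    return soft_weight, hard_weight


# Values taken from the comment above.
soft_w, hard_w = kd_loss_weights(alpha=0.95, temperature=6.0)
print(f"soft weight = {soft_w:.2f}, hard weight = {hard_w:.2f}")
print(f"ratio soft/hard = {soft_w / hard_w:.0f}")
```

With these settings the KL term is weighted several hundred times more heavily than the label term, so the label term contributes very little unless its raw value is much larger.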
Thanks for answering. I need to take a detailed look at your report and run some experiments.
I split the loss into two parts: one is the cross entropy between the student outputs and the teacher outputs at temperature T, and the other is the cross entropy between the student outputs and the labels...
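Roughly, the split described above could look like the following. This is a pure-Python sketch for a single example, not the actual training code: the helper names (`softmax`, `cross_entropy`, `split_kd_loss`) and the example logits are all hypothetical, and how the two parts are weighted and combined is left out because it is not stated here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs):
    """H(target, pred) = -sum_i target_i * log(pred_i)."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred_probs) if t > 0)


def split_kd_loss(student_logits, teacher_logits, label, temperature):
    """Return (soft_part, hard_part) for one example.

    soft_part: cross entropy between student and teacher outputs,
               both softened with the same temperature.
    hard_part: ordinary cross entropy between student outputs and the label.
    """
    soft_student = softmax(student_logits, temperature)
    soft_teacher = softmax(teacher_logits, temperature)
    soft_part = cross_entropy(soft_teacher, soft_student)

    hard_student = softmax(student_logits)  # temperature 1 for the label term
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_part = cross_entropy(one_hot, hard_student)
    return soft_part, hard_part


# Made-up logits for a 3-class example.
soft_part, hard_part = split_kd_loss(
    student_logits=[2.0, 0.5, -1.0],
    teacher_logits=[3.0, 1.0, -2.0],
    label=0,
    temperature=6.0,
)
print(f"soft part = {soft_part:.4f}, hard part = {hard_part:.4f}")
```

Keeping the two parts separate like this makes it easy to log each one and see which term actually dominates during training.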