Teacher-free-Knowledge-Distillation
The baseline of ResNet18 on CIFAR100 is relatively low
Hi, first of all, thanks for your work on interpreting the relationship between KD and LSR. However, the ResNet18 baseline on CIFAR-100 is much lower than the pytorch-cifar100 implementation, which may be caused by the modified ResNet. In fact, based on pytorch-cifar100, without any extra augmentations, the top-1 accuracy can reach up to 78.05% in my previous experiments. So I have some doubt about the performance gain from self-distillation. I have run an experiment with the distillation, which improves the baseline from 77.96% to 78.45%. It does improve performance, yet not as conspicuously as the paper claims.
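For context, the self-distillation discussed here is the usual KD objective with a pretrained copy of the same network acting as the teacher. Below is a minimal sketch of that loss in PyTorch; the function name, default values of alpha and temperature are illustrative assumptions, not the repo's exact API or the paper's settings.

```python
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, targets,
                           alpha=0.1, temperature=4.0):
    """KD-style loss where the 'teacher' is a frozen, pretrained copy of the student.

    alpha and temperature are the hyper-parameters discussed in this thread;
    the defaults here are placeholders, not the values used in the paper.
    """
    # Hard-label cross-entropy against the ground-truth targets
    ce_loss = F.cross_entropy(student_logits, targets)

    # Soft-label KL divergence against the frozen teacher, scaled by T^2 as usual
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return (1.0 - alpha) * ce_loss + alpha * kd_loss
```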
Hi,
Q. "In fact, based on the pytorch-cifar100, without any extra augmentations, the top1 accuracy can achieve up to 78.05% in my previous experiments."
A: I also tried that repo, but likewise ResNet18 only achieves around 76%, similar to our paper. The following are the results from pytorch-cifar100, in which ResNet18 achieved 75.61%, not 78%.
Q. "And I have conducted an experiment using the distillation, which improves the baseline from 77.96% to 78.45%." A: Did you tune your hyper-parameters when using the distillation, because if you only try some hyper-parameters, it's normal that the improvement is not significant.
By the way, we don't use extra augmentations for our method; it is still a fair comparison because we also don't use extra augmentations for the baselines (original KD or LSR).
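For clarity, "no extra augmentations" here is assumed to mean the standard CIFAR crop/flip pipeline only, with nothing like Cutout, Mixup, or AutoAugment on top. A sketch of that standard pipeline is below; the normalization statistics are the commonly used CIFAR-100 values, not necessarily the exact ones in either repository.

```python
import torchvision.transforms as transforms

# Standard CIFAR-100 training transform: random crop + horizontal flip only.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])
```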
Hi, here is my training log; you can reproduce the result using the repo, which achieves ~78.05% top-1 accuracy without extra augmentations. I think the distillation does work, yet it is not conspicuous: it only improves about 0.5% in my setting.
Hi, your implementation is different from the original pytorch-cifar100; the original pytorch-cifar100 cannot achieve ~78.05% top-1 accuracy. As for the improvement from our method, it also depends on your hyper-parameters, and I don't know whether you searched them or not, so an improvement of about 0.5% with your implementation is normal.