mean-teacher
Cannot reproduce the error rate 10.08+/-0.41 with mean teacher + ResNet on 1000 label
Dear authors,
We tried to reproduce mean teacher with the shakeshake26 network on CIFAR-10 with 1000 labels, but we could not reach the error rate of 10.08+/-0.41 reported in Appendix A for that setting. Is the number correct?
Thank you, Hai
It should be. What did you do to reproduce it and what results did you get?
Thank you! Were the results in the paper obtained with the TensorFlow or the PyTorch implementation? I am using the PyTorch code to train the shakeshake26 network on 1000 labels. I set the ema-decay to 0.97 (constant over all epochs), weight decay to 2e-4, and learning rate to 0.2 with cosine rampdown, and trained for 180 and 210 epochs. These settings are from the appendix. It only reaches ~81%. It also seems that the PyTorch implementation does not add Gaussian noise to the input images.
Thank you, Hai
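For reference, the cosine rampdown mentioned above can be sketched as follows. This is a minimal illustration of the schedule shape, not the repo's exact code; the base learning rate 0.2 and the 210-epoch horizon come from the settings quoted above.

```python
import math

def cosine_rampdown(current_epoch, rampdown_epochs):
    """Cosine rampdown multiplier: 1.0 at epoch 0, decaying to 0.0
    at rampdown_epochs. Multiply the base learning rate by this value."""
    assert 0 <= current_epoch <= rampdown_epochs
    return 0.5 * (math.cos(math.pi * current_epoch / rampdown_epochs) + 1.0)

base_lr = 0.2
lr_start = base_lr * cosine_rampdown(0, 210)    # full learning rate at the start
lr_end = base_lr * cosine_rampdown(210, 210)    # decays to zero at the end
```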
Hello Tarvaina,
I found that I had missed one setting: the MSE cost between the two different logit outputs of the student model. The appendix says the weight of this MSE between the two logits is set to 0.01. However, the paper also says this MSE cost is equivalent to the ramp-up for alpha. So is this MSE cost necessary for reproducing the results?
Thank you, Hai
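The cost in question can be sketched in plain Python as below. The function name and the example logit values are illustrative; only the 0.01 weight comes from the appendix as quoted above.

```python
def mse(logits_a, logits_b):
    """Mean squared error between two logit vectors (plain-Python sketch)."""
    assert len(logits_a) == len(logits_b)
    return sum((a - b) ** 2 for a, b in zip(logits_a, logits_b)) / len(logits_a)

# The appendix weights this cost by 0.01 when adding it to the total loss.
logit_cost_weight = 0.01
penalty = logit_cost_weight * mse([1.0, 2.0], [1.0, 0.0])  # 0.01 * 2.0 = 0.02
```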
Did you try running pytorch/experiments/cifar10_test.py? It contains all the PyTorch CIFAR experiments from the paper. You may want to comment out the runs other than the 1000-label case.
Thank you Tarvaina! I have not run this script yet. I will try cifar10_test.py
I have tried cifar10_test.py. With 4000 labels it reaches the reported performance. However, with 1000 labels it only reaches 82.33%. I kept all the hyperparameters in the code unchanged and switched between 4000 and 1000 labels by commenting out one of them. I still cannot get close to 89.92%. Can you guess what might explain the 82.33%? I found that if I train with 1000 labels for more epochs, e.g. 410 epochs, it reaches 88.87%, but it still does not get to 89.92%.
Thanks!
Are you using 4 GPUs? The experiments were run with 4, and the batch sizes and results depend on the number of GPUs.
(A little late for this, but the way to reproduce the results is described at the bottom of pytorch/README.md by the way.)
Now I can reproduce the result on 1000 labels. With the hyperparameter settings in README.md, I trained on one GPU for a long time (450 epochs), and it finally reaches ~89.91%. :)
I am curious why training on 4 GPUs requires fewer epochs. Could you kindly give me some hints about this? Thank you very much!
Is it because, when training on multiple GPUs, the loss is computed on each GPU separately and then gathered and averaged together? Could that averaged loss be larger than the average loss would have been if everything were computed on one GPU with sufficient memory, so that training on multiple GPUs effectively increases the learning rate to some extent? Thanks!
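For what it's worth, when each GPU receives an equal-sized chunk of the batch, averaging the per-GPU mean losses is arithmetically identical to taking one global mean, so that step alone would not change the loss magnitude. A quick plain-Python check with illustrative numbers:

```python
losses = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]  # per-example losses, batch of 8

# One GPU: a single mean over the whole batch.
single_gpu = sum(losses) / len(losses)

# Four GPUs: mean within each chunk of 2, then mean of the per-GPU means.
chunks = [losses[i:i + 2] for i in range(0, len(losses), 2)]
per_gpu_means = [sum(c) / len(c) for c in chunks]
multi_gpu = sum(per_gpu_means) / len(per_gpu_means)

assert abs(single_gpu - multi_gpu) < 1e-12  # identical for equal-sized chunks
```

Differences between the 1-GPU and 4-GPU runs would therefore have to come from elsewhere, e.g. per-device batch statistics rather than the loss averaging itself.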