
Cannot reproduce the error rate 10.08+/-0.41 with mean teacher + ResNet on 1000 labels

Open · SHMCU opened this issue 5 years ago · 9 comments

Dear authors,

We tried to reproduce mean teacher with the shakeshake26 network on CIFAR-10 with 1000 labels, but we could not reach the error rate of 10.08+/-0.41 reported in Appendix A for CIFAR-10 with 1000 labels. Is that number correct?

Thank you, Hai

SHMCU avatar Nov 08 '18 23:11 SHMCU

It should be. What did you do to reproduce it and what results did you get?

tarvaina avatar Nov 09 '18 07:11 tarvaina

Thank you! Were the results in the paper obtained with the TensorFlow or the PyTorch implementation? I am using the PyTorch code to train the shakeshake26 network on 1000 labels. I set the EMA decay to 0.97 (constant over all epochs), weight decay to 2e-4, and the learning rate to 0.2 with cosine rampdown, and trained for 180 and 210 epochs. These settings all come from the appendix. The model only reaches ~81%. It also seems that the PyTorch implementation does not add Gaussian noise to the input images.
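For reference, the teacher update and learning-rate schedule I used look roughly like this (my own minimal sketch with my own names, not the repo's code):

```python
import math
import torch

def update_teacher(student, teacher, ema_decay=0.97):
    """Exponential-moving-average update of the teacher from the student weights."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(ema_decay).add_(s_param, alpha=1 - ema_decay)

def cosine_rampdown_lr(base_lr, epoch, rampdown_epochs):
    """Cosine rampdown of the learning rate: base_lr at epoch 0, approaching 0 at rampdown_epochs."""
    return base_lr * 0.5 * (math.cos(math.pi * epoch / rampdown_epochs) + 1.0)

# Example usage with base_lr = 0.2, recomputed at the start of every epoch:
#   for group in optimizer.param_groups:
#       group["lr"] = cosine_rampdown_lr(0.2, epoch, rampdown_epochs)
```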

Thank you, Hai

SHMCU avatar Nov 10 '18 04:11 SHMCU

Hello Tarvaina,

I found I had missed one setting: the MSE between the two different logits output by the student model. The appendix says the cost of the MSE between the two logits is set to 0.01. However, the paper also says this MSE cost is equivalent to the ramp-up for alpha. So is this MSE cost necessary for reproducing the results?
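To be explicit, the cost I mean is the small penalty tying the student's two logit heads together, roughly like this sketch (my own approximation; the exact form in the code may differ):

```python
import torch.nn.functional as F

def logit_distance_cost(class_logits, cons_logits, weight=0.01):
    """MSE penalty keeping the student's two output heads close.

    class_logits: the head used for the supervised classification loss
    cons_logits:  the head compared against the teacher for the consistency loss
    """
    return weight * F.mse_loss(class_logits, cons_logits, reduction="mean")
```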

Thank you, Hai

SHMCU avatar Nov 12 '18 17:11 SHMCU

Did you try running pytorch/experiments/cifar10_test.py? It contains all the pytorch CIFAR experiments for the paper. You may want to comment out the other runs besides the 1000-label case.

tarvaina avatar Nov 12 '18 19:11 tarvaina

Thank you Tarvaina! I have not run this script yet. I will try cifar10_test.py

SHMCU avatar Nov 12 '18 19:11 SHMCU

I have tried cifar10_test.py. For 4000 labels it reaches the reported performance, but on 1000 labels it only reaches 82.33%. I kept all the hyperparameters in the code unchanged and switched between the 4000-label and 1000-label runs by commenting one of them out. I still cannot get close to 89.92%. Can you guess what might cause the 82.33%? I found that if I train with 1000 labels for more epochs, e.g. 410 epochs, it reaches 88.87%, but it still does not get to 89.92%.

Thanks!

SHMCU avatar Nov 18 '18 21:11 SHMCU

Are you using 4 GPUs? The experiments were run with 4 GPUs, and both the batch sizes and the results depend on the number of GPUs.
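(To illustrate the dependence: assuming the usual torch.nn.DataParallel wrapping, the batch gets split across the visible GPUs, so the per-GPU minibatch changes with the GPU count. A minimal sketch with hypothetical sizes, not the repo's defaults:)

```python
import torch
import torch.nn as nn

# Hypothetical model and batch sizes, only to illustrate how the batch is split.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.DataParallel(nn.Linear(32, 10)).to(device)  # replicated on every visible GPU
batch = torch.randn(256, 32).to(device)                # one batch of 256 samples in total

# With 4 GPUs each replica processes 256 / 4 = 64 samples per forward pass;
# with 1 GPU the single replica processes all 256. In the real model this changes
# the per-replica batch statistics (e.g. batch norm) even though the code is the same.
logits = model(batch)
print(logits.shape)  # torch.Size([256, 10]) regardless of the GPU count
```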

(A little late for this, but the way to reproduce the results is described at the bottom of pytorch/README.md by the way.)

tarvaina avatar Nov 25 '18 17:11 tarvaina

Now I can reproduce the result on 1000 labels. With the hyperparameter settings in the README.md, I trained on one GPU for quite a long time (450 epochs), and it finally reached ~89.91%. :)

I am curious why training on 4 GPUs requires fewer epochs. Could you kindly give me some hints about this? Thank you very much!

SHMCU avatar Nov 28 '18 21:11 SHMCU

Is that because, when training on multiple GPUs, the loss is computed on each GPU separately and then gathered and averaged, and that averaged loss is actually larger than the loss would be if everything were computed on a single GPU with sufficient memory? So does training on multiple GPUs effectively increase the learning rate to some extent? Thanks!

SHMCU avatar Nov 29 '18 00:11 SHMCU