Losses
Hi, thank you for your great project!
I’m stuck on two problems while trying to test the mean teacher idea as described in your NIPS 2017 presentation on the MNIST dataset, using a simple convnet from the official PyTorch examples together with your PyTorch code:
- Loss is defined as:

```python
loss = class_loss + consistency_loss + res_loss
```

where

```python
if args.consistency:
    (...)
    consistency_loss = consistency_weight * consistency_criterion(cons_logit, ema_logit) / minibatch_size
    (...)
else:
    consistency_loss = 0
```

but the default value of args.consistency is None, so consistency_loss = 0 by default.
Similarly,

```python
if args.logit_distance_cost >= 0:
    (...)
else:
    (...)
    res_loss = 0
```

but args.logit_distance_cost = -1 by default.
So using the default values switches the mean teacher off and leaves just an ordinary supervised model? Should these losses be complementary or interchangeable? (See the loss sketch below.)
- Training a mean teacher model on MNIST with a non-zero consistency weight and without res_loss, with fixed hyperparameters (https://github.com/rracinskij/mean_teacher/blob/master/mean_teacher.py), gives significantly lower test accuracy (~78% with 1000 labels) than setting the consistency weight to zero (~92%).
I’d greatly appreciate any comments or hints.
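For reference, here is a minimal sketch of how I understand the three terms fitting together, assuming a student with two output heads and a detached EMA teacher. The function name, the NO_LABEL convention, and the exact reductions are my own simplifications, not the repo's code:

```python
import torch.nn.functional as F

NO_LABEL = -1  # assumed marker for unlabeled targets

def total_loss(class_logit, cons_logit, ema_logit, target,
               consistency_weight, logit_distance_cost):
    minibatch_size = class_logit.size(0)

    # Supervised term: cross-entropy over the labeled examples only.
    class_loss = F.cross_entropy(class_logit, target, ignore_index=NO_LABEL)

    # Consistency term: pull the student's consistency head towards the
    # EMA teacher's predictions; detach() keeps gradients off the teacher.
    consistency_loss = consistency_weight * F.mse_loss(
        F.softmax(cons_logit, dim=1),
        F.softmax(ema_logit.detach(), dim=1),
        reduction='sum') / minibatch_size

    # Residual term: keep the two student heads close to each other.
    res_loss = logit_distance_cost * F.mse_loss(class_logit, cons_logit)

    return class_loss + consistency_loss + res_loss
```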
I think that to make the mean teacher work, you have to set the consistency_weight to some value. On the mean teacher PyTorch page it is set to 100.0, and logit_distance_cost is set to 0.01 for the CIFAR-10 experiment. I believe these are necessary to make the mean teacher work.
It looks like logit_distance_cost should be set to a positive value only if the student model has two outputs. And yes, the total loss depends on the teacher model only if consistency_weight is non-zero. But then the accuracy of my minimalistic MNIST implementation is lower than that of a single convnet.
Hi,
So if I understood correctly, your dataset is MNIST with 1000 labeled and 59000 unlabeled examples? And you are using a convolutional network with mean teacher and comparing the results against a bare convolutional network?
Yes, you should set consistency > 0. The best value for consistency may depend on the dataset, the mix of unlabeled/labeled examples per batch, and other things. A bad consistency cost can lead to worse performance than not using one at all. The ema_decay parameter may also affect performance a lot. See Figure 4 in the paper for what these effects look like on SVHN.
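For context, the teacher's weights are an exponential moving average of the student's, updated after every optimizer step. A sketch of that update, along the lines of the repo's update_ema_variables (the exact code may differ):

```python
def update_ema_variables(model, ema_model, alpha, global_step):
    """Update teacher (ema_model) weights as an exponential moving
    average of student (model) weights with decay alpha."""
    # Use a smaller effective decay early on, so the teacher tracks the
    # rapidly-changing student closely during the first steps.
    alpha = min(1 - 1 / (global_step + 1), alpha)
    for ema_param, param in zip(ema_model.parameters(), model.parameters()):
        # teacher <- alpha * teacher + (1 - alpha) * student
        ema_param.data.mul_(alpha).add_(param.data, alpha=1 - alpha)

# e.g. after each optimizer step:
# update_ema_variables(student, teacher, alpha=args.ema_decay, global_step=step)
```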
At the beginning of training, the labeled examples are much more useful than the unlabeled ones. If you use a high consistency cost from the start, it may hurt learning. There are two ways around this: either use a consistency ramp-up or set logit_distance_cost > 0 (and yes, use two outputs from the network). Both are hyperparameters that may require tuning.
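A ramp-up simply multiplies the consistency weight by a factor that grows from 0 to 1 over the first epochs. A sketch of one common sigmoid-shaped schedule (the repo's exact schedule and parameter names may differ):

```python
import math

def sigmoid_rampup(current, rampup_length):
    """Factor in [0, 1] that ramps up over rampup_length epochs."""
    if rampup_length == 0:
        return 1.0
    t = min(max(current / rampup_length, 0.0), 1.0)
    return math.exp(-5.0 * (1.0 - t) ** 2)

# e.g. consistency_weight = args.consistency * sigmoid_rampup(epoch, rampup_epochs)
```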
See also https://github.com/CuriousAI/mean-teacher#tips-for-choosing-hyperparameters-and-other-tuning if you didn't already.
Hi there, I noticed this problem too. As far as I know, the paper mentions only two kinds of loss (classification loss and consistency loss) to optimize, so in what situation does the student model have two outputs? From the code, the difference between the two outputs is that they pass through different fc layers. Is this for representation learning or something similar?
Thanks a lot!
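For anyone else wondering: the two outputs typically come from two separate fc heads on a shared trunk, and res_loss ties them together. A minimal, illustrative sketch (the module structure and dimensions here are assumptions, not the repo's architecture):

```python
import torch.nn as nn

class TwoHeadStudent(nn.Module):
    """Shared trunk with two fc heads: one for classification and one
    whose logits are compared against the EMA teacher."""
    def __init__(self, feature_dim=128, num_classes=10):
        super().__init__()
        self.trunk = nn.Sequential(      # stand-in for the shared convnet
            nn.Flatten(),
            nn.Linear(28 * 28, feature_dim),
            nn.ReLU(),
        )
        self.class_head = nn.Linear(feature_dim, num_classes)
        self.cons_head = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        features = self.trunk(x)
        # res_loss penalizes the distance between these two outputs, so the
        # consistency head can follow the noisy teacher early in training
        # without dragging the classification head along with it.
        return self.class_head(features), self.cons_head(features)
```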