ImageNet Training Loss Very High (Error)
File "./main.py", line 166, in main train(train_loader, train_loader_len, model, ema_model, ema_model, optimizer, epoch, training_lo File "./main.py", line 492, in train assert not (np.isnan(loss.data[0]) or loss.data[0] > 1e5), 'Loss explosion: {}'.format(loss.data AssertionError: Loss explosion: 1088561.875
I trained with 8 GPUs, which should be close enough to the 10-GPU setting in the provided configuration. Is this expected?
Hi,
Thanks for the report. I am the first author of the paper and wrote most of the code. My first guess would be that I made a mistake when cleaning up the code. (Sorry if that’s indeed the case.)
A few clarifying questions:
- What command did you use to run the code?
- What commit of the repo did you use?
- What versions of Python, PyTorch, NumPy and CUDA are you using?
- What GPUs are you using?
- How long did the network train before the explosion?
I hope to be able to help you, but it may be complicated a bit by the fact that I no longer work at Curious AI and don't have access to the infrastructure that I used to run the experiments.
Hi, Thank you for the response!
I used the suggested command `python -m experiments.imagenet_valid`. All the code was from the latest commit of the repo.
I'm using Python 3.6, PyTorch 0.3.1.post2, NumPy 1.14.2, and CUDA 8.0, on 8 Titan Xp GPUs.
The network trains for about 1500 iterations and then throws the error.
On another note, would it be possible to release the pretrained ImageNet model from this semi-supervised setting? That would be excellent! Thank you!
Okay, I will see if I can find any obvious mistakes I might have made between the version I used for the experiments and the latest published version.
Regarding the trained parameters, I have to ask the Curious AI folks, since those belong to them.
Thank you. I managed to get it training with 8 GPUs. Not sure what the issue was really.
However, I'm getting `Prec@1 0.000 (0.072) Prec@5 0.625 (0.373)` (or sometimes `Prec@5 1.250 (0.372)`), and it's been a few epochs now. This seems quite low. Is it normal?
Hi,
Just a note that I have not investigated this yet due to lack of time and some technical problems. I will try to get to it tomorrow.
Thank you so much! That would be very much appreciated!
Updates:
- I reviewed the changes between the actual experiment and the published code but unfortunately didn't find any mistakes.
- I asked @filipgrano from Curious AI for the trained weights of the run. I hope I put them in a place where they can still be found.
Question:
- What does the first line say when you start the run? The one that starts with `Using these command line args:`
I just got an answer from @filipgrano: unfortunately, the trained parameters are lost. Apparently I saved them in transient storage that has already been purged. That was very short-sighted of me, sorry.
This is what I got when the training started.
```
Using these command line args: --batch-size 320 --labeled-batch-size 160 --lr 0.2 --labels data-local/labels/ilsvrc2012/128000_balanced_labels/00.txt --workers 20 --checkpoint-epochs 1 --evaluation-epochs 1 --dataset imagenet --exclude-unlabeled False --arch resnext152 --ema-decay 0.9997 --consistency-type kl --consistency 10.0 --consistency-rampup 5 --logit-distance-cost 0.01 --weight-decay 5e-05 --epochs 60 --lr-rampdown-epochs 75 --lr-rampup 2 --initial-lr 0.1 --nesterov True
=> creating model 'resnext152'
=> creating EMA model 'resnext152'
```
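(For readers skimming those flags: `--ema-decay 0.9997` is the decay of the teacher's exponential moving average over the student weights. A minimal sketch of the standard mean-teacher EMA step follows; this is an illustration of the method, not necessarily this repo's exact code, and `update_ema_variables` is a name chosen here for clarity.)

```python
def update_ema_variables(model, ema_model, alpha, global_step):
    # Ramp the effective decay up from 0 towards `alpha` so the teacher
    # tracks the student closely during the first training steps.
    alpha = min(1.0 - 1.0 / (global_step + 1), alpha)
    for ema_param, param in zip(ema_model.parameters(), model.parameters()):
        # teacher <- alpha * teacher + (1 - alpha) * student
        ema_param.data.mul_(alpha).add_(param.data, alpha=1 - alpha)
```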
Thank you so much for checking with the company.
Do you think you have the resources to train the ImageNet experiment and release the pre-trained model? That would really help!
At epoch 30 I see similar numbers, and I suspect that the model hasn't learned much yet.

```
Epoch: [31][6140/7207] Time 1.598 (1.600) Data 0.001 (0.024) Class 3.4555 (3.4547) Cons -0.0000 (-0.0000) Prec@1 0.000 (0.070) Prec@5 0.000 (0.363)
```
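(A note on reading those columns: `Prec@1`/`Prec@5` are top-1 and top-5 precision in percent, with the running average in parentheses. A minimal sketch of how such a metric is typically computed, in the style of the common PyTorch ImageNet example; `topk_precision` is a name chosen here, not necessarily this repo's function.)

```python
import torch

def topk_precision(output, target, ks=(1, 5)):
    # `output`: (batch, num_classes) logits; `target`: (batch,) class indices.
    # Returns the top-k precision for each k, in percent.
    maxk = max(ks)
    batch_size = target.size(0)
    # Indices of the maxk highest-scoring classes per example: (batch, maxk).
    _, pred = output.topk(maxk, dim=1, largest=True, sorted=True)
    pred = pred.t()  # (maxk, batch)
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    results = []
    for k in ks:
        correct_k = correct[:k].reshape(-1).float().sum()
        results.append((correct_k * (100.0 / batch_size)).item())
    return results
```

With 1000 classes, chance level is about 0.1% for top-1 and 0.5% for top-5, so running averages around 0.07/0.36 after many epochs mean the model is still at roughly chance level.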
Hi,
Thanks for the console output. It looks correct to me.
Unfortunately, at the moment I personally don’t have the resources to run more experiments. I’m contemplating returning to research within about a month, and at that point I may have a better chance to help you.
Yes, sorry, I forgot to comment on the training speed. I found a validation error curve of the runs in my notes (see below). The final experiment runs are the pink ones; the other curves are smaller models with varying amounts of labels and images and possibly different hyperparameters. As you can see, you should reach roughly 80% accuracy already after 5 epochs. (Why so quickly? Because an epoch is defined as going through the unlabeled examples once, so the training has seen each labeled example dozens of times by that point.) You should see steady improvement for the entire training. Actually, the training did not even reach convergence; I had selected the number of epochs so that I could get the final results before the conference.
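(To make the "dozens of times" concrete, here is back-of-the-envelope arithmetic using the flags shown earlier in this thread, assuming the standard ILSVRC2012 training-set size of 1,281,167 images.)

```python
# Back-of-the-envelope numbers behind "seen dozens of times".
n_train = 1281167        # standard ILSVRC2012 training-set size (assumption)
n_labeled = 128000       # from --labels .../128000_balanced_labels/00.txt
labeled_per_batch = 160  # --labeled-batch-size
unlabeled_per_batch = 320 - labeled_per_batch  # --batch-size minus the labeled part

# One epoch = one pass over the unlabeled examples.
iters_per_epoch = (n_train - n_labeled) // unlabeled_per_batch
visits_per_label_per_epoch = iters_per_epoch * labeled_per_batch / n_labeled

print(iters_per_epoch)             # 7207, matching the [.../7207] in the log above
print(visits_per_label_per_epoch)  # ~9.0, so ~45 visits per labeled image by epoch 5
```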
Curious AI may still have some training logs etc. from the runs. I asked them if I could access that data and share the useful parts with you.
Hi,
Thanks for your detailed response. The console log indicates that at Epoch 31 the prediction accuracy is still 0% for both prec@1 and prec@5. My training does not seem correct according to your graph.
Any updates on the training log? That would be most helpful. Thanks!
Also, if you could provide the pip requirements from `pip freeze` and/or the conda environment details from `conda list -e`, that would be most helpful! Thank you.
@filipgrano @jrosti Any updates on this?
@benathi I shared the training logs with you by email.
If anyone has gotten ImageNet training to work, I'd be thrilled to get some pointers for getting started.
So far I've been getting:
```
Epoch: [0][0/7317] Time 62.236 (62.236) Data 13.409 (13.409) Class 3.5255 (3.5255) Cons 0.0281 (0.0281) Prec@1 0.000 (0.000) Prec@5 1.250 (1.250)
[...]
AssertionError: Loss explosion: 146667.828125
```
To answer @tarvaina's questions:

- What command did you use to run the code?
  `python -m experiments.imagenet_valid`
- What commit of the repo did you use?
  ```
  commit bd4313d5691f3ce4c30635e50fa207f49edf16fe
  Author: Vik Kamath <[email protected]>
  Date:   Thu May 31 14:09:17 2018 +0300

      Add license information
      - Addresses Issue #15
  ```
- What versions of Python, PyTorch, NumPy and CUDA are you using?
  Python 3.5.5, PyTorch 0.3.1.post2, NumPy 1.14.3, CUDA 9.0
- What GPUs are you using?
  8 V100s
- How long did the network train before the explosion?
  It exploded immediately.
I also have not been able to get ImageNet training to run successfully. I tried switching to the MSE consistency loss instead of KL and didn't get any loss explosion; however, the final accuracy is not high, only around 84%.
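(For anyone comparing the two consistency costs, here is a minimal sketch of softmax-MSE and softmax-KL consistency losses in the spirit of this codebase, not a verbatim copy of its loss code, with the reductions written out explicitly so the scale does not depend on a PyTorch version's defaults.)

```python
import torch.nn.functional as F

def softmax_mse_loss(student_logits, teacher_logits):
    # MSE between the two softmax outputs, summed over the batch and
    # divided by the class count to keep the scale manageable.
    assert student_logits.size() == teacher_logits.size()
    num_classes = student_logits.size(1)
    student_probs = F.softmax(student_logits, dim=1)
    teacher_probs = F.softmax(teacher_logits, dim=1)
    return F.mse_loss(student_probs, teacher_probs, reduction='sum') / num_classes

def softmax_kl_loss(student_logits, teacher_logits):
    # KL(teacher || student): the teacher's softmax output is the target.
    assert student_logits.size() == teacher_logits.size()
    student_log_probs = F.log_softmax(student_logits, dim=1)
    teacher_probs = F.softmax(teacher_logits, dim=1)
    return F.kl_div(student_log_probs, teacher_probs, reduction='sum')
```

In training, the returned sums would typically be divided by the minibatch size and scaled by the ramped-up consistency weight (`--consistency 10.0`, `--consistency-rampup 5` above).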
@tarvaina If you could please look into it that would be great.
Thanks for the reports and sorry for the problems. I will try to find (personal and computational) time to look into this next week.
@avital Hi, I have come across the same problem. The loss exploded immediately after the first iteration. Have you found a way to solve it?
I updated the code to be compatible with PyTorch 1.0.0, and the loss explosion never showed up again. I think the issue may result from the different default reduction behavior across PyTorch versions.
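(For context on that hypothesis: PyTorch replaced the old `size_average`/`reduce` flags with a single `reduction` argument around the 0.4/1.0 transition, and the conventions differ by large factors. A small illustration follows; newer PyTorch also offers `reduction='batchmean'` for `F.kl_div`, which matches the per-example convention.)

```python
import torch
import torch.nn.functional as F

student_log_probs = F.log_softmax(torch.randn(4, 1000), dim=1)
teacher_probs = F.softmax(torch.randn(4, 1000), dim=1)

# Summed over everything: scales with both batch size and class count.
summed = F.kl_div(student_log_probs, teacher_probs, reduction='sum')
# 'mean' divides by the number of elements (batch * classes), not the batch size.
elementwise = F.kl_div(student_log_probs, teacher_probs, reduction='mean')
# The per-example ("batch mean") convention most training loops expect:
per_example = summed / student_log_probs.size(0)

print(summed.item(), elementwise.item(), per_example.item())
# These differ by factors of the batch size and the class count; with a
# consistency weight of 10.0 on top, mixing up the conventions could
# plausibly push the total loss past the 1e5 explosion check.
```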
> Hi,
> Thanks for your detailed response. The console log indicates that at Epoch 31 the prediction accuracy is still 0% for both prec@1 and prec@5. My training does not seem correct according to your graph.
> Any updates on the training log? That would be most helpful. Thanks!
@benathi I ran into the same situation. The prec@1 and prec@5 stay at 0% even at epoch 40. How did you solve it? Thanks!