ImageNet Training Loss Very High (Error)
File "./main.py", line 166, in main train(train_loader, train_loader_len, model, ema_model, ema_model, optimizer, epoch, training_lo File "./main.py", line 492, in train assert not (np.isnan(loss.data[0]) or loss.data[0] > 1e5), 'Loss explosion: {}'.format(loss.data AssertionError: Loss explosion: 1088561.875
I trained with 8 GPUs, which should be close enough to the 10-GPU setting in the provided configuration. Is this expected?
Hi,
Thanks for the report. I am the first author of the paper and wrote most of the code. My first guess would be that I made a mistake when cleaning up the code. (Sorry if that’s indeed the case.)
A few clarifying questions:
- What command did you use to run the code?
- What commit of the repo did you use?
- What versions of Python, PyTorch, NumPy and CUDA are you using?
- What GPUs are you using?
- How long did the network train before the explosion?
I hope to be able to help you, but it may be complicated a bit by the fact that I no longer work at Curious AI and don't have access to the infrastructure that I used to run the experiments.
Hi, Thank you for the response!
I used the suggested command `python -m experiments.imagenet_valid`. All the code was from the latest commit of the repo.
I'm using Python 3.6, PyTorch 0.3.1.post2, NumPy 1.14.2, and CUDA 8.0, on 8 Titan Xp GPUs.
The network trains for about 1500 iterations and then throws the error.
On another note, would it be possible to release the pretrained ImageNet model from this semi-supervised setting? That would be excellent! Thank you!
Okay, I will see if I can find any obvious mistakes I might have made between the version I used for the experiments and the latest published version.
Regarding the trained parameters, I have to ask the Curious AI folks, since those belong to them.
Thank you. I managed to get it training with 8 GPUs. Not sure what the issue was really.
However, I'm getting `Prec@1 0.000 (0.072) Prec@5 0.625 (0.373)` (or sometimes `Prec@5 1.250 (0.372)`), and it's been a few epochs now. This seems quite low. Is it normal?
Hi,
Just a note that I have not investigated this yet due to lack of time and some technical problems. I will try to get to it tomorrow.
Thank you so much! That would be very much appreciated!
Updates:
- I reviewed the changes between the actual experiment and the published code but unfortunately didn't find any mistakes.
- I asked @filipgrano from Curious AI for the trained weights of the run. I hope I put them in a place where they can still be found.
Question:
- What does the first line say when you start the run? The one that starts with `Using these command line args:`
I just got an answer from @filipgrano: unfortunately, the trained parameters are lost. Apparently I saved them in transient storage that has already been purged. That was very short-sighted of me, sorry.
This is what I got when the training started.
```
Using these command line args: --batch-size 320 --labeled-batch-size 160 --lr 0.2 --labels data-local/labels/ilsvrc2012/128000_balanced_labels/00.txt --workers 20 --checkpoint-epochs 1 --evaluation-epochs 1 --dataset imagenet --exclude-unlabeled False --arch resnext152 --ema-decay 0.9997 --consistency-type kl --consistency 10.0 --consistency-rampup 5 --logit-distance-cost 0.01 --weight-decay 5e-05 --epochs 60 --lr-rampdown-epochs 75 --lr-rampup 2 --initial-lr 0.1 --nesterov True
=> creating model 'resnext152'
=> creating EMA model 'resnext152'
```
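(For readers skimming those flags: `--ema-decay 0.9997` is the decay of the teacher's exponential moving average over the student weights. A minimal sketch of the standard mean-teacher EMA step follows; this is an illustration of the method, not necessarily this repo's exact code, and `update_ema_variables` is a name chosen here for clarity.)

```python
def update_ema_variables(model, ema_model, alpha, global_step):
    # Ramp the effective decay up from 0 towards `alpha` so the teacher
    # tracks the student closely during the first training steps.
    alpha = min(1.0 - 1.0 / (global_step + 1), alpha)
    for ema_param, param in zip(ema_model.parameters(), model.parameters()):
        # teacher <- alpha * teacher + (1 - alpha) * student
        ema_param.data.mul_(alpha).add_(param.data, alpha=1 - alpha)
```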
Thank you so much for checking with the company.
Do you think you have the resources to train the ImageNet experiment and release the pre-trained model? That would really help!
At epoch 30 I see similar numbers, and I suspect that the model hasn't learned much yet.

```
Epoch: [31][6140/7207] Time 1.598 (1.600) Data 0.001 (0.024) Class 3.4555 (3.4547) Cons -0.0000 (-0.0000) Prec@1 0.000 (0.070) Prec@5 0.000 (0.363)
```
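(A note on reading those columns: `Prec@1`/`Prec@5` are top-1 and top-5 precision in percent, with the running average in parentheses. A minimal sketch of how such a metric is typically computed, in the style of the common PyTorch ImageNet example; `topk_precision` is a name chosen here, not necessarily this repo's function.)

```python
import torch

def topk_precision(output, target, ks=(1, 5)):
    # `output`: (batch, num_classes) logits; `target`: (batch,) class indices.
    # Returns the top-k precision for each k, in percent.
    maxk = max(ks)
    batch_size = target.size(0)
    # Indices of the maxk highest-scoring classes per example: (batch, maxk).
    _, pred = output.topk(maxk, dim=1, largest=True, sorted=True)
    pred = pred.t()  # (maxk, batch)
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    results = []
    for k in ks:
        correct_k = correct[:k].reshape(-1).float().sum()
        results.append((correct_k * (100.0 / batch_size)).item())
    return results
```

With 1000 classes, chance level is about 0.1% for top-1 and 0.5% for top-5, so running averages around 0.07/0.36 after many epochs mean the model is still at roughly chance level.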
Hi,
Thanks for the console output. It looks correct to me.
Unfortunately, at the moment I personally don’t have the resources to run more experiments. I’m contemplating returning to research within about a month, and at that point I may have a better chance to help you.
Yes, sorry, I forgot to comment on the training speed. I found a validation error curve of the runs in my notes (see below). The final experiment runs are the pink ones; the other curves are smaller models with varying amounts of labels and images and possibly different hyperparameters. As you can see, you should reach roughly 80% accuracy already after 5 epochs. (Why so quickly? Because an epoch is defined as going through the unlabeled examples once, so the training has seen each labeled example dozens of times by that point.) You should see steady improvement for the entire training. Actually, the training did not even reach convergence; I had selected the number of epochs so that I could get the final results before the conference.
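(To make the "dozens of times" concrete, here is back-of-the-envelope arithmetic using the flags shown earlier in this thread, assuming the standard ILSVRC2012 training-set size of 1,281,167 images.)

```python
# Back-of-the-envelope numbers behind "seen dozens of times".
n_train = 1281167        # standard ILSVRC2012 training-set size (assumption)
n_labeled = 128000       # from --labels .../128000_balanced_labels/00.txt
labeled_per_batch = 160  # --labeled-batch-size
unlabeled_per_batch = 320 - labeled_per_batch  # --batch-size minus the labeled part

# One epoch = one pass over the unlabeled examples.
iters_per_epoch = (n_train - n_labeled) // unlabeled_per_batch
visits_per_label_per_epoch = iters_per_epoch * labeled_per_batch / n_labeled

print(iters_per_epoch)             # 7207, matching the [.../7207] in the log above
print(visits_per_label_per_epoch)  # ~9.0, so ~45 visits per labeled image by epoch 5
```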
Curious AI may still have some training logs etc. from the runs. I asked them if I could access that data and share the useful parts with you.
Hi,
Thanks for your detailed response. The console log indicates that at Epoch 31 the prediction accuracy is still 0% for both prec@1 and prec@5. My training does not seem correct according to your graph.
Any updates on the training log? That would be most helpful. Thanks!
Also, if you could provide the pip requirements from `pip freeze` and/or the conda environment details from `conda list -e`, that would be most helpful! Thank you.
@filipgrano @jrosti Any updates on this?
@benathi I shared the training logs with you by email.
If anyone has gotten ImageNet training to work, I'd be thrilled to get some pointers for getting started.
So far I've been getting:
```
Epoch: [0][0/7317] Time 62.236 (62.236) Data 13.409 (13.409) Class 3.5255 (3.5255) Cons 0.0281 (0.0281) Prec@1 0.000 (0.000) Prec@5 1.250 (1.250)
[...]
AssertionError: Loss explosion: 146667.828125
```
To answer @tarvaina's questions:

- What command did you use to run the code?
  `python -m experiments.imagenet_valid`
- What commit of the repo did you use?
  ```
  commit bd4313d5691f3ce4c30635e50fa207f49edf16fe
  Author: Vik Kamath <[email protected]>
  Date:   Thu May 31 14:09:17 2018 +0300

      Add license information
      - Addresses Issue #15
  ```
- What versions of Python, PyTorch, NumPy and CUDA are you using?
  Python 3.5.5, PyTorch 0.3.1.post2, NumPy 1.14.3, CUDA 9.0
- What GPUs are you using?
  8 V100s
- How long did the network train before the explosion?
  It exploded immediately.
I also have not been able to get ImageNet training to run successfully. I tried switching to the MSE consistency loss instead of KL and didn't get any loss explosion; however, the final accuracy is not high, only around 84%.
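(For anyone comparing the two consistency costs, here is a minimal sketch of softmax-MSE and softmax-KL consistency losses in the spirit of this codebase, not a verbatim copy of its loss code, with the reductions written out explicitly so the scale does not depend on a PyTorch version's defaults.)

```python
import torch.nn.functional as F

def softmax_mse_loss(student_logits, teacher_logits):
    # MSE between the two softmax outputs, summed over the batch and
    # divided by the class count to keep the scale manageable.
    assert student_logits.size() == teacher_logits.size()
    num_classes = student_logits.size(1)
    student_probs = F.softmax(student_logits, dim=1)
    teacher_probs = F.softmax(teacher_logits, dim=1)
    return F.mse_loss(student_probs, teacher_probs, reduction='sum') / num_classes

def softmax_kl_loss(student_logits, teacher_logits):
    # KL(teacher || student): the teacher's softmax output is the target.
    assert student_logits.size() == teacher_logits.size()
    student_log_probs = F.log_softmax(student_logits, dim=1)
    teacher_probs = F.softmax(teacher_logits, dim=1)
    return F.kl_div(student_log_probs, teacher_probs, reduction='sum')
```

In training, the returned sums would typically be divided by the minibatch size and scaled by the ramped-up consistency weight (`--consistency 10.0`, `--consistency-rampup 5` above).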
@tarvaina If you could please look into it that would be great.
Thanks for the reports and sorry for the problems. I will try to find (personal and computational) time to look into this next week.
@avital Hi, I have come across the same problem. The loss exploded immediately after the first iteration. Have you found a way to solve it?
I updated the code to be compatible with PyTorch 1.0.0, and the loss explosion never showed up again. I think the issue may result from the different default reduction behavior across PyTorch versions.
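(For context on that hypothesis: PyTorch replaced the old `size_average`/`reduce` flags with a single `reduction` argument around the 0.4/1.0 transition, and the conventions differ by large factors. A small illustration follows; newer PyTorch also offers `reduction='batchmean'` for `F.kl_div`, which matches the per-example convention.)

```python
import torch
import torch.nn.functional as F

student_log_probs = F.log_softmax(torch.randn(4, 1000), dim=1)
teacher_probs = F.softmax(torch.randn(4, 1000), dim=1)

# Summed over everything: scales with both batch size and class count.
summed = F.kl_div(student_log_probs, teacher_probs, reduction='sum')
# 'mean' divides by the number of elements (batch * classes), not the batch size.
elementwise = F.kl_div(student_log_probs, teacher_probs, reduction='mean')
# The per-example ("batch mean") convention most training loops expect:
per_example = summed / student_log_probs.size(0)

print(summed.item(), elementwise.item(), per_example.item())
# These differ by factors of the batch size and the class count; with a
# consistency weight of 10.0 on top, mixing up the conventions could
# plausibly push the total loss past the 1e5 explosion check.
```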
> Hi,
> Thanks for your detailed response. The console log indicates that at Epoch 31 the prediction accuracy is still 0% for both prec@1 and prec@5. My training does not seem correct according to your graph.
> Any updates on the training log? That would be most helpful. Thanks!
@benathi I ran into the same situation. The prec@1 and prec@5 stay at 0% even at epoch 40. How did you solve it? Thanks!