
Validation loss vs Training loss in AudioSet training

Open Tomlevron opened this issue 4 years ago • 7 comments

Hi!

First of all, I would like to thank you for sharing your amazing work with everyone. It is truly inspiring and fascinating.

I have a question regarding the difference between the training loss and the validation loss. The validation loss is much higher than the training loss; does that make sense? Isn't the model overfitting?

I also tried to fine-tune the AudioSet-trained model on my own data, and it showed the same difference (with and without augmentations).

Here is an example from the logs: test-full-f10-t10-pTrue-b12-lr1e-5/log_2090852.txt:

train_loss: 0.011128
valid_loss: 0.693989

I'm still new to deep learning so maybe I'm missing something.

Thank you!

Tomlevron avatar Oct 07 '21 09:10 Tomlevron

Thanks for your interest.

I don't think it is an overfitting issue, as you would also see a performance drop in mAP or accuracy on the validation set if the model were overfitted. I think the reason is that we added a Sigmoid on top of the model's output in the inference stage (but not in the training stage) before the loss computation, to make sure mAP/accuracy is calculated correctly. This changes the validation loss. See here.
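A minimal sketch (plain Python, not the repo's actual code) of why this inflates the reported validation loss: `BCEWithLogitsLoss` applies a Sigmoid internally, so feeding it an output that has already been passed through a Sigmoid effectively applies the Sigmoid twice, which pushes even a confident, correct prediction away from the target.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit, target):
    # A naive scalar version of BCEWithLogitsLoss:
    # it applies its own Sigmoid before the cross-entropy.
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

logit, target = 5.0, 1.0  # a confident, correct prediction

train_loss = bce_with_logits(logit, target)           # Sigmoid applied once
valid_loss = bce_with_logits(sigmoid(logit), target)  # extra Sigmoid first

print(f"train-style loss: {train_loss:.4f}")  # ~0.0067
print(f"valid-style loss: {valid_loss:.4f}")  # ~0.3150
```

So the gap between the logged training and validation losses does not by itself indicate overfitting; part of it is an artifact of where the Sigmoid is applied.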

-Yuan

YuanGongND avatar Oct 08 '21 17:10 YuanGongND

Wouldn't it be wrong to train with Softmax and use Sigmoid for mAP? Using Softmax instead of Sigmoid gives a higher mAP value.

hbellafkir avatar Oct 22 '21 13:10 hbellafkir

Could you elaborate on this point?

I don't think we used Softmax during training. The reason we added an extra Sigmoid in inference but not in training is that BCEWithLogitsLoss already includes a Sigmoid.

YuanGongND avatar Oct 22 '21 14:10 YuanGongND

In the case of CrossEntropyLoss, the loss is calculated with Softmax (here), as it is included in the CrossEntropyLoss operation. To my understanding, it is not correct to use Sigmoid for inference when CrossEntropyLoss is used in training. On a custom dataset that I use, switching from Sigmoid to Softmax results in a higher mAP value during inference.

hbellafkir avatar Oct 23 '21 09:10 hbellafkir

> In the case of CrossEntropyLoss, the loss is calculated with Softmax (here) as it is included in the CrossEntropyLoss operation. It is not correct to use Sigmoid for inference when CrossEntropyLoss is used in training for my understanding. on a custom dataset that I use, switching from Sigmoid to Softmax results in a higher mAP value during inference.

@YuanGongND any thoughts on this?

hbellafkir avatar Oct 26 '21 10:10 hbellafkir

Yes, I think you can skip the Sigmoid in inference. It was just used to keep training and inference consistent for multi-label classification tasks (i.e., one audio clip has more than one label).

When you use CrossEntropyLoss, I assume you have a single-label dataset. Using Softmax there might improve mAP, but it won't improve accuracy, and mAP is less important for single-label classification; that's why we use accuracy in the ESC-50 and SpeechCommands recipes.

For multi-label classification, adding the Sigmoid won't change mAP either, as Sigmoid is monotonic, so I think you can also remove it; however, that could impact the ensemble performance.
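A small sketch (plain Python, with hypothetical logits) of the monotonicity point: mAP ranks samples per class, Sigmoid is applied element-wise and is order-preserving, so it can never change a class's ranking across clips. Softmax, by contrast, normalizes over the other classes within each clip, so a strong competing class can flip the ranking.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

# Hypothetical logits [class "dog", competing class] for two clips.
clip_a = [2.0, 5.0]   # dog logit higher, but a strong competitor
clip_b = [1.5, -1.0]  # dog logit lower, weak competitor

# Sigmoid preserves the per-class order across clips: a > b.
sig_a, sig_b = sigmoid(clip_a[0]), sigmoid(clip_b[0])

# Softmax mixes in the other class, and here it flips the order.
soft_a, soft_b = softmax(clip_a)[0], softmax(clip_b)[0]

print(sig_a > sig_b)   # True: same ranking as the raw logits
print(soft_a > soft_b) # False: competitor in clip_a flipped it
```

This is consistent with both observations above: Sigmoid cannot change mAP relative to raw logits, while Softmax can (for better or worse, depending on the dataset).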

YuanGongND avatar Oct 26 '21 17:10 YuanGongND

After some investigation, it seems there is also a logging bug: the difference between the train and eval losses is over-estimated in the code.

In traintest.py, the loss_meter is reset only once per epoch, but its average is printed every 1000 iterations, so the large loss values from the early iterations keep accumulating into later printouts.

Changing 'loss_meter.avg' to 'loss_meter.val' here can alleviate the problem. But I would suggest doing an offline loss evaluation (i.e., computing the training loss with the best checkpoint model after the training process finishes); that would be the most accurate solution.
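A minimal illustration of the logging issue, using the common AverageMeter pattern (a simplified stand-in for the meter in traintest.py, with simulated losses): when the meter is reset only per epoch, the printed `avg` still carries the large early-epoch losses long after the per-batch loss (`val`) has dropped.

```python
class AverageMeter:
    """Common meter pattern: tracks the running average (avg)
    and the most recent value (val)."""
    def __init__(self):
        self.val = 0.0
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, value, n=1):
        self.val = value
        self.sum += value * n
        self.count += n
        self.avg = self.sum / self.count

meter = AverageMeter()
# Simulated per-batch losses: large early in the epoch, small later.
losses = [2.0] * 100 + [0.01] * 900
for loss in losses:
    meter.update(loss)

# The latest batch loss is tiny...
print(f"val: {meter.val:.3f}")  # 0.010
# ...but the printed average still mixes in the early losses:
# (2.0*100 + 0.01*900) / 1000 = 0.209
print(f"avg: {meter.avg:.3f}")  # 0.209
```

Logging `val` (or resetting the meter at each logging interval) reports the current loss level; the offline evaluation with the best checkpoint remains the most accurate comparison against the validation loss.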

YuanGongND avatar Mar 04 '22 21:03 YuanGongND