
Validation Loss Exploding to 10^5

lsaeuro opened this issue 2 years ago

🐛 Bug

Validation loss is exploding, and I figured out that the problem already appears in the forward method of my model. During validation, the input tensors after a batch norm or an activation function do not change much, while during training they are rescaled.

I found something similar here in the Lightning Forum. Are there particular modes during validation that I should turn on or off?

Environment

  • CUDA:
    • GPU:
      • NVIDIA RTX A6000
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.23.0
    • pyTorch_debug: False
    • pyTorch_version: 1.11.0+cu102
    • pytorch-lightning: 1.7.0
    • tqdm: 4.64.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.13
    • version: #44~20.04.1-Ubuntu SMP Fri Jun 24 13:27:29 UTC 2022

lsaeuro avatar Aug 08 '22 12:08 lsaeuro

During validation, the input tensors after a batch norm or an activation function do not change much, while during training they are rescaled.

Yes, this is the desired behavior. Lightning calls model.eval(), which stops layers like batch norm from updating their running statistics and makes them normalize with the stored statistics instead.
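To see this train/eval difference in isolation, here is a minimal standalone sketch with a plain nn.BatchNorm2d and toy tensors (not your model):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
x = torch.randn(8, 3, 16, 16)

bn.train()              # training mode: normalizes with batch statistics
_ = bn(x)               # and updates running_mean / running_var
print(bn.running_mean)  # running stats have moved toward the batch stats

bn.eval()               # eval mode, which Lightning sets during validation:
_ = bn(x)               # normalizes with the frozen running stats instead
print(bn.running_mean)  # unchanged; no update happens in eval mode
```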

I suggest the old sanity check: use the same data for training and validation, and exactly the same code in training_step and validation_step. You should then see the same loss value on the validation set as on the training set. Next, switch to a validation set with different samples than in training. If you now see the validation loss exploding, you know the reason: overfitting. If the first experiment already fails to show the expected behavior, you have a bug in the model.
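Here is a hypothetical sketch of that setup; the module, _shared_step, and the MSE loss are placeholders for your own model and objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class SanityCheckModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 1)  # stand-in for the real model

    def forward(self, x):
        return self.net(x)

    def _shared_step(self, batch):
        x, y = batch
        return F.mse_loss(self(x), y)

    # exactly the same logic in both steps, per the sanity check
    def training_step(self, batch, batch_idx):
        loss = self._shared_step(batch)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self._shared_step(batch))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

# pass the same DataLoader twice so the two losses are directly comparable:
# trainer.fit(SanityCheckModule(), train_dataloaders=loader, val_dataloaders=loader)
```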

For further assistance with model training, please open a discussion on our community page. If you believe a bug exists, please provide code to reproduce what you are seeing. Thanks!

awaelchli avatar Aug 08 '22 17:08 awaelchli

Thank you @awaelchli, indeed this was the problem:

During validation, the input tensors after a batch norm or an activation function do not change much, while during training they are rescaled.

I solved it by setting track_running_stats=False on the BatchNorm2d layers, but I don't know whether this is the wrong workaround; I will also try what you suggested. Thank you!
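For reference, a minimal sketch of this workaround (the channel count is a placeholder and the helper name is mine). Note that on an already-constructed layer the flag alone may not be enough: eval mode only falls back to batch statistics once the running-stat buffers are cleared, at least in PyTorch 1.11's _BatchNorm.forward:

```python
import torch.nn as nn

# constructed with the flag, the layer keeps no running stats and always
# normalizes with the current batch statistics, in train and eval mode alike
bn = nn.BatchNorm2d(64, track_running_stats=False)

# patching an existing model: clearing the buffers is what actually makes
# eval mode use batch statistics
def disable_running_stats(model: nn.Module) -> None:
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.track_running_stats = False
            m.running_mean = None
            m.running_var = None
```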

lsaeuro avatar Aug 10 '22 15:08 lsaeuro

Let me know if you have further questions about this. Instead of setting track_running_stats=False, you could alternatively tune the momentum parameter of the batch norm layers.
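A brief sketch of the momentum alternative (the values here are illustrative, not recommendations):

```python
import torch.nn as nn

# PyTorch's batch norm momentum defaults to 0.1 and is used as
#   running_stat = (1 - momentum) * running_stat + momentum * batch_stat,
# so a larger value makes the running stats track recent batches more closely
model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.BatchNorm2d(64), nn.ReLU())

for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.momentum = 0.3  # illustrative value; tune against your data
```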

awaelchli avatar Aug 15 '22 11:08 awaelchli