
Validation Loss Exploding to 10^5

lsaeuro opened this issue 2 years ago

🐛 Bug

Validation loss is exploding, and I figured out that the problem already appears in the forward method of my model. During validation, the input tensors after a batch norm or an activation function do not change much, while during training they are rescaled.

I found something similar here in the Lightning Forum. Are there particular modes during validation that I should turn on or off?

Environment

  • CUDA:
    • GPU:
      • NVIDIA RTX A6000
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.23.0
    • pyTorch_debug: False
    • pyTorch_version: 1.11.0+cu102
    • pytorch-lightning: 1.7.0
    • tqdm: 4.64.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.13
    • version: #44~20.04.1-Ubuntu SMP Fri Jun 24 13:27:29 UTC 2022

lsaeuro avatar Aug 08 '22 12:08 lsaeuro

During validation, the input tensors after a batch norm or an activation function do not change much, while during training they are rescaled.

Yes, this is the desired behavior. Lightning calls model.eval(), which stops layers like batch norm from updating their running statistics and makes them normalize with the stored statistics instead.
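To see this train/eval difference in isolation, here is a minimal standalone sketch with a plain nn.BatchNorm2d and toy tensors (not your model):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
x = torch.randn(8, 3, 16, 16)

bn.train()              # training mode: normalizes with batch statistics
_ = bn(x)               # and updates running_mean / running_var
print(bn.running_mean)  # running stats have moved toward the batch stats

bn.eval()               # eval mode, which Lightning sets during validation:
_ = bn(x)               # normalizes with the frozen running stats instead
print(bn.running_mean)  # unchanged; no update happens in eval mode
```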

I suggest the old sanity check: use the same data for training and validation, and exactly the same code in training_step and validation_step. You should then see the same loss value on the validation set as on the training set. Next, switch to a validation set with different samples than in training. If you now see the validation loss exploding, you know the reason: overfitting. If the first experiment already fails to show the expected behavior, you have a bug in the model.
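Here is a hypothetical sketch of that setup; the module, _shared_step, and the MSE loss are placeholders for your own model and objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class SanityCheckModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 1)  # stand-in for the real model

    def forward(self, x):
        return self.net(x)

    def _shared_step(self, batch):
        x, y = batch
        return F.mse_loss(self(x), y)

    # exactly the same logic in both steps, per the sanity check
    def training_step(self, batch, batch_idx):
        loss = self._shared_step(batch)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self._shared_step(batch))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

# pass the same DataLoader twice so the two losses are directly comparable:
# trainer.fit(SanityCheckModule(), train_dataloaders=loader, val_dataloaders=loader)
```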

For further assistance with model training, please open a discussion on our community page. If you believe a bug exists, please provide code to reproduce what you are seeing. Thanks!

awaelchli avatar Aug 08 '22 17:08 awaelchli

Thank you @awaelchli, indeed this was the problem:

During validation, the input tensors after a batch norm or an activation function do not change much, while during training they are rescaled.

I solved it by setting track_running_stats=False on the BatchNorm2d layers, but I don't know whether this is the wrong workaround; I will also try what you suggested. Thank you!
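For reference, a minimal sketch of this workaround (the channel count is a placeholder and the helper name is mine). Note that on an already-constructed layer the flag alone may not be enough: eval mode only falls back to batch statistics once the running-stat buffers are cleared, at least in PyTorch 1.11's _BatchNorm.forward:

```python
import torch.nn as nn

# constructed with the flag, the layer keeps no running stats and always
# normalizes with the current batch statistics, in train and eval mode alike
bn = nn.BatchNorm2d(64, track_running_stats=False)

# patching an existing model: clearing the buffers is what actually makes
# eval mode use batch statistics
def disable_running_stats(model: nn.Module) -> None:
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.track_running_stats = False
            m.running_mean = None
            m.running_var = None
```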

lsaeuro avatar Aug 10 '22 15:08 lsaeuro

Let me know if you have further questions about this. Instead of setting track_running_stats=False, you could alternatively tune the momentum parameter of the batch norm layers.
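A brief sketch of the momentum alternative (the values here are illustrative, not recommendations):

```python
import torch.nn as nn

# PyTorch's batch norm momentum defaults to 0.1 and is used as
#   running_stat = (1 - momentum) * running_stat + momentum * batch_stat,
# so a larger value makes the running stats track recent batches more closely
model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.BatchNorm2d(64), nn.ReLU())

for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.momentum = 0.3  # illustrative value; tune against your data
```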

awaelchli avatar Aug 15 '22 11:08 awaelchli