pytorch-lightning
Validation Loss Exploding to 10^5
🐛 Bug
Validation loss is exploding, and I figured out that the problem already appears in the forward method of my model.
During validation, the input tensors after a batch norm or an activation function do not change much, while during training they are rescaled.
I found something similar in the Lightning Forum. Are there particular modes I should turn on or off for validation?
Environment
- CUDA:
  - GPU:
    - NVIDIA RTX A6000
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.23.0
  - pyTorch_debug: False
  - pyTorch_version: 1.11.0+cu102
  - pytorch-lightning: 1.7.0
  - tqdm: 4.64.0
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.13
  - version: #44~20.04.1-Ubuntu SMP Fri Jun 24 13:27:29 UTC 2022
> During validation, the input tensors after a batch norm or an activation function do not change much, while during training they are rescaled.
Yes, this is the desired behavior. Lightning calls model.eval() before running validation, which switches layers like batch norm to normalize with their running statistics and stops updating those statistics.
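A minimal sketch (not from the issue) of that difference: in train() mode, BatchNorm2d updates its running statistics from each batch, while in eval() mode it normalizes with the stored statistics and leaves them untouched.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
x = torch.randn(8, 3, 16, 16) * 5 + 10   # input far from the initial (0, 1) running stats

bn.train()
_ = bn(x)                                 # updates running_mean / running_var
print(bn.running_mean)                    # moved toward the batch mean

bn.eval()
before = bn.running_mean.clone()
_ = bn(x)                                 # normalizes with the stored stats
print(torch.equal(before, bn.running_mean))  # True: stats are frozen in eval mode
```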
I suggest the old sanity check: use the same data for training and validation, and exactly the same code in training_step and validation_step. You should then see the same loss value on the validation set as on the training set. Next, switch to a validation set whose samples differ from the training set. If you now see the validation loss exploding, you know the reason: overfitting. Otherwise, if this experiment does not show the expected behavior, you have a bug in the model.
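Here is a self-contained sketch of that sanity check (the DebugModel module and the random dataset are placeholders, not from the issue): both steps share the exact same loss computation, and the same dataloader is passed for training and validation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class DebugModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(16, 32), torch.nn.BatchNorm1d(32),
            torch.nn.ReLU(), torch.nn.Linear(32, 1)
        )

    def _shared_step(self, batch):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def training_step(self, batch, batch_idx):
        loss = self._shared_step(batch)   # identical code path ...
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self._shared_step(batch)   # ... as the training step
        self.log("val_loss", loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
loader = DataLoader(dataset, batch_size=32)

trainer = pl.Trainer(max_epochs=5, logger=False, enable_checkpointing=False)
# Same loader for training and validation: the two logged losses should roughly agree.
trainer.fit(DebugModel(), train_dataloaders=loader, val_dataloaders=loader)
```

Once the two losses roughly agree, swap val_dataloaders for a loader over held-out data and watch whether the gap opens up.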
For further assistance with model training, please open a discussion on our community page. If you believe this is a bug, please provide code that reproduces what you are seeing. Thanks!
Thank you @awaelchli, indeed this was the problem:
> During validation, the input tensors after a batch norm or an activation function do not change much, while during training they are rescaled.
I solved it by setting track_running_stats=False on the BatchNorm2d layers, but I don't actually know whether this is the wrong workaround. I will also try what you suggested. Thank you!
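For reference, a short sketch of that workaround (illustrative values only): with track_running_stats=False the layer keeps no running statistics and always normalizes with the statistics of the current batch, even in eval mode, so validation outputs then depend on the composition of each validation batch.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64, track_running_stats=False)
print(bn.running_mean)        # None: no running statistics are stored

x = torch.randn(8, 64, 16, 16)
bn.eval()
y = bn(x)                     # still normalizes with the batch mean/var of x
print(y.mean().abs() < 1e-2)  # output is approximately zero-mean per channel
```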
Let me know if you have further questions about this. Instead of setting track_running_stats=False, you could alternatively try tuning the momentum parameter of the batch norm layer.
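A hedged sketch of that alternative (the momentum value 0.3 is illustrative, not a recommendation): PyTorch updates the running statistics as running = (1 - momentum) * running + momentum * batch, so a larger momentum lets them track recent batches more closely.

```python
import torch.nn as nn

# Set momentum at construction time (PyTorch's default is 0.1) ...
bn = nn.BatchNorm2d(64, momentum=0.3)

# ... or adjust it on the batch norm layers of an existing model.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.momentum = 0.3  # illustrative value; tune per experiment
```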