Stratified-Transformer
Overfitting on S3DIS during training
Hi, I am training Stratified Transformer with the same settings as yours, except that I used max_batch_points: 600000 and batch_size: 24 with 3 GPUs. I got a max mIoU of 0.70111 on validation, but the validation loss curve looks like this: [validation loss plot]
The training loss, on the other hand, is decreasing smoothly. Why is the model overfitting so badly?
It is weird to see such a severe overfitting problem. Can you try the default configuration to see whether the problem still exists? I wonder whether the modified training parameters have some side effect here. BTW, if you get 0.701 validation mIoU, you can already reproduce the results reported in our paper using the test.py script.
OK, I'll try the default config and get back to you. BTW, you used a batch size of 8 with 4 GPUs, so each GPU receives a batch size of 2. If I were to use 3 GPUs, I ought to use a batch size of 6 to match the default, right?
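For reference, a minimal sketch of that per-GPU arithmetic, assuming the config batch_size is the global batch that gets split evenly across the GPUs (the function and variable names below are just illustrative, not from the repo's code):

```python
# Hypothetical helper: relate the global batch_size to the per-GPU batch.
def per_gpu_batch(global_batch_size: int, num_gpus: int) -> int:
    assert global_batch_size % num_gpus == 0, "batch_size should divide evenly across GPUs"
    return global_batch_size // num_gpus

print(per_gpu_batch(8, 4))   # default config: 2 samples per GPU
print(per_gpu_batch(6, 3))   # 3 GPUs with batch_size 6: 2 per GPU (matches the default)
print(per_gpu_batch(24, 3))  # the first run in this issue: 8 per GPU (4x the default)
```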
But if you use a different batch size, you should also modify the learning rate and other related parameters accordingly.
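A minimal sketch of that kind of adjustment, assuming the common linear-scaling rule (scale the base learning rate by the ratio of the new global batch size to the default one); the base LR below is a placeholder, so take the real value from the released s3dis config:

```python
# Hypothetical linear LR scaling with the global batch size; not necessarily
# the exact rule used for the released configs.
DEFAULT_BATCH_SIZE = 8
DEFAULT_BASE_LR = 0.006  # placeholder value; check the released config for the real one

def scaled_lr(new_batch_size, base_lr=DEFAULT_BASE_LR, base_batch=DEFAULT_BATCH_SIZE):
    return base_lr * new_batch_size / base_batch

print(scaled_lr(24))  # batch_size 24 -> 3x the default learning rate
```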
I kept the batch size at 8 and used 4 GPUs. The validation mIoU again seems to peak around 0.70, but I observe the same validation loss curve. Did you observe this during your training?
Not yet. I have run the training multiple times, often got the best validation model in the last 20 epochs, and the val mIoU does not decrease, as shown in our training log (released in the README.md file).
I am getting the following curves (this is my second run, with a batch size of 8): [training and validation curves]
As you can see, everything looks good except the validation loss. My question is whether this is normal behavior or something unexpected.
Thank you for pointing out this observation. Actually, I was not aware of this before, but here are some hints:
1. Although loss_val may increase later on, miou_val seems to keep increasing. The divergence between these two metrics is a little weird.
2. Before epoch 60, the learning rate is fixed, so this may cause the overfitting issue. To judge whether the training is normal, we still need to see the curves for the last 40 epochs.
But overall, I think that as long as the validation mIoU is strong enough, the training is normal, even if there is some instability along the way.
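To make hint 2 concrete, here is a short PyTorch sketch of a MultiStepLR-style schedule where the learning rate stays constant for the first 60 epochs and then decays; the milestones, decay factor, and base LR below are assumptions for illustration, not necessarily the released settings:

```python
import torch.nn as nn
import torch.optim as optim

# Toy model/optimizer just to show the schedule shape (not the repo's model).
model = nn.Linear(8, 8)
optimizer = optim.AdamW(model.parameters(), lr=0.006)  # base LR is a placeholder

# Assumed schedule: constant LR until epoch 60, then decay by 0.1 at epochs 60 and 80.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)

for epoch in range(100):
    # ... one epoch of training would go here ...
    optimizer.step()   # no-op here; kept only so the optimizer/scheduler step order stays conventional
    scheduler.step()
    if epoch in (0, 30, 60, 80, 99):
        print(epoch, scheduler.get_last_lr())
```

The flat section before epoch 60 is the interval hint 2 refers to.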
@X-Lai Thank you for your work! I noticed that you did not mention the number of parameters in your paper. Could you please tell me the number of parameters of the model trained on S3DIS?
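In case it helps while waiting for the exact figure, a generic way to count parameters of any PyTorch model; the nn.Linear below is only a stand-in, so swap in the Stratified Transformer instance built by the training script to get the real number:

```python
import torch.nn as nn

def count_parameters(model):
    """Return (total, trainable) parameter counts for any nn.Module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Stand-in module for a runnable example; replace with the real model object.
total, trainable = count_parameters(nn.Linear(48, 13))
print(f"total: {total / 1e6:.3f}M  trainable: {trainable / 1e6:.3f}M")
```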