
Overfitting on S3DIS during training

QasimKhan5x opened this issue 2 years ago · 9 comments

Hi, I am training Stratified Transformer with the same settings as yours, except that I used max_batch_points: 600000 and batch_size: 24 with 3 GPUs. I got a max mIoU of 0.70111 on validation, but the validation loss looks like this: [screenshot: validation loss curve]

The training loss, on the other hand, is decreasing smoothly. Why is the model overfitting so badly?

QasimKhan5x avatar Oct 12 '22 15:10 QasimKhan5x

It is strange to see such severe overfitting. Can you try the default configuration and check whether the problem still exists? I suspect the modified training parameters may have a side effect here. BTW, if you get 0.701 validation mIoU, you can already reproduce the results reported in our paper using the test.py script.

X-Lai avatar Oct 13 '22 13:10 X-Lai

Ok I'll try the default config and get back to you. BTW you used a batch size of 8 with 4 GPUs, so each GPU receives a batch size of 2. If I were to use 3 GPUs, I ought to use a batch size of 6 to match the default, right?

QasimKhan5x avatar Oct 13 '22 13:10 QasimKhan5x

But if you use a different batch size, you should also modify the learning rate and other related parameters accordingly.
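For example (just a sketch of the common linear-scaling heuristic, not a setting we have validated for this repo), you could scale the learning rate with the effective batch size:

```python
# Sketch of the linear scaling rule: scale the base learning rate by
# (new effective batch size / default effective batch size).
def scaled_lr(base_lr: float, default_batch_size: int, new_batch_size: int) -> float:
    return base_lr * new_batch_size / default_batch_size

base_lr = 0.006  # placeholder; use the base_lr from the released config
print(scaled_lr(base_lr, default_batch_size=8, new_batch_size=24))  # 0.018
```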

X-Lai avatar Oct 15 '22 10:10 X-Lai

I kept the batch size at 8 and used 4 GPUs. The validation mIoU seems to peak around 70 again but the same validation loss curve is observed. Did you observe this during training?

QasimKhan5x avatar Oct 15 '22 10:10 QasimKhan5x

No. I have run it multiple times and often got the best validation model in the last 20 epochs, and the val mIoU does not decrease, as shown in our training log (released in the README.md file).

X-Lai avatar Oct 15 '22 11:10 X-Lai

I am getting the following curves (this is my second run, with a batch size of 8):

[screenshot: training/validation loss and mIoU curves]

As you can see, everything looks good except the validation loss. My question is whether this is normal behavior or unexpected.

QasimKhan5x avatar Oct 15 '22 11:10 QasimKhan5x

Thank you for pointing out this observation. Actually, I was not aware of this before, but here are some hints:

1. Although loss_val increases later, miou_val seems to continue increasing. The divergence between these two metrics is a little strange.
2. Before epoch 60, the learning rate is fixed, so this may contribute to the overfitting. To judge whether the training is normal, we still need to see the curves for the last 40 epochs.
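To illustrate point 1 with a toy sketch (plain PyTorch, not code from this repo): cross-entropy can rise while accuracy (a rough stand-in for mIoU) stays flat, because a few increasingly over-confident wrong predictions dominate the loss even though the argmax labels barely change:

```python
import torch
import torch.nn.functional as F

# 100 points, 2 classes, ground truth is class 0 everywhere; 95 predictions are correct.
labels = torch.zeros(100, dtype=torch.long)

# Early in training: mildly confident predictions.
logits_early = torch.zeros(100, 2)
logits_early[:95, 0] = 2.0   # correct points
logits_early[95:, 1] = 2.0   # wrong points

# Later: the model is much more confident, including on its mistakes.
logits_late = torch.zeros(100, 2)
logits_late[:95, 0] = 8.0
logits_late[95:, 1] = 8.0

for name, logits in [("early", logits_early), ("late", logits_late)]:
    loss = F.cross_entropy(logits, labels)
    acc = (logits.argmax(dim=1) == labels).float().mean()
    print(f"{name}: loss={loss.item():.3f}, acc={acc.item():.2f}")

# Accuracy stays at 0.95 in both cases, but the "late" loss is higher
# (~0.40 vs ~0.23) because the 5 confident mistakes are heavily penalized.
```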

X-Lai avatar Oct 15 '22 11:10 X-Lai

But overall, I think that as long as the validation mIoU is strong enough, the training is normal, even if there is some instability along the way.

X-Lai avatar Oct 15 '22 11:10 X-Lai

@X-Lai Thank you for your work! I noticed that you did not mention the number of parameters in your paper. Can you please tell me the number of parameters for the model on S3DIS?
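(I know I could count them myself with something like the generic snippet below, assuming `model` is the Stratified Transformer built from the S3DIS config, but I would like to confirm the official number.)

```python
import torch

def count_parameters(model: torch.nn.Module) -> int:
    # Total number of parameters in any PyTorch model.
    return sum(p.numel() for p in model.parameters())

# Example usage once `model` has been constructed:
# print(f"{count_parameters(model) / 1e6:.2f} M parameters")
```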

xindeng98 avatar Oct 18 '22 10:10 xindeng98