Stratified-Transformer
Overfitting on S3DIS during training
Hi, I am training Stratified Transformer with the same settings as yours, except that I used max_batch_points: 600000 and batch_size: 24 with 3 GPUs. I got a max mIoU of 0.70111 on validation, but the validation loss curve looks like this: [validation loss plot]
The training loss, on the other hand, is decreasing smoothly. Why is the model overfitting so badly?
It is weird to see such a severe overfitting problem. Can you try the default configuration to see whether the problem still exists? I wonder whether the modified training parameters have some side effect here. BTW, if you get 0.701 validation mIoU, you can already reproduce the results reported in our paper using the test.py script.
OK, I'll try the default config and get back to you. BTW, you used a batch size of 8 with 4 GPUs, so each GPU receives a batch size of 2. If I were to use 3 GPUs, I ought to use a batch size of 6 to match the default, right?
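For reference, a minimal sketch of that per-GPU arithmetic, assuming the config batch_size is the global batch that gets split evenly across the GPUs (the function and variable names below are just illustrative, not from the repo's code):

```python
# Hypothetical helper: relate the global batch_size to the per-GPU batch.
def per_gpu_batch(global_batch_size: int, num_gpus: int) -> int:
    assert global_batch_size % num_gpus == 0, "batch_size should divide evenly across GPUs"
    return global_batch_size // num_gpus

print(per_gpu_batch(8, 4))   # default config: 2 samples per GPU
print(per_gpu_batch(6, 3))   # 3 GPUs with batch_size 6: 2 per GPU (matches the default)
print(per_gpu_batch(24, 3))  # the first run in this issue: 8 per GPU (4x the default)
```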
But if you use a different batch size, you should also modify the learning rate and other related parameters accordingly.
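A minimal sketch of that kind of adjustment, assuming the common linear-scaling rule (scale the base learning rate by the ratio of the new global batch size to the default one); the base LR below is a placeholder, so take the real value from the released s3dis config:

```python
# Hypothetical linear LR scaling with the global batch size; not necessarily
# the exact rule used for the released configs.
DEFAULT_BATCH_SIZE = 8
DEFAULT_BASE_LR = 0.006  # placeholder value; check the released config for the real one

def scaled_lr(new_batch_size, base_lr=DEFAULT_BASE_LR, base_batch=DEFAULT_BATCH_SIZE):
    return base_lr * new_batch_size / base_batch

print(scaled_lr(24))  # batch_size 24 -> 3x the default learning rate
```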
I kept the batch size at 8 and used 4 GPUs. The validation mIoU again seems to peak around 0.70, but I observe the same validation loss curve. Did you observe this during your training?
Not yet. I have run the training multiple times, often got the best validation model in the last 20 epochs, and the val mIoU does not decrease, as shown in our training log (released in the README.md file).
I am getting the following curves (this is my second run, with a batch size of 8): [training and validation curves]
As you can see, everything looks good except the validation loss. My question is whether this is normal behavior or something unexpected.
Thank you for pointing out this observation. Actually, I was not aware of this before, but here are some hints:
1. Although loss_val may increase later on, miou_val seems to keep increasing. The divergence between these two metrics is a little weird.
2. Before epoch 60, the learning rate is fixed, so this may cause the overfitting issue. To judge whether the training is normal, we still need to see the curves for the last 40 epochs.
But overall, I think that as long as the validation mIoU is strong enough, the training is normal, even if there is some instability along the way.
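To make hint 2 concrete, here is a short PyTorch sketch of a MultiStepLR-style schedule where the learning rate stays constant for the first 60 epochs and then decays; the milestones, decay factor, and base LR below are assumptions for illustration, not necessarily the released settings:

```python
import torch.nn as nn
import torch.optim as optim

# Toy model/optimizer just to show the schedule shape (not the repo's model).
model = nn.Linear(8, 8)
optimizer = optim.AdamW(model.parameters(), lr=0.006)  # base LR is a placeholder

# Assumed schedule: constant LR until epoch 60, then decay by 0.1 at epochs 60 and 80.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)

for epoch in range(100):
    # ... one epoch of training would go here ...
    optimizer.step()   # no-op here; kept only so the optimizer/scheduler step order stays conventional
    scheduler.step()
    if epoch in (0, 30, 60, 80, 99):
        print(epoch, scheduler.get_last_lr())
```

The flat section before epoch 60 is the interval hint 2 refers to.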
@X-Lai Thank you for your work! I noticed that you did not mention the number of parameters in your paper. Could you please tell me the number of parameters of the model trained on S3DIS?
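In case it helps while waiting for the exact figure, a generic way to count parameters of any PyTorch model; the nn.Linear below is only a stand-in, so swap in the Stratified Transformer instance built by the training script to get the real number:

```python
import torch.nn as nn

def count_parameters(model):
    """Return (total, trainable) parameter counts for any nn.Module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Stand-in module for a runnable example; replace with the real model object.
total, trainable = count_parameters(nn.Linear(48, 13))
print(f"total: {total / 1e6:.3f}M  trainable: {trainable / 1e6:.3f}M")
```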