mmsegmentation

mIoU and acc look normal for the first several thousand epochs but all become 0 after about 80000 epochs

Open mengxia1994 opened this issue 2 years ago • 3 comments

I tried SegFormer b0, b0-tiny, and b1 to train several custom datasets (different models on different datasets for different purposes). All of them show the same problem. For the first stretch of training, for example up to 20000 or 80000 epochs, the model keeps improving, but after that it just outputs 0 for every class. The mIoU and acc were still improving before that happened, and I believe the model is still underfitting. However, the zero outputs stop me from continuing. It doesn't look like vanishing gradients, does it?

mengxia1994, Aug 03 '22 10:08

I suggest you save checkpoints at different times, for example at 1k, 10k, and 20k epochs, and check whether the model weights are abnormal.
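One way to make "check whether the model weights are abnormal" concrete is to scan each parameter for NaN/Inf values and unusually large magnitudes. The following is only a minimal sketch using the standard library: the function names, the `max_abs_limit` threshold, and the toy parameter names are all hypothetical, and it assumes the checkpoint has already been converted into a mapping from parameter name to a flat list of numbers.

```python
import math
from typing import Dict, Iterable, List


def summarize_weights(state: Dict[str, Iterable[float]]) -> Dict[str, dict]:
    """For each parameter, count non-finite values and record the largest finite magnitude."""
    report = {}
    for name, values in state.items():
        vals = list(values)
        finite = [v for v in vals if math.isfinite(v)]
        report[name] = {
            "non_finite": len(vals) - len(finite),
            "max_abs": max((abs(v) for v in finite), default=0.0),
        }
    return report


def suspicious_params(report: Dict[str, dict], max_abs_limit: float = 1e3) -> List[str]:
    """Names of parameters containing NaN/Inf or exceeding the (arbitrary) magnitude limit."""
    return [
        name
        for name, stats in report.items()
        if stats["non_finite"] > 0 or stats["max_abs"] > max_abs_limit
    ]


# Toy data with hypothetical parameter names, standing in for a real checkpoint.
toy_state = {
    "backbone.layer0.weight": [0.12, -0.30, 0.05],
    "decode_head.conv_seg.weight": [float("nan"), 0.4],
    "decode_head.conv_seg.bias": [5.0e4],
}
print(suspicious_params(summarize_weights(toy_state)))
```

With PyTorch installed, the mapping could presumably be built from a saved file with something like `{k: v.flatten().tolist() for k, v in torch.load(path, map_location="cpu")["state_dict"].items()}`, assuming the checkpoint stores its weights under a `state_dict` key. Running the same summary on a good and a collapsed checkpoint and comparing the reports parameter by parameter shows where the two diverge.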

MeowZheng, Aug 03 '22 16:08

> I suggest you try to save the checkpoints at different times, like 1k epoch 10k epoch 20k epoch, and check whether the model weights are abnormal.

Thanks~ I did. I checked the checkpoint .pth files and compared the weights between an epoch with normal output and an abnormal one, and they look OK to me. Please look at the following test results.

iter_100000 on test dataset:

+-------------+-------+-------+
| Class       | IoU   | Acc   |
+-------------+-------+-------+
| background  | 96.83 | 98.7  |
| white_lane  | 34.82 | 36.14 |
| yellow_lane | 59.98 | 70.05 |
| trans_lane  | 33.47 | 47.82 |
| cone        | 20.61 | 67.83 |
| car         | 0.0   | 0.0   |
| Truck       | 74.57 | 89.85 |
+-------------+-------+-------+

Summary:

+--------+-------+-------+-------+
| Scope  | mIoU  | mAcc  | aAcc  |
+--------+-------+-------+-------+
| global | 45.76 | 58.63 | 96.96 |
+--------+-------+-------+-------+

iter_100000 on train dataset:

+-------------+-------+-------+
| Class       | IoU   | Acc   |
+-------------+-------+-------+
| background  | 96.9  | 98.72 |
| white_lane  | 34.86 | 36.16 |
| yellow_lane | 60.27 | 70.55 |
| trans_lane  | 32.86 | 47.55 |
| cone        | 25.25 | 68.03 |
| car         | 0.16  | 0.16  |
| Truck       | 74.56 | 89.67 |
+-------------+-------+-------+

Summary:

+--------+-------+-------+-------+
| Scope  | mIoU  | mAcc  | aAcc  |
+--------+-------+-------+-------+
| global | 46.41 | 58.69 | 97.02 |
+--------+-------+-------+-------+

iter_110000 on test dataset:

+-------------+-------+-------+
| Class       | IoU   | Acc   |
+-------------+-------+-------+
| background  | 63.53 | 64.21 |
| white_lane  | 0.0   | 0.0   |
| yellow_lane | 0.0   | 0.0   |
| trans_lane  | 0.73  | 100.0 |
| cone        | 0.0   | 0.0   |
| car         | 0.0   | 0.0   |
| Truck       | 49.62 | 50.26 |
+-------------+-------+-------+

Summary:

+--------+-------+-------+-------+
| Scope  | mIoU  | mAcc  | aAcc  |
+--------+-------+-------+-------+
| global | 16.27 | 30.64 | 61.83 |
+--------+-------+-------+-------+

iter_110000 on train dataset:

+-------------+-------+-------+
| Class       | IoU   | Acc   |
+-------------+-------+-------+
| background  | 63.74 | 64.42 |
| white_lane  | 0.0   | 0.0   |
| yellow_lane | 0.0   | 0.0   |
| trans_lane  | 0.66  | 99.94 |
| cone        | 0.0   | 0.0   |
| car         | 0.0   | 0.0   |
| Truck       | 49.04 | 49.46 |
+-------------+-------+-------+

Summary:

+--------+-------+-------+-------+
| Scope  | mIoU  | mAcc  | aAcc  |
+--------+-------+-------+-------+
| global | 16.21 | 30.54 | 62.01 |
+--------+-------+-------+-------+

At iter_110000, performance on both the train set and the test set collapsed, with most classes dropping to 0. I want to know why and how to improve it. Thank you for your help!
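As a sanity check on these numbers, the summary mIoU and mAcc can be recomputed as unweighted means of the per-class columns, and a per-class precision can be derived from IoU and Acc. This is a sketch under one stated assumption: that the per-class "Acc" column is recall, TP / (TP + FN). The helper names are hypothetical; the values are copied from the two test-set tables above.

```python
# Per-class (IoU, Acc) in percent, copied from the test-set tables in this thread.
ITER_100000_TEST = {
    "background": (96.83, 98.7),
    "white_lane": (34.82, 36.14),
    "yellow_lane": (59.98, 70.05),
    "trans_lane": (33.47, 47.82),
    "cone": (20.61, 67.83),
    "car": (0.0, 0.0),
    "Truck": (74.57, 89.85),
}
ITER_110000_TEST = {
    "background": (63.53, 64.21),
    "white_lane": (0.0, 0.0),
    "yellow_lane": (0.0, 0.0),
    "trans_lane": (0.73, 100.0),
    "cone": (0.0, 0.0),
    "car": (0.0, 0.0),
    "Truck": (49.62, 50.26),
}


def mean_metrics(table):
    """Unweighted means over classes: (mIoU, mAcc), both in percent."""
    ious = [iou for iou, _ in table.values()]
    accs = [acc for _, acc in table.values()]
    return sum(ious) / len(ious), sum(accs) / len(accs)


def implied_precision(iou_pct, recall_pct):
    """Precision in percent implied by IoU and recall, using 1/IoU = 1/P + 1/R - 1.

    Returns None when there are no true positives, since precision is then
    not determined by these two numbers.
    """
    if iou_pct == 0 or recall_pct == 0:
        return None
    iou, recall = iou_pct / 100, recall_pct / 100
    return 100 / (1 / iou - 1 / recall + 1)


for label, table in [("iter_100000", ITER_100000_TEST), ("iter_110000", ITER_110000_TEST)]:
    miou, macc = mean_metrics(table)
    print(f"{label}: mIoU={miou:.2f} mAcc={macc:.2f}")
    for cls, (iou, acc) in table.items():
        print(f"  {cls}: implied precision = {implied_precision(iou, acc)}")
```

Under that assumption, a class with Acc near 100 but IoU below 1, as trans_lane shows at iter_110000, has almost no missed pixels but a very large number of false positives. This suggests the collapsed model predicts trans_lane for most pixels that belong to the other classes, rather than outputting nothing.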

mengxia1994, Aug 04 '22 06:08

There might be imbalanced labels in your dataset, with some labels such as background and trans_lane taking up a large proportion of each image. I suggest you use class_weight in the loss to alleviate this problem.

Moreover, you might google for a solution as this classical problem has been explored for a long time by many outstanding scientists.
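To illustrate the class_weight suggestion above, here is a minimal sketch that derives per-class weights from pixel counts and passes them to the loss config. The weighting scheme (median class frequency divided by each class's frequency, clipped at a maximum) is one common heuristic and not something prescribed in this thread. The pixel counts are hypothetical placeholders, and `median_frequency_weights` and `max_weight` are names made up for this example.

```python
from statistics import median

CLASSES = ["background", "white_lane", "yellow_lane", "trans_lane", "cone", "car", "Truck"]


def median_frequency_weights(pixel_counts, max_weight=10.0):
    """Weight each class by median_frequency / class_frequency, clipped at max_weight."""
    total = sum(pixel_counts)
    freqs = [count / total for count in pixel_counts]
    med = median(f for f in freqs if f > 0)
    weights = []
    for f in freqs:
        # Classes with no pixels at all get the maximum weight instead of dividing by zero.
        w = max_weight if f == 0 else min(med / f, max_weight)
        weights.append(round(w, 3))
    return weights


# HYPOTHETICAL pixel counts per class, for illustration only.
# Replace them with counts measured on your own training labels.
counts = [9_000_000, 150_000, 120_000, 60_000, 30_000, 5_000, 400_000]
class_weight = median_frequency_weights(counts)

# Loss fragment for the decode head in an mmsegmentation config
# (CrossEntropyLoss accepts a class_weight list, one entry per class).
loss_decode = dict(
    type="CrossEntropyLoss",
    use_sigmoid=False,
    loss_weight=1.0,
    class_weight=class_weight,
)

for name, w in zip(CLASSES, class_weight):
    print(f"{name}: {w}")
```

Clipping keeps a nearly absent class (such as car in the tables above) from receiving an extreme weight, which can itself destabilize training. The order of `class_weight` must match the class order of the dataset.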

MeowZheng, Aug 05 '22 06:08