mmsegmentation
mIoU and acc look normal for the first several thousand epochs but all become 0 after about 80000 epochs
I tried SegFormer b0, b0-tiny, and b1 to train on several custom datasets (different models for different purposes on different datasets). All of them have the same problem. For the first stretch of training, for example up to 20000 or 80000 epochs, the model keeps improving, but after that it just outputs 0 for every class. The mIoU and acc were still improving before that happened, and I believe the model was still underfitting. However, the zero outputs stop me from continuing. And it doesn't look like vanishing gradients, does it?
I suggest you save checkpoints at different times, e.g. at 1k, 10k, and 20k epochs, and check whether the model weights are abnormal.
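One way to run that check is sketched below. It scans each parameter for NaN/Inf values and unusually large magnitudes. The helper names (`to_floats`, `find_abnormal_params`, `load_state_dict`), the `max_abs` threshold, and the assumption that the checkpoint stores its weights under a `"state_dict"` key are illustrative, not taken from this thread; adjust them to your setup.

```python
import math


def to_floats(param):
    """Flatten a tensor-like object (or a plain list) into Python floats."""
    if hasattr(param, "flatten"):  # e.g. torch.Tensor or numpy.ndarray
        param = param.flatten().tolist()
    return [float(v) for v in param]


def find_abnormal_params(state_dict, max_abs=1e4):
    """Return the parameters that contain NaN/Inf or very large values.

    max_abs is an arbitrary threshold chosen for illustration.
    """
    report = {}
    for name, param in state_dict.items():
        values = to_floats(param)
        n_bad = sum(1 for v in values if not math.isfinite(v))
        largest = max((abs(v) for v in values if math.isfinite(v)), default=0.0)
        if n_bad or largest > max_abs:
            report[name] = {"non_finite": n_bad, "max_abs": largest}
    return report


def load_state_dict(path):
    """Load weights from a .pth file (assumes PyTorch is installed)."""
    import torch  # imported lazily so the helpers above work without it

    checkpoint = torch.load(path, map_location="cpu")
    # Assumption: weights sit under "state_dict"; fall back to the raw object.
    return checkpoint.get("state_dict", checkpoint)
```

Usage would look like `find_abnormal_params(load_state_dict("iter_110000.pth"))` for each saved checkpoint; an empty result means no parameter tripped either check, which does not by itself rule out a training problem.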
Thanks~ I did. I checked the checkpoint .pth files and compared the weights between the epoch with normal output and the abnormal one; they look OK to me. Please look at the following test results.
iter_100000 on test dataset:
+-------------+-------+-------+
| Class       | IoU   | Acc   |
+-------------+-------+-------+
| background  | 96.83 | 98.7  |
| white_lane  | 34.82 | 36.14 |
| yellow_lane | 59.98 | 70.05 |
| trans_lane  | 33.47 | 47.82 |
| cone        | 20.61 | 67.83 |
| car         | 0.0   | 0.0   |
| Truck       | 74.57 | 89.85 |
+-------------+-------+-------+
Summary:
+--------+-------+-------+-------+
| Scope  | mIoU  | mAcc  | aAcc  |
+--------+-------+-------+-------+
| global | 45.76 | 58.63 | 96.96 |
+--------+-------+-------+-------+

iter_100000 on train dataset:
+-------------+-------+-------+
| Class       | IoU   | Acc   |
+-------------+-------+-------+
| background  | 96.9  | 98.72 |
| white_lane  | 34.86 | 36.16 |
| yellow_lane | 60.27 | 70.55 |
| trans_lane  | 32.86 | 47.55 |
| cone        | 25.25 | 68.03 |
| car         | 0.16  | 0.16  |
| Truck       | 74.56 | 89.67 |
+-------------+-------+-------+
Summary:
+--------+-------+-------+-------+
| Scope  | mIoU  | mAcc  | aAcc  |
+--------+-------+-------+-------+
| global | 46.41 | 58.69 | 97.02 |
+--------+-------+-------+-------+

iter_110000 on test dataset:
+-------------+-------+-------+
| Class       | IoU   | Acc   |
+-------------+-------+-------+
| background  | 63.53 | 64.21 |
| white_lane  | 0.0   | 0.0   |
| yellow_lane | 0.0   | 0.0   |
| trans_lane  | 0.73  | 100.0 |
| cone        | 0.0   | 0.0   |
| car         | 0.0   | 0.0   |
| Truck       | 49.62 | 50.26 |
+-------------+-------+-------+
Summary:
+--------+-------+-------+-------+
| Scope  | mIoU  | mAcc  | aAcc  |
+--------+-------+-------+-------+
| global | 16.27 | 30.64 | 61.83 |
+--------+-------+-------+-------+

iter_110000 on train dataset:
+-------------+-------+-------+
| Class       | IoU   | Acc   |
+-------------+-------+-------+
| background  | 63.74 | 64.42 |
| white_lane  | 0.0   | 0.0   |
| yellow_lane | 0.0   | 0.0   |
| trans_lane  | 0.66  | 99.94 |
| cone        | 0.0   | 0.0   |
| car         | 0.0   | 0.0   |
| Truck       | 49.04 | 49.46 |
+-------------+-------+-------+
Summary:
+--------+-------+-------+-------+
| Scope  | mIoU  | mAcc  | aAcc  |
+--------+-------+-------+-------+
| global | 16.21 | 30.54 | 62.01 |
+--------+-------+-------+-------+
The performance of iter_110000 dropped sharply on both the train set and the test set, with most classes at zero. I want to know why this happens and how to fix it. Thank you for your help!
There might be imbalanced labels in your dataset: some labels, like background and trans_lane, take up a large proportion of an image. I suggest you use class_weight in the loss to alleviate this problem.
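As a rough sketch of that suggestion, the snippet below derives inverse-frequency weights from per-class pixel counts and passes them to the decode head's loss through `class_weight`. The pixel counts are hypothetical placeholders, not measured from the dataset in this thread, and `inverse_frequency_weights` and its `max_weight` clip are illustrative helpers rather than part of mmsegmentation. The config fragment assumes the head uses `CrossEntropyLoss`.

```python
def inverse_frequency_weights(pixel_counts, max_weight=None):
    """Weight each class by total / (num_classes * count).

    Rare classes get larger weights; classes with no pixels get 0.
    max_weight optionally clips very large weights, which can otherwise
    destabilize training.
    """
    total = sum(pixel_counts)
    num_classes = len(pixel_counts)
    weights = []
    for count in pixel_counts:
        weight = total / (num_classes * count) if count > 0 else 0.0
        if max_weight is not None:
            weight = min(weight, max_weight)
        weights.append(weight)
    return weights


# Hypothetical pixel counts, in class order: background, white_lane,
# yellow_lane, trans_lane, cone, car, Truck. Replace with real statistics.
pixel_counts = [9_000_000, 150_000, 120_000, 50_000, 20_000, 10_000, 650_000]
class_weight = inverse_frequency_weights(pixel_counts, max_weight=10.0)

# Config fragment (assumption: merged on top of your existing model config).
model = dict(
    decode_head=dict(
        loss_decode=dict(
            type="CrossEntropyLoss",
            use_sigmoid=False,
            loss_weight=1.0,
            class_weight=class_weight,  # one weight per class, in class order
        )
    )
)
```

Inverse frequency is only one weighting scheme; other choices (for example median-frequency balancing) may suit a given dataset better.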
Moreover, you might google for a solution, as this classic problem has been studied for a long time by many outstanding scientists.