
Evaluation of 3 px or 5% Error on KITTI 2015 (Table 4)

Open Miaowei-HNU opened this issue 2 years ago • 4 comments

Hello, I would like to ask whether the results in Table 4 come from the model uploaded to the KITTI website for testing. If not, how do I calculate them? Also, does bg refer to the occluded area and fg to the non-occluded area?

Miaowei-HNU avatar Jul 29 '22 02:07 Miaowei-HNU

Hi @Miaowei-HNU, these results are on the KITTI test data and were calculated by the KITTI website.

‘bg’ refers to background. ‘fg’ refers to foreground.
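For anyone who wants a rough local estimate on data where ground truth is available, here is a minimal sketch of the outlier metric, assuming the usual KITTI 2015 convention that a pixel is an outlier when its disparity error exceeds both 3 px and 5% of the ground truth. The function name `d1_error` and the boolean `fg_mask` are hypothetical and not taken from this repository; the official Table 4 numbers still come from the KITTI server.

```python
import numpy as np

def d1_error(disp_pred, disp_gt, region_mask=None, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of outlier pixels among valid pixels (optionally within a region)."""
    # Assumption: pixels without ground truth are stored as 0 in disp_gt.
    valid = disp_gt > 0
    if region_mask is not None:
        valid = valid & region_mask
    n_valid = int(valid.sum())
    if n_valid == 0:
        return float("nan")
    err = np.abs(disp_pred - disp_gt)
    # Outlier only if the error exceeds BOTH the absolute and the relative threshold.
    outlier = (err > abs_thresh) & (err > rel_thresh * disp_gt)
    return 100.0 * float((outlier & valid).sum()) / n_valid

# Toy data: one pixel has no ground truth (0) and is ignored.
disp_gt = np.array([[10.0, 100.0, 0.0], [50.0, 20.0, 40.0]])
disp_pred = np.array([[14.0, 104.0, 7.0], [50.5, 20.0, 30.0]])
fg_mask = np.array([[True, False, False], [False, True, True]])  # hypothetical foreground mask

print("all:", d1_error(disp_pred, disp_gt))
print("fg: ", d1_error(disp_pred, disp_gt, fg_mask))
print("bg: ", d1_error(disp_pred, disp_gt, ~fg_mask))
```

Note that the pixel with ground truth 100 and error 4 px is not an outlier, because 4 px is below 5% of 100.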

mli0603 avatar Jul 29 '22 15:07 mli0603

Thank you for your reply

Miaowei-HNU avatar Jul 30 '22 02:07 Miaowei-HNU

Hi @mli0603, my fine-tuning results are close to yours, but L1_raw is always very high. Is L1_raw necessary? From the code, the only difference between L1_raw and L1 seems to be that they use disp_pred at different resolutions.

Miaowei-HNU avatar Aug 03 '22 04:08 Miaowei-HNU

Hi @Miaowei-HNU, L1_raw measures the raw cross-attention disparity at a lower resolution, and ideally it should be as low as L1. On KITTI 2015, however, we have found that the occlusion mask is ill-posed (see our follow-up paper at ECCV). The large error you see is therefore mostly in the occlusion region (you can also visualize the raw disparity to see what is going on).

The context adjustment layer learns to smooth out the occlusion errors in the raw disparity map, which leads to a much lower L1 error in the final estimate.

What does this mean? KITTI 2015 evaluates our approach unfairly: STTR has to unlearn the "correct" estimate from the transformer and learn the "incorrect" estimate from the context adjustment layer.
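To check whether a high L1_raw is concentrated in the occlusion region, one option is to split the L1 error by an occlusion mask. The following is only a sketch with made-up numbers; `l1_error`, `occ_mask`, and the toy arrays are hypothetical and are not the repository's evaluation code.

```python
import numpy as np

def l1_error(disp_pred, disp_gt, mask):
    """Mean absolute disparity error over the pixels selected by a boolean mask."""
    if not mask.any():
        return float("nan")
    return float(np.abs(disp_pred - disp_gt)[mask].mean())

# Toy example: the last two pixels stand in for a region labelled as occluded.
disp_gt = np.array([[10.0, 20.0, 30.0, 40.0]])
disp_raw = np.array([[10.5, 20.5, 45.0, 60.0]])      # stand-in for a raw disparity map
occ_mask = np.array([[False, False, True, True]])    # hypothetical occlusion mask
valid = disp_gt > 0                                   # assumption: 0 marks missing ground truth

print("non-occluded L1:", l1_error(disp_raw, disp_gt, valid & ~occ_mask))
print("occluded L1:    ", l1_error(disp_raw, disp_gt, valid & occ_mask))
print("overall L1:     ", l1_error(disp_raw, disp_gt, valid))
```

If the non-occluded error is small while the overall error is large, the gap comes from the pixels marked as occluded, which matches the explanation above.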

I hope this helps.

mli0603 avatar Aug 10 '22 13:08 mli0603