A bit confused about the `Gram` implementation
https://github.com/Jingkang50/OpenOOD/blob/main/openood/postprocessors/gram_postprocessor.py#L115
I wonder why the deviation is used as the confidence score directly. Isn't it the case that the larger the deviation, the more likely the sample is OOD?
I checked the original implementation here: https://github.com/VectorInstitute/gram-ood-detection/blob/master/ResNet_Cifar10.ipynb and found that it actually uses the negative of the deviations when computing the metrics.
But the reported metrics look OK, so I am quite confused. Am I missing something?
It's indeed confusing. Like you said, we are getting >50% AUROC for Gram in all our experiments, even though applying the negation seems like the correct thing to do but leads to <50% AUROC. I will do some investigation when I'm available.
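To make the sign convention concrete, here is a minimal sketch (not the actual OpenOOD code; the helper `auroc` and the synthetic deviations are illustrative) of why negating the scores flips AUROC to exactly 1 − AUROC, assuming OpenOOD's convention that a higher confidence score means more likely in-distribution:

```python
import numpy as np

def auroc(scores_id, scores_ood):
    """AUROC where a higher score should indicate in-distribution (ID).

    Computed as the probability that a randomly chosen ID sample scores
    higher than a randomly chosen OOD sample (ties count 0.5), via a
    pairwise comparison matrix.
    """
    scores_id = np.asarray(scores_id, dtype=float)
    scores_ood = np.asarray(scores_ood, dtype=float)
    diff = scores_id[:, None] - scores_ood[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Synthetic Gram-style deviations: ID samples deviate little from the
# training-time Gram-matrix ranges, OOD samples deviate a lot.
rng = np.random.default_rng(0)
dev_id = rng.normal(1.0, 0.5, 100)
dev_ood = rng.normal(3.0, 0.5, 100)

a_raw = auroc(dev_id, dev_ood)    # raw deviation as confidence: < 0.5
a_neg = auroc(-dev_id, -dev_ood)  # negated deviation: > 0.5
assert a_raw < 0.5 < a_neg
assert abs(a_raw + a_neg - 1.0) < 1e-9  # negation gives exactly 1 - AUROC
```

So if the detector itself were working, the negated deviation would score above 50%; getting below 50% after negation suggests the bug is upstream of the sign, not the sign alone.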
Hi, I worked on the Gram matrix method and I seem to have fixed the implementation here.
The results on CIFAR10 obtained with the current gram-matrix implementation are as follows:
| | FPR@95 | AUROC | AUPR_IN | AUPR_OUT | ACC |
| --- | --- | --- | --- | --- | --- |
| cifar100 | 91.68 ± 2.24 | 58.33 ± 4.49 | 56.74 ± 3.87 | 59.24 ± 4.62 | 95.06 ± 0.30 |
| tin | 90.06 ± 1.59 | 58.98 ± 5.19 | 61.65 ± 3.75 | 55.89 ± 5.56 | 95.06 ± 0.30 |
| nearood | 90.87 ± 1.91 | 58.66 ± 4.83 | 59.19 ± 3.79 | 57.57 ± 5.09 | 95.06 ± 0.30 |
| mnist | 70.30 ± 8.96 | 72.64 ± 2.34 | 36.92 ± 8.23 | 93.36 ± 1.21 | 95.06 ± 0.30 |
| svhn | 33.91 ± 17.35 | 91.52 ± 4.45 | 82.40 ± 8.85 | 96.62 ± 1.81 | 95.06 ± 0.30 |
| texture | 94.64 ± 2.71 | 62.34 ± 8.27 | 67.93 ± 5.60 | 55.93 ± 10.76 | 95.06 ± 0.30 |
| places365 | 90.49 ± 1.93 | 60.44 ± 3.41 | 26.94 ± 2.62 | 85.64 ± 1.31 | 95.06 ± 0.30 |
| farood | 72.34 ± 6.73 | 71.74 ± 3.20 | 53.55 ± 4.74 | 82.89 ± 3.14 | 95.06 ± 0.30 |
With the corrected implementation, I was able to get:
| | FPR@95 | AUROC | AUPR_IN | AUPR_OUT | ACC |
| --- | --- | --- | --- | --- | --- |
| cifar100 | 61.61 ± 0.82 | 84.61 ± 0.20 | 84.21 ± 0.20 | 83.75 ± 0.32 | 95.06 ± 0.30 |
| tin | 51.99 ± 1.16 | 87.16 ± 0.52 | 88.46 ± 0.41 | 84.34 ± 0.83 | 95.06 ± 0.30 |
| nearood | 56.80 ± 0.62 | 85.88 ± 0.35 | 86.33 ± 0.28 | 84.04 ± 0.56 | 95.06 ± 0.30 |
| mnist | 7.31 ± 1.02 | 97.57 ± 0.49 | 94.37 ± 0.85 | 99.48 ± 0.14 | 95.06 ± 0.30 |
| svhn | 6.67 ± 0.29 | 98.64 ± 0.02 | 96.73 ± 0.06 | 99.48 ± 0.04 | 95.06 ± 0.30 |
| texture | 14.86 ± 0.71 | 96.95 ± 0.11 | 97.99 ± 0.12 | 95.53 ± 0.09 | 95.06 ± 0.30 |
| places365 | 42.81 ± 2.19 | 89.56 ± 0.80 | 73.32 ± 1.45 | 96.53 ± 0.33 | 95.06 ± 0.30 |
| farood | 17.91 ± 0.70 | 95.68 ± 0.25 | 90.60 ± 0.35 | 97.75 ± 0.09 | 95.06 ± 0.30 |
When I used the same checkpoints with the code referred to by @SauceCat, I got marginally higher results for SVHN but did not test the other datasets. The new code is not fully polished but seems to be working as expected. I also did not run experiments with datasets other than CIFAR10 as InD.
Thank you for the OpenOOD benchmark and for considering the Gram matrix method for inclusion in it!
@chandramouli-sastry Thanks for sharing the results, and glad to see the much-improved numbers with the updated implementation. Would you mind opening a pull request for this? Meanwhile, we will update the Gram matrix results in both the paper and the leaderboard.
Thank you! I just created a pull request for your review.