On validation metrics and thresholds
First of all, nice job!
I noticed in the validation arena you're using my suggested thresholds for my models, and a "default" one for yours. That's doing your work a disservice. I think a fairer way to compare the models would be to try and find some fixed performance point, and see how the other metrics fare.
For my models, for example, I used to choose (by bisection) a threshold where micro-averaged recall and precision matched: if both were higher than for the last model, then I had a better model. You could do the same, or bisect towards a threshold that gives a desired precision and evaluate recall there, for example. This also has the side effect of being fairer to augmentations like mixup, which skew prediction confidences towards lower values.
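For illustration, here is a minimal sketch of that bisection in NumPy. The names (`probs`, `labels`, `balanced_threshold`) are hypothetical and not taken from either codebase; it assumes sigmoid probabilities of shape (samples, tags) and 0/1 ground-truth labels of the same shape.

```python
import numpy as np

def micro_precision_recall(probs, labels, thresh):
    """Micro-averaged precision/recall: pool TP/FP/FN over every (image, tag) pair."""
    preds = probs >= thresh
    labels = labels.astype(bool)
    tp = np.sum(preds & labels)
    fp = np.sum(preds & ~labels)
    fn = np.sum(~preds & labels)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return precision, recall

def balanced_threshold(probs, labels, lo=0.0, hi=1.0, iters=30):
    """Bisect towards the threshold where micro precision and recall match.
    Raising the threshold tends to raise precision and lower recall, so
    (precision - recall) is treated as increasing in the threshold."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        p, r = micro_precision_recall(probs, labels, mid)
        if p < r:
            lo = mid  # precision still below recall -> raise the threshold
        else:
            hi = mid
    return (lo + hi) / 2.0
```

The same loop works for the second variant: change the stopping condition so it homes in on a threshold whose precision equals the target value, then report the recall you get there.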
If I may go on a slight tangent about the discrepancy between my stated scores and the ones in the Arena: I used to use micro-averaging, while you're calculating macro averages. Definitely keep using macro averaging for the metrics; I started using it too in my newer codebase over at https://github.com/SmilingWolf/JAX-CV (posting the repo in case you consider using it if you decide to apply to TRC).
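Since that micro/macro difference is exactly where the score discrepancy comes from, here is the contrast in the same sketchy style (again with hypothetical names, not code from either repo): micro averaging pools counts over all pairs, while macro averaging computes the metric per tag and then takes a plain mean, so rare tags weigh as much as frequent ones and the resulting number is usually lower.

```python
import numpy as np

def macro_precision_recall(probs, labels, thresh):
    """Macro-averaged precision/recall: per-tag metrics first, then an unweighted
    mean over tags, so rare tags count as much as frequent ones."""
    preds = probs >= thresh
    labels = labels.astype(bool)
    tp = np.sum(preds & labels, axis=0)   # per-tag true positives
    fp = np.sum(preds & ~labels, axis=0)  # per-tag false positives
    fn = np.sum(~preds & labels, axis=0)  # per-tag false negatives
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    return precision.mean(), recall.mean()
```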