On validation metrics and thresholds
First of all, nice job!
I noticed in the validation arena you're using my suggested thresholds for my models, and a "default" one for yours. That's doing your work a disservice. I think a fairer way to compare the models would be to try and find some fixed performance point, and see how the other metrics fare.
For my models, for example, I used to choose (by bisection) a threshold where micro-averaged recall and precision matched: if both were higher than for the last model, then I had a better model. You could do the same, or bisect towards a threshold that gives a desired precision and evaluate recall there, for example. This also has the side effect of being fairer to augmentations like mixup, which skew prediction confidences towards lower values.
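For illustration, here is a minimal sketch of that bisection in NumPy. The names (`probs`, `labels`, `balanced_threshold`) are hypothetical and not taken from either codebase; it assumes sigmoid probabilities of shape (samples, tags) and 0/1 ground-truth labels of the same shape.

```python
import numpy as np

def micro_precision_recall(probs, labels, thresh):
    """Micro-averaged precision/recall: pool TP/FP/FN over every (image, tag) pair."""
    preds = probs >= thresh
    labels = labels.astype(bool)
    tp = np.sum(preds & labels)
    fp = np.sum(preds & ~labels)
    fn = np.sum(~preds & labels)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return precision, recall

def balanced_threshold(probs, labels, lo=0.0, hi=1.0, iters=30):
    """Bisect towards the threshold where micro precision and recall match.
    Raising the threshold tends to raise precision and lower recall, so
    (precision - recall) is treated as increasing in the threshold."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        p, r = micro_precision_recall(probs, labels, mid)
        if p < r:
            lo = mid  # precision still below recall -> raise the threshold
        else:
            hi = mid
    return (lo + hi) / 2.0
```

The same loop works for the second variant: change the stopping condition so it homes in on a threshold whose precision equals the target value, then report the recall you get there.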
If I may go on a slight tangent about the discrepancy between my stated scores and the ones in the Arena: I used to use micro-averaging, while you're calculating macro averages. Definitely keep using macro averaging for the metrics; I started using it too in my newer codebase over at https://github.com/SmilingWolf/JAX-CV (posting the repo in case you consider using it if you decide to apply to TRC).
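Since that micro/macro difference is exactly where the score discrepancy comes from, here is the contrast in the same sketchy style (again with hypothetical names, not code from either repo): micro averaging pools counts over all pairs, while macro averaging computes the metric per tag and then takes a plain mean, so rare tags weigh as much as frequent ones and the resulting number is usually lower.

```python
import numpy as np

def macro_precision_recall(probs, labels, thresh):
    """Macro-averaged precision/recall: per-tag metrics first, then an unweighted
    mean over tags, so rare tags count as much as frequent ones."""
    preds = probs >= thresh
    labels = labels.astype(bool)
    tp = np.sum(preds & labels, axis=0)   # per-tag true positives
    fp = np.sum(preds & ~labels, axis=0)  # per-tag false positives
    fn = np.sum(~preds & labels, axis=0)  # per-tag false negatives
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    return precision.mean(), recall.mean()
```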