
Calibrate confidence scores

Open sfmig opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe. We usually interpret confidence scores as a proxy for the error in the keypoint predictions. However, it is well known that neural networks tend to be "overly confident" in their predictions. For example, for the multiclass classification case, reference [1] says:

the softmax output of modern neural networks, which typically is interpreted as a categorical distribution in classification, is poorly calibrated.

It would be very useful to be able to produce calibrated confidence scores of the keypoint predictions. That would allow us to compare results across frameworks, better filter high/low confidence values, and better interpret model performance.

Describe the solution you'd like We could consider having a method in movement that calibrates confidence scores.

We could implement something similar to what keypoint-moseq does. They have functionality to fit a linear model to the relationship between keypoint error and confidence:

[the function] creates a widget for interactive annotation in jupyter lab. Users mark correct keypoint locations for a sequence of frames, and a regression line is fit to the log(confidence), log(error) pairs obtained through annotation. The regression coefficients are used during modeling to set a prior on the noise level for each keypoint on each frame.
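For concreteness, here is a minimal sketch (not keypoint-moseq's actual code) of the kind of log-log regression described above, assuming we already have per-keypoint confidences and pixel errors from manual annotation; the function names and toy values are purely illustrative:

```python
import numpy as np

def fit_confidence_error_regression(confidences, errors, eps=1e-6):
    """Fit log(error) ~ slope * log(confidence) + intercept."""
    x = np.log(np.asarray(confidences) + eps)
    y = np.log(np.asarray(errors) + eps)
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

def predicted_error(confidences, slope, intercept, eps=1e-6):
    """Map raw confidences to an expected error via the fitted line."""
    return np.exp(slope * np.log(np.asarray(confidences) + eps) + intercept)

# Toy example: lower confidence should correspond to larger error
conf = np.array([0.95, 0.80, 0.30, 0.10])
err = np.array([1.2, 2.5, 8.0, 20.0])  # pixels, from manual annotation
slope, intercept = fit_confidence_error_regression(conf, err)
print(predicted_error(conf, slope, intercept))
```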

Describe alternatives you've considered

Additional context Nice explanations for the case of classification (note that pose estimation is a regression problem, not a classification one):

  • https://geoffpleiss.com/blog/nn_calibration.html
  • https://scikit-learn.org/stable/modules/calibration.html

From a quick search I found:

  • [1] this paper, on the calibration of human pose estimation. They propose a neural network that learns specific adjustments for a pose estimator. Seems out of scope for movement but may be a useful read to understand the problem better.
  • this paper, on calibration for object detection, could be similarly useful.

sfmig avatar Aug 16 '24 16:08 sfmig

This EuroSciPy tutorial may be useful for this work.

sfmig avatar Aug 27 '24 15:08 sfmig

Note that pose estimation is a regression problem, not a classification one, so we probably want to look into ways of applying the same calibration ideas to a regression setting in a reasonable way.

For example, maybe we can "transform" the problem into a classification one by establishing that a keypoint is correctly predicted if it is close enough to the ground truth label. This seems reasonable at first glance?
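To make this idea concrete, here is a hedged sketch using dummy data: a keypoint counts as "correct" if it lands within a pixel threshold of the ground truth, and standard classification-calibration tools are then applied to the raw scores. The 5-pixel threshold and the use of scikit-learn's `calibration_curve` / `IsotonicRegression` are illustrative choices, not an agreed design:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

def keypoint_correctness(pred_xy, true_xy, threshold=5.0):
    """Binary label: 1 if the prediction lies within `threshold` pixels of ground truth."""
    dist = np.linalg.norm(pred_xy - true_xy, axis=-1)
    return (dist <= threshold).astype(int)

# Dummy data standing in for real predictions and annotations
n = 500
conf = rng.uniform(0, 1, n)                       # raw confidence scores
pred_xy = rng.normal(size=(n, 2))
true_xy = pred_xy + rng.normal(scale=10 * (1 - conf)[:, None], size=(n, 2))
correct = keypoint_correctness(pred_xy, true_xy, threshold=5.0)

# Reliability diagram: for a well-calibrated score, prob_true ~ prob_pred in each bin
prob_true, prob_pred = calibration_curve(correct, conf, n_bins=10)

# Isotonic regression: monotone remapping of raw confidence onto the
# empirical probability of being "correct" under the chosen threshold
calibrator = IsotonicRegression(out_of_bounds="clip").fit(conf, correct)
calibrated_conf = calibrator.predict(conf)
```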

sfmig avatar Mar 03 '25 15:03 sfmig

I think transforming the problem into a classification one is a simple and intuitive approach, but it introduces additional hyperparameters (such as the threshold for deciding whether a keypoint is "close enough" to the ground truth), and discretising the error also discards information about how far off a prediction actually was.

An alternative approach could be to first use regression to model the relationship between the model-provided confidence score and the actual prediction error. Since the prediction error is itself the quantity we want the calibrated score to reflect, this would allow better comparisons across different pose estimation models.

If a normalized confidence score within $[0,1]$ is needed, we could design a transformation function—for instance, something like

$c' = \frac{1}{\text{error} + 1}$

This way, the predicted error can be mapped to a calibrated confidence score in a meaningful way.
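As a rough illustration of this two-step idea (fit an error-vs-confidence regression, then map the predicted error through $c' = 1/(\text{error}+1)$), assuming a simple linear fit for the first step; none of this is existing movement API:

```python
import numpy as np

def fit_error_model(confidences, errors):
    """Simple linear model: error ~ a * confidence + b (a log-log fit would also work)."""
    a, b = np.polyfit(confidences, errors, deg=1)
    return a, b

def calibrated_confidence(confidences, a, b):
    """Predict the error from the raw confidence, then map it into [0, 1] via c' = 1/(error + 1)."""
    predicted_error = np.clip(a * np.asarray(confidences) + b, 0.0, None)
    return 1.0 / (predicted_error + 1.0)

# Toy usage
conf = np.array([0.9, 0.5, 0.1])
err = np.array([2.0, 6.0, 15.0])  # pixels
a, b = fit_error_model(conf, err)
print(calibrated_confidence(conf, a, b))
```

One caveat worth noting: because the error is in pixels, the resulting $c'$ depends on image scale, so some normalisation (e.g. by a reference body length) might be needed for cross-dataset comparisons.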

At this point, these ideas are based on conceptual reasoning. To determine which calibration method works best, I think it would be valuable to construct a benchmark. By evaluating downstream tasks that rely on confidence scores, we can better assess which calibration method leads to improved results.

Angelneer926 avatar Mar 23 '25 19:03 Angelneer926

Hi @Angelneer926,

Thanks for sharing these thoughts, benchmarking sounds like a good idea - could you expand on the kind of downstream tasks you are thinking of benchmarking against?

sfmig avatar Mar 25 '25 11:03 sfmig

I’ve been reviewing the movement code to understand where confidence scores are currently used. So far, I’ve noticed that in the interpolation process, low-confidence coordinates are filtered out before interpolation is applied. If we have access to the true coordinates, one approach could be to assess how well the confidence scores perform by comparing the interpolated results against the true values.
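As a rough sketch of that benchmark (written with plain pandas rather than movement's own filtering functions, to keep it self-contained; the threshold and variable names are placeholders):

```python
import numpy as np
import pandas as pd

def interpolation_benchmark(xy, conf, true_xy, threshold):
    """Mask keypoints with confidence below `threshold`, interpolate the gaps
    over time, and return the mean pixel error of the interpolated values
    against ground truth, measured only at the frames that were interpolated."""
    df = pd.DataFrame(xy, columns=["x", "y"])
    low_conf = conf < threshold
    df[low_conf] = np.nan                                         # drop low-confidence frames
    interp = df.interpolate(method="linear", limit_direction="both")
    err = np.linalg.norm(interp.to_numpy() - true_xy, axis=1)
    return err[low_conf].mean()

# A calibration method whose scores better identify unreliable predictions
# should yield a lower error here for the same fraction of frames dropped.
```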

Additionally, I’m thinking of using anomaly detection or behavior classification accuracy as a benchmark for evaluating confidence scores. Another option could be to leverage high-confidence keypoints during frame-to-frame matching and assess the confidence score quality based on pose tracking accuracy.

Angelneer926 avatar Mar 26 '25 01:03 Angelneer926