
Clarification on v2 breaking change for sensitivity parameter, and practical implications for long-term monitoring projects?

Open cbalantic opened this issue 5 months ago • 6 comments

I'm interested in learning more about the v2 breaking change described as "change how sensitivity works" (#578 and #579). What does this change mean?

I see that the allowable sensitivity input values are now restricted to the range 0.75 to 1.25, whereas BirdNET-Analyzer v1 allowed values from 0.5 to 1.5.

I also have a few practical questions relevant to anyone who does long-term annual monitoring and data processing through BirdNET:

  • If we used sensitivity = 1 in v1, should users expect no change in results when using sensitivity = 1 in v2? Based on my initial tests, it looks like sensitivity = 1 from BirdNET-Analyzer v1 produces the same results as sensitivity = 1 from BirdNET-Analyzer v2, so that's a relief! Please correct me if I'm wrong, or if that wasn't the intended effect.
  • Is there any way to maintain backward compatibility for sensitivity values that aren't 1? We have an active annual monitoring project where we use model version 2.4, BirdNET-Analyzer v1, and sensitivity = 1.5. Is there any equivalent sensitivity value that I could use to achieve the same results with BirdNET-Analyzer v2? Based on initial tests, and the fact that it's a breaking change, I'm suspecting the answer is no, but I hope I'm wrong!

cbalantic avatar Jul 29 '25 21:07 cbalantic

OK, I'll probably add some extended explanation to the docs at some point, since this is something people ask frequently. For now, here's a bit of context and our thought process:

The classification layer outputs logits—raw values indicating how strongly the model activates for each class (neuron activation—each neuron maps to one class). In multi-label tasks (where multiple classes can be present), we apply a sigmoid function to convert these logits into scores between 0 and 1. Logits below 0 map to scores below 0.5, and those above 0 to scores above 0.5.

Originally, the sensitivity setting changed the slope of the sigmoid: higher sensitivity flattened it, making the model more responsive to low activations, but also (counterintuitively) requiring stronger activation to reach very high scores (like 0.999).

In the new version, we instead shift the sigmoid left (for higher sensitivity) or right (for lower sensitivity), while keeping its slope fixed. This means:

  • Higher sensitivity = less activation needed to reach a score of 0.5
  • Lower sensitivity = more activation needed

A sensitivity of 1.0 keeps the standard sigmoid and ensures compatibility with the old version. Aside from that, sensitivity values are incompatible between versions and denote a very different calculation, which is why it's a breaking change.
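
To make the difference concrete, here's a rough numerical sketch of the two behaviors (not the exact released code; how the user-facing sensitivity value maps to the slope or to the shift is assumed here purely for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Old (v1-style, illustrative): sensitivity changes the slope of the sigmoid.
# Slope = 2 - sensitivity is an assumption for illustration, chosen so that
# sensitivity > 1 flattens the curve as described above.
def score_slope(logit, sensitivity=1.0):
    return sigmoid((2.0 - sensitivity) * logit)

# New (v2-style, illustrative): sensitivity shifts the sigmoid left/right
# while the slope stays fixed. Shift = sensitivity - 1 is again an assumption.
def score_shift(logit, sensitivity=1.0):
    return sigmoid(logit + (sensitivity - 1.0))

logits = np.array([-4.0, 0.0, 4.0])
print(score_slope(logits, 1.0), score_shift(logits, 1.0))    # identical at sensitivity = 1
print(score_slope(logits, 1.25), score_shift(logits, 1.25))  # diverge for any other value

With the slope-based version, sensitivity = 1.25 raises the score of the weak (-4) activation but lowers the score of the strong (+4) one (the counterintuitive effect mentioned above); the shift-based version simply moves every score up the curve.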

kahst avatar Jul 30 '25 19:07 kahst

@kahst excellent explanation! Thanks for the clarification.

Does changing the "sensitivity" change the model performance (precision/recall), or just the relationship between the confidence scores and the true underlying probability that a window contains the predicted signal? For example, can the performance (precision and recall) at confidence threshold A now be achieved with confidence threshold B for a given signal when I change the "sensitivity"?

abfleishman avatar Jul 30 '25 20:07 abfleishman

Thanks, @kahst, that is a helpful technical background. In layperson's terms, it sounds like the change makes the sensitivity parameter behave more consistently in whichever direction the user chooses.

Functionally, for any projects where we weren't using sensitivity = 1, this means we'll need to redo our manual segment review/verifications and compute new detection performance evaluation metrics in order to stay current with the BirdNET-Analyzer GUI v2 (even though we're still using model v2.4). We're exploring the use of BirdNET in production for annual monitoring, so verification effort has already been invested in choosing defensible confidence thresholds for various projects. For anyone in this situation, it's important to understand that those manual review efforts would need to be redone for any case where sensitivity != 1. Thank you for the clarification and reply!

cbalantic avatar Jul 30 '25 21:07 cbalantic

@cbalantic

When you say:

this means we'll need to redo our manual segment review/verifications and compute new detection performance evaluation metrics

Given the background you provided, I don't believe this follows. As your manual review/verification effort has been to select a "defensible confidence threshold", the use of sensitivity values != 1 just changes the absolute value of that threshold. The key point here is that sensitivity adjustments do not change the order (or rank) of species' predictions. If Species A had a higher raw activation (logit) than Species B, its final score will always be higher regardless of sensitivity settings¹.
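
To see why, here's a toy check (with made-up logits, and slope/shift values chosen only for illustration): any increasing sigmoid, whatever its slope or shift, maps a larger logit to a larger score, so the ranking never flips.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up logits for two species in the same window:
# Species A activates more strongly than Species B.
logit_a, logit_b = 2.1, 0.4

for slope, shift in [(1.0, 0.0), (0.5, 0.0), (1.0, 0.5), (1.0, -0.5)]:
    score_a = sigmoid(slope * logit_a + shift)
    score_b = sigmoid(slope * logit_b + shift)
    # Absolute scores change, but A always stays above B
    print(slope, shift, round(score_a, 3), round(score_b, 3), score_a > score_b)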

Whilst @kahst is correct in saying it's a breaking change, that is because you will not be able to replicate the same scores across the range of 'confidence' scores using old vs. new sensitivity calculations. However, results at any given threshold (i.e. predictions meeting that threshold) at sensitivity = 1 can be replicated exactly with an adjusted threshold when sensitivity != 1.

You just need to figure out the adjustment needed.

Incidentally, it also follows that sensitivity does not improve or degrade model performance per se, it simply changes model performance at a given threshold. And that leads me to wonder, do you really need to change sensitivity when you'll get the same outcome with a different threshold?

As you say the sensitivity you used with the old formula was 1.5, the following will give you the equivalent threshold to use (and the same results) as if you had not used a sensitivity value, or more precisely, had used the default of sensitivity = 1:


import numpy as np

sensitivity = 1.5
threshold = ...  # <whatever threshold you landed on>

def convert_threshold_old_to_new(threshold, sensitivity):
    # equivalent to logit(threshold) / sensitivity
    sensitivity = -sensitivity
    return (1 / sensitivity) * np.log((1 - threshold) / threshold)

print(convert_threshold_old_to_new(threshold, sensitivity))

This will work for other sensitivity values too.

¹ Because raw logit values are clipped in the original sensitivity formula, this holds true for scores between c. 0.0000003 and 0.999999.

Mattk70 avatar Jul 31 '25 07:07 Mattk70

Thank you for the explanation, @Mattk70 -- that is great news and will be a huge time-saver. Some of these implications don't become clear for us on-the-ground practitioners until there are concrete examples. That explanation will be helpful to add to the docs.

I'm wondering if this solution also answers @abfleishman's question above, or if there is more nuance that I'm not catching. For projects where practitioners are choosing defensible confidence score thresholds, I know it's a common practice to use performance evaluation metrics like precision/recall/F1 score, or to compute probabilistic score thresholds based on logistic regression (as in Wood and Kahl 2024). I'm wondering if the same advice applies regardless, or if there are any subtleties practitioners should keep in mind.
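
For concreteness, here's a rough sketch of the logistic-regression route mentioned above (in the spirit of Wood and Kahl 2024, though the exact recipe there may differ): regress expert verification outcomes on logit-transformed confidence scores, then solve for the score at which pr(True) reaches the chosen target. All values, variable names, and the 0.99 target below are made-up illustrations.

import numpy as np
from sklearn.linear_model import LogisticRegression

def logit(p):
    return np.log(p / (1 - p))

# Expert-verified detections: BirdNET confidence score + verification outcome.
# These values are invented purely for illustration.
scores = np.array([0.15, 0.35, 0.55, 0.72, 0.85, 0.93, 0.98, 0.99])
verified = np.array([0, 0, 1, 0, 1, 1, 1, 1])  # 1 = confirmed true positive on review

# Regress verification outcome on the logit-transformed confidence score
model = LogisticRegression().fit(logit(scores).reshape(-1, 1), verified)
b0, b1 = model.intercept_[0], model.coef_[0][0]

# Invert pr(True) = sigmoid(b0 + b1 * logit(score)) at the chosen target,
# assuming b1 > 0 so pr(True) increases with the confidence score.
target = 0.99
score_threshold = 1 / (1 + np.exp(-((logit(target) - b0) / b1)))
print(round(float(score_threshold), 3))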

Incidentally, it also follows that sensitivity does not improve or degrade model performance per se, it simply changes model performance at a given threshold. And that leads me to wonder, do you really need to change sensitivity when you'll get the same outcome with a different threshold?

I agree with this sentiment. As a practitioner, I don't care much about what the sensitivity value is. But I care very much that I can defensibly quantify the uncertainty and credibility of the results, and that I have a solid data management and metadata tracking system so that I am not accidentally applying differently behaving parameters to new monitoring data and then spuriously comparing that with previous years' data. That's why this conversation is important, and thank you for engaging! I wonder if the subtext of your question is, "why bother with a sensitivity parameter at all?", which is an interesting question that I don't feel qualified to answer.

cbalantic avatar Jul 31 '25 15:07 cbalantic

However, results at any given threshold (i.e. predictions meeting that threshold) at sensitivity = 1 can be replicated exactly with an adjusted threshold when sensitivity != 1.

I'm following up with some observations. I tested applying this solution to a real-world phenology monitoring dataset where we track metrics such as date of first and last calling, total days of calling, and date of peak detection activity. In this dataset, we were monitoring for a single species and used a sensitivity of 1.5 in BirdNET V1. Based on expert verifications of detections, we selected a BirdNET "confidence threshold" value below which to discard detections, based on a logistic regression pr(True) = 0.99.

I applied the above formula to choose an adjusted confidence threshold that would, hopefully, be equivalent to choosing sensitivity = 1 for data processed through BirdNET V1. I'm comparing the results of the following combinations:

  • Results A: BirdNET V1, sensitivity = 1.5, absolute value of new confidence threshold identified via this formula to create a dataset equivalent to sensitivity = 1
  • Results B: BirdNET V2, sensitivity = 1, same confidence threshold as in Results A, all other parameters the same as Results A

(@Mattk70 -- is this the type of comparison you were intending?)

I am finding that the results are generally very similar after applying the confidence threshold shift, but there are occasional slight differences in total number of daily detections (e.g., median 8.5 fewer detections per day for Results B vs. A), sometimes leading to differences in identified peak dates, total days of calling, and other metrics.

My applied observation is that even after attempting to apply the adjusted "sensitivity = 1" threshold, this change still functionally acts as a change to the detection model, which slightly affects aggregated data and downstream modeling uses. Maybe there's a bug in my workflow, but in this case, from a long-term data management and annual monitoring perspective, I'm considering that the cleanest solution may be to reprocess the entire dataset through V2 (with sensitivity = 1). Just wanted to report back with some impressions in case I've misinterpreted anything above!

cbalantic avatar Aug 14 '25 15:08 cbalantic