camtrap-dp
Rename classificationConfidence to classificationProbability, and note that it's usually omitted for human classifications
See #170
This should be revised as a quantitative measurement returned by a computer vision model. Qualitative descriptions are highly subjective and not very useful.
@ben-norton the text is entirely quantitative, with nothing qualitative. Do you perhaps have in mind that estimated probabilities are subjective/not useful? If so, I agree, and that's why I suggest this edit.
(It's possible (but rare) to derive probability estimates from manual labels, e.g. via voting or inter-rater reliability calculations.)
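As a minimal sketch of the voting approach, assuming each image has a list of labels from independent annotators (the function name and the sample labels are hypothetical), the share of votes for the majority label can serve as a rough probability estimate:

```python
from collections import Counter


def vote_probability(labels):
    """Return (majority_label, share_of_votes) for a list of manual labels.

    The share is only a crude probability estimate: it ignores differences
    in annotator skill and is coarse when there are few votes.
    """
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)


# Hypothetical votes from four annotators for one image.
votes = ["Vulpes vulpes", "Vulpes vulpes", "Vulpes vulpes", "Canis lupus"]
label, prob = vote_probability(votes)
print(label, prob)  # prints: Vulpes vulpes 0.75
```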
@danstowell Computer vision models return a vector of probabilities for every object a model has been trained to classify. You can set a reporting threshold (e.g., only results > 80%) to omit most of them. You end up with 4 or 5. If an expert confirms the top result, that leaves 1. The term 'estimated probability' is not a concise representation, nor is the term confidence. It's the probability, measured by a machine, that the model accurately classified the species in an image. I would suggest classifierProbability or something similar.
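The thresholding step described above could look roughly like the following. This is a sketch only: the function name, the labels and the score values are made up for illustration, and how many labels survive depends on the model and on the threshold chosen.

```python
def top_candidates(scores, threshold=0.8):
    """Keep the labels whose score exceeds the threshold, highest first."""
    kept = [(label, p) for label, p in scores.items() if p > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical per-class scores for one image (illustrative values only).
scores = {
    "Vulpes vulpes": 0.93,
    "Canis lupus": 0.04,
    "Felis catus": 0.02,
    "Meles meles": 0.01,
}

candidates = top_candidates(scores, threshold=0.8)
# The score of the retained top label is the value that would be
# reported as the probability of the classification.
best_label, best_score = candidates[0]
print(best_label, best_score)
```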
I would suggest adding #217 to this PR and updating the definition accordingly.
Thanks @kbubnicki - I agree and have pushed a commit to my PR which does this.
@ben-norton @danstowell @kbubnicki are you fine with naming it classificationProbability rather than classifierProbability? That keeps it in line with the names of related terms. Also, classifier is not referenced anywhere, and the definition describes it as:
Certainty of the classification. Expressed as a probability, with 1 being maximum confidence. For human classifications, omit this field (in CSV, an empty string) or use an approximate probability if available.
Rather than writing "For human classifications, omit this field (in CSV, an empty string) or use an approximate probability if available.", I would update to:
Certainty of the machine classification. Expressed as a probability, with 1 being the maximum confidence.
Yes.
Edit 1, to increase coverage of use cases: replace "For human classifications..." with "For qualitative techniques (e.g. visual observation by human)...".
Edit 2: replace probability with the data type. Instead of "Expressed as a probability," write "Probability should be expressed as a numerical value (maximum 4 decimal places)". The number of decimal places is not especially important, but the number represents a probability that is expressed as a numerical value.
Question, regarding "...approximate probability": how?
@ben-norton the current phrasing is in line with how I phrase other definitions, so I'd like to keep that structure:
Certainty of the machine classification. Expressed as a probability, with 1 being the maximum confidence.
- It does not list human classification probability and avoids the "approximate probability" issue.
- We typically don't express wishes regarding number of decimals. The data type is numerical, with min: 0, max: 1, which should be sufficient
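To illustrate what the numerical type with min 0 and max 1 implies for data, here is a small check in plain Python. It is a sketch, not the package's own validation: the function name is hypothetical, and treating an empty string as an accepted missing value assumes the field is optional.

```python
def is_valid_classification_probability(value):
    """Check a raw CSV string against a number type with min 0 and max 1.

    An empty string is treated as a missing value and accepted, on the
    assumption that the field is optional.
    """
    if value == "":
        return True
    try:
        number = float(value)
    except ValueError:
        return False
    # NaN and infinities fail this comparison and are rejected.
    return 0.0 <= number <= 1.0


for raw in ["0.9731", "1", "", "1.2", "-0.1", "high"]:
    print(repr(raw), is_valid_classification_probability(raw))
```

No limit on decimal places is enforced here, in line with the point that the range constraint alone should be sufficient.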
Any changes you still want included in the definition?
Good point -- I've edited classifierProbability to classificationProbability, thanks.
Regarding probabilities and humans:
Ben asks "how [to find approximate probabilities for human decisions]?" -- There are a few ways to do this, from self-asserted probabilities (not very reliable) to calibrations based on individual users' "skill" within a system (better). None of these lead to fantastic probability estimates, but it's not for us to rule this out entirely. It's very easy for data consumers to ignore any probabilities where classificationMethod=human.
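As a rough illustration of how a data consumer might ignore probabilities where classificationMethod=human, here is a sketch using only the standard library. The table excerpt, its values and the function name are hypothetical, and only the columns needed for the example are shown.

```python
import csv
import io

# Hypothetical excerpt of an observations table (illustrative values only).
OBSERVATIONS_CSV = """observationID,classificationMethod,classificationProbability
obs1,machine,0.97
obs2,human,
obs3,human,0.6
obs4,machine,0.41
"""


def machine_probabilities(csv_text):
    """Map observationID to probability, keeping machine classifications only."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["classificationMethod"] != "machine":
            continue  # skip human classifications, whatever they report
        if row["classificationProbability"] == "":
            continue  # field omitted (empty string in CSV)
        result[row["observationID"]] = float(row["classificationProbability"])
    return result


print(machine_probabilities(OBSERVATIONS_CSV))
```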
Remember that machine-estimated probabilities can often be terrible too!
I understand why @peterdesmet doesn't want to mention human probabilities. My opinion is: (a) it's sometimes possible to elicit probabilities from human observers and I don't see why we should rule it out entirely. (b) If we simply decide not to mention how to use this field for the case of human observations, there's some risk of confusion if users read the doc and assume the field should always have some number in it. That's why I think it's useful to write, as I did, "For human classifications, omit this field (in CSV, an empty string)"
We're pretty close to agreeing, but do let me know how far I've persuaded you re the humans...
I'm fine with humans expressing their confidence (even though it might be hard). I therefore don't think we should express anything like "for human classifications, omit this field or use an approximate probability".
Maybe it's just me, but I find it a bit confusing to have three different terms (confidence, accuracy and probability) in the term+definition. Can we reduce this?
I made the language a bit more self-consistent (#cbebacb). I do not agree that we should delete the comment about human classifications. The reason is this: If we don't make it clear that the classificationProbability is typically not available for human judgments, then it's relatively likely that the field will become polluted by badly-estimated probabilities from users who mistakenly assume they should fill it in.
Thanks for your work and patience on this @danstowell! I've updated your description from:
Degree of certainty of the classification. Expressed as a probability, with 1 being maximum certainty. For human classifications, omit this field (in CSV, an empty string) or use an approximate probability if available.
To:
Degree of certainty of the (most recent) classification. Expressed as a probability, with 1 being maximum certainty. Omit or provide an approximate probability for human classifications.
I prefer not to specify "in CSV, an empty string" as that applies to all fields.
This PR will be merged into another branch together with a number of other upcoming changes, so it won't be immediately reflected on the website.
Fix #217