Proposal: basisOfIdentification and identificationConfidence

Open ianengelbrecht opened this issue 5 years ago • 12 comments

May I propose these properties be considered for the Identification class? They are described in the Barcode of Wildlife standards. It may be a good idea to separate the concept of confidence out of dwc:identificationQualifier. This article on open nomenclature provides a nice clarification of what aff., cf. and identification certainty are and how they relate to each other.

ianengelbrecht avatar Mar 19 '19 14:03 ianengelbrecht

I see that the definition for dwc:nameAccordingTo says 'For taxa that result from identifications, a reference to the keys, monographs, experts and other sources should be given'. This gives dual meaning to this term depending on what kind of dataset it is. If basisOfIdentification is added as an Identification term these could be separated.

ianengelbrecht avatar Oct 10 '19 11:10 ianengelbrecht

I'm interested in being able to specify the "confidence" in a detection/identification. I would suggest specifying this as a numerical probability between 0 and 1 inclusive. Although not everyone thinks in terms of probabilities, I hope that format would be the least susceptible to misinterpretation, and would also be usable in further analysis.

(I'm also interested in the ability to express multiple possible species identifications e.g. {"Luscinia megarhynchos": 0.6, "Luscinia luscinia": 0.3}, but perhaps that's outside the scope of this thread?)
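A minimal sketch of how such a set of candidate identifications could be handled, assuming a plain mapping from scientific name to probability. The function name `validate_candidates` is hypothetical and not part of Darwin Core, and treating the leftover probability as "unassigned" is only one possible convention.

```python
import math


def validate_candidates(candidates):
    """Check a name -> probability mapping and return the unassigned probability mass."""
    for name, p in candidates.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {name!r} is outside [0, 1]: {p}")
    total = math.fsum(candidates.values())
    # Allow a tiny tolerance for floating-point rounding.
    if total > 1.0 + 1e-9:
        raise ValueError(f"probabilities sum to more than 1: {total}")
    return max(0.0, 1.0 - total)


candidates = {"Luscinia megarhynchos": 0.6, "Luscinia luscinia": 0.3}
unassigned = validate_candidates(candidates)     # mass not assigned to any listed taxon
best_name = max(candidates, key=candidates.get)  # most probable candidate
```

Here the probabilities need not sum to 1, which leaves room for "something else entirely".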

danstowell avatar Oct 31 '19 14:10 danstowell

Before recommending a term identificationConfidence I think some research and discussion is needed. How could this be calculated objectively, or should it be a controlled vocabulary? Who should determine it? Which identification does it refer to? Shouldn't it be in the Identification History extension, rather than the main part of Darwin Core? identificationQualifier already exists in the extension. So I would recommend that this issue becomes part of an identifications task group that could address the many issues about identifications in DwC.

qgroom avatar Nov 09 '19 18:11 qgroom

I agree with @Quentin Groom about thinking of this in terms of an extension. I have the same concerns. There is also the term identificationVerificationStatus to consider, which has a bit of overlap with what is being proposed for identificationConfidence.

tucotuco avatar Nov 09 '19 18:11 tucotuco

I'm fully agnostic about whether Ian's proposal should be in core or extension, since I've not been involved before.

Ian's original proposal mentions "It may be a good idea to separate the concept of confidence out of dwc:identificationQualifier".

I take it the confidence would be determined by the source asserting the overall record (as are the other fields, after all), and would refer to the overall record. (Would love to have confidence on a per-field basis but probably too complex for this standard.)

If "controlled vocabulary" is the consensus position, I'd hope for a vocabulary that can be mapped onto probabilities. A good example may be the IPCC vocabulary about uncertainties - see Table 1 in this PDF.
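As a rough sketch of what "a vocabulary that can be mapped onto probabilities" could look like, assuming the table referred to is the IPCC likelihood scale from the AR5 guidance note on uncertainties. The names `IPCC_LIKELIHOOD` and `terms_for_probability` are hypothetical, and nothing here is a proposed Darwin Core vocabulary.

```python
# Likelihood terms with their probability ranges (lower bound, upper bound).
IPCC_LIKELIHOOD = {
    "virtually certain": (0.99, 1.00),
    "very likely": (0.90, 1.00),
    "likely": (0.66, 1.00),
    "about as likely as not": (0.33, 0.66),
    "unlikely": (0.00, 0.33),
    "very unlikely": (0.00, 0.10),
    "exceptionally unlikely": (0.00, 0.01),
}


def terms_for_probability(p):
    """Return every term whose range contains p; the ranges nest, so several may match."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability outside [0, 1]: {p}")
    return [term for term, (low, high) in IPCC_LIKELIHOOD.items() if low <= p <= high]


matching = terms_for_probability(0.95)
```

Because the ranges overlap, a term maps to an interval rather than a single number, so a round trip from probability to term and back loses precision.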

danstowell avatar Nov 09 '19 19:11 danstowell

My feeling is that we should aim to model current practice with identifications while at the same time promoting good practice. The article mentioned in the original post describes nicely the (potential) dichotomy between what is meant by cf. and the proposed dwc:identificationConfidence. Briefly, they refer to different forms of uncertainty that might exist in an identification. The first is uncertainty about whether a particular specimen fits into a taxon concept. The person doing the identification is confident in their knowledge of the taxonomy for the group and the taxa are quite clearly defined, but the specimen doesn’t quite fit a known taxon. The specimen is also not clearly something new (in which case the qualifier ‘aff.’ would be used, and there is no uncertainty). On the other hand, dwc:identificationConfidence should represent the uncertainty that results from the limited knowledge of the identifier. It’s the equivalent of a question mark after a taxon name. An example might be that I look at a theraphosid spider that I think might be Ceratogyrus pillansi. The type specimen is lost, and the type locality is imprecise (Rhodesia), so I write on the det. label ‘Ceratogyrus pillansi?’. Switching to a species where the taxon concept is clearer, having two separate terms would allow for something like ‘Ceratogyrus aff. darlingi?’, which is equivalent to ‘Ceratogyrus cf. darlingi’. We could also have ‘Ceratogyrus cf. darlingi?’. I’ve never seen this, but it would equate to ‘Ceratogyrus darlingi?’.

In practice we see a fair number of both cf. and ? on specimen labels during data capture. In discussions with taxonomic experts the response to using two different indicators of uncertainty has been mixed, and seems to depend on the discipline and the preferences of the individual.

If identificationConfidence were to be adopted, my own feeling is that it should only ever be binary (confident or not confident) or a probability based on a valid quantitative analysis, along the lines of @danstowell’s suggestion above. What I feel MUST be avoided is a list of ordinal levels of certainty. I’ve used these in existing applications (iSpot being one) and even implemented it in my own databases in the past. All you end up with is people confused as to whether they are ‘certain’, ‘very certain’, or ‘highly certain’ about their identifications.

ianengelbrecht avatar Nov 10 '19 08:11 ianengelbrecht

@Ian Engelbrecht, does identificationVerificationStatus (http://rs.tdwg.org/dwc/terms/#dwc:identificationVerificationStatus) not cover the same concept as the proposed identificationConfidence?

tucotuco avatar Nov 10 '19 13:11 tucotuco

Apologies for the delay in responding. My feeling is that it does not. Verification should be a separate process from making an identification. The workflow should be that one person identifies a specimen, and someone else, ideally with better knowledge of the taxon or more experience, verifies that identification or corrects it with their own, new, identification. I indicated a similar line of thinking for georeferenceVerificationStatus, and proposed georeferenceVerifiedBy and georeferenceVerifiedDate properties there (the equivalents for identification verification would be required too). A person shouldn't be able to verify their own identifications (or georeferences), unless perhaps by another method, such as confirmation of a morphological identification using molecular data. A real-world example of identification verification is iSpot, which has an 'I agree with this ID' option on its identification form that is only available to others, and iSpot records who agrees with an identification. By contrast, I don't know of any collections databases that implement identification verifications.

ianengelbrecht avatar Jan 08 '20 11:01 ianengelbrecht

An alternative term for identificationConfidence may be identificationCertainty

ianengelbrecht avatar Jan 08 '20 11:01 ianengelbrecht

Regarding additional fields for ...verifiedBy and ...verifiedDate, an alternative might be simply to record all of that information in ...verificationStatus.

ianengelbrecht avatar Jan 08 '20 11:01 ianengelbrecht

For basisOfIdentification:

Definition: The method, tool, or rationale used in identifying the specimen.

Comments: Recommended best practice is to use a controlled vocabulary.

Examples: 'tacit expertise', 'field guide', 'key', 'DNA' [or perhaps 'BLAST' or other algorithm used], 'type material for taxon', 'compared with type material', 'compared with non-type material'.

ianengelbrecht avatar Jun 24 '20 12:06 ianengelbrecht

I think this is interesting, but it all depends on it having a good vocabulary, otherwise it is better to just use identificationRemarks.

I use phrases like 'field det.' and 'duplicate det.' (I work on mosses and have found that "duplicates" do not necessarily belong to the same species). With more and more specimen images becoming available online, virtual determinations based on images have also become a thing.

And then there is AI of course.

It would be nice to have a term like this, which would mostly be 'morphology' for me, with the detail in identificationRemarks.

nielsklazenga avatar Sep 24 '20 13:09 nielsklazenga

identificationConfidence used together with identificationConfidenceType (in the same way that dwc:organismQuantity is used with dwc:organismQuantityType) could make the value somewhat self-explanatory and avoid the need for a single clear definition or objective calculation.

Possible values of identificationConfidenceType include:

  1. Two-level: identificationConfidence could be one of {Unsure, Sure} or {0, 1}, which a user could decide on and mark easily.
  2. Three-level: identificationConfidence could be one of {High, Medium, Preliminary}, as shown in https://bwp-informatics.readthedocs.io/en/latest/bwp_data_standard.html, or {0, 1, 2}, which a user could decide on and mark easily.
  3. Probability: identificationConfidence could be a continuous numerical value in [0, 1], which is often generated by an AI algorithm. Probabilities generated by different algorithms have different meanings and are not comparable, so a different identificationConfidenceType could be used to distinguish each of them.
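A minimal sketch of how a value could be checked against its declared type under this scheme. The type labels `"two-level"`, `"three-level"` and `"probability"` and the function `check_confidence` are hypothetical names chosen for illustration; only the value sets come from the list above.

```python
# Allowed categorical values per (hypothetical) identificationConfidenceType label.
ALLOWED_VALUES = {
    "two-level": {"Unsure", "Sure"},
    "three-level": {"High", "Medium", "Preliminary"},
}


def check_confidence(value, confidence_type):
    """Return True if value is valid for the given identificationConfidenceType."""
    if confidence_type in ALLOWED_VALUES:
        return value in ALLOWED_VALUES[confidence_type]
    if confidence_type == "probability":
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        return is_number and 0.0 <= value <= 1.0
    raise ValueError(f"unknown identificationConfidenceType: {confidence_type!r}")


ok_categorical = check_confidence("Sure", "two-level")
ok_probability = check_confidence(0.87, "probability")
bad_categorical = check_confidence("Certain", "three-level")
```

If probabilities from different algorithms need to be kept apart, the `"probability"` branch could be keyed on an algorithm-specific type label instead.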

quarrying avatar Dec 20 '22 13:12 quarrying

Closing for lack of evidence of demand.

tucotuco avatar Mar 01 '24 18:03 tucotuco

By the way, Camtrap-DP has a key "classificationProbability" which plays a similar role to identificationConfidence. For more background on the discussion, see Camtrap-DP issue 170 and Camtrap-DP issue 217.

(See also related discussion OBIS issue 209 which I wasn't aware of until just now.)

danstowell avatar Mar 01 '24 19:03 danstowell