
How to capture the lifecycle of a predicted and then later curated mapping?

cthoyt opened this issue 8 months ago · 1 comment

Let's say I generate an exact match using a lexical mapping. My mapping tool gives a confidence of 0.7. So I get SSSOM like

| subject_id | subject_label | predicate_id | object_id | object_label | mapping_justification | confidence | mapping_tool |
|---|---|---|---|---|---|---|---|
| CHEBI:134180 | leucomethylene blue | skos:exactMatch | mesh:C011010 | hydromethylthionine | semapv:LexicalMatching | 0.7 | generate_chebi_mesh_mappings.py |

Then, I review this mapping. I say that it's correct with 0.95 confidence. How do I represent this? Here are some options I thought of:

  1. Add an author_id column with my ORCID, swap the mapping justification to semapv:ManualMappingCuration, and overwrite the confidence from 0.7 to 0.95
  2. Add a reviewer_id column with my ORCID. But then, how do I represent that I have a confidence as a reviewer? Do I throw away the mapping tool's confidence? What if I want to keep track of this?
  3. Some other way? Please also let me know if I've misunderstood how to use author_id/creator_id/reviewer_id
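
To make option 2 concrete, here is a hypothetical sketch of what keeping both confidences might look like. Note that the `reviewer_confidence` column is invented for illustration and is not part of the SSSOM spec, and the ORCID is a placeholder:

```tsv
subject_id	subject_label	predicate_id	object_id	object_label	mapping_justification	confidence	mapping_tool	reviewer_id	reviewer_confidence
CHEBI:134180	leucomethylene blue	skos:exactMatch	mesh:C011010	hydromethylthionine	semapv:LexicalMatching	0.7	generate_chebi_mesh_mappings.py	orcid:0000-0000-0000-0000	0.95
```

The open question is whether something like this is valid SSSOM, or whether the two confidences must live in separate records or sets.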

The use case for this question is Biomappings, since we do lexical predictions and curate them, and want to keep track of this provenance.

Given the answer to this question, it will also be possible to generalize the Biomappings curation interface into a generic SSSOM curation interface.

cthoyt avatar Apr 25 '25 22:04 cthoyt

This issue has been debated before; the last time we discussed it we didn't reach a definite conclusion: https://github.com/mapping-commons/sssom/issues/345

In a nutshell:

  1. Separating mapping processes during the curation life cycle was not a primary concern of the design of SSSOM, so it was all mushed together into one record
  2. The idea is that a single score tells a downstream user "how sure they can be"
  3. If you absolutely want to represent the life cycle, you will have to create intermediate mapping sets, so you say:
    1. Mapping set 1 derived from lexical matching (semapv:LexicalMatching)
    2. Mapping set 2 reviews mapping set 1 (semapv:MappingReview) and sets mapping_set_source to mapping set 1
    3. Mapping set 3 is derived from sets 1 and 2, referring to both, generating a composite score and using semapv:CompositeMatching or some such as a justification. This last set is the only one you publish to the world.
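
A rough sketch of the three sets described above, with invented file names, example URIs, a placeholder ORCID, and an arbitrary composite score of 0.9; only the provenance-relevant slots are shown, and the set-level metadata is abbreviated as commented header lines:

```tsv
# set1-lexical.sssom.tsv
# mapping_set_id: https://example.org/set1
subject_id	predicate_id	object_id	mapping_justification	confidence	mapping_tool
CHEBI:134180	skos:exactMatch	mesh:C011010	semapv:LexicalMatching	0.7	generate_chebi_mesh_mappings.py

# set2-review.sssom.tsv
# mapping_set_id: https://example.org/set2
# mapping_set_source: https://example.org/set1
subject_id	predicate_id	object_id	mapping_justification	confidence	reviewer_id
CHEBI:134180	skos:exactMatch	mesh:C011010	semapv:MappingReview	0.95	orcid:0000-0000-0000-0000

# set3-published.sssom.tsv
# mapping_set_id: https://example.org/set3
# mapping_set_source: https://example.org/set1 | https://example.org/set2
subject_id	predicate_id	object_id	mapping_justification	confidence
CHEBI:134180	skos:exactMatch	mesh:C011010	semapv:CompositeMatching	0.9
```

How the composite score is actually computed from the predictor's 0.7 and the reviewer's 0.95 would be up to the producer of set 3.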

None of this is super awesome. Another option, which would make this a bit cleaner, would be to push for #359 and then add a new slot, source_mapping, that you can use to point specifically to the mappings used for deriving a particular new mapping.
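
Purely for illustration, if such a source_mapping slot existed, the published record might carry pointers to the individual mappings it was derived from. The column name, URI scheme, and multi-valued `|` syntax here are all hypothetical:

```tsv
subject_id	predicate_id	object_id	mapping_justification	confidence	source_mapping
CHEBI:134180	skos:exactMatch	mesh:C011010	semapv:CompositeMatching	0.9	https://example.org/set1#m1 | https://example.org/set2#m1
```

This would keep the full derivation chain discoverable from a single published set, instead of requiring readers to chase mapping_set_source links across files.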

None of this is normative, just spitballing.

matentzn avatar Apr 27 '25 07:04 matentzn