
How to capture the lifecycle of a predicted and then later curated mapping?

cthoyt opened this issue 8 months ago · 1 comment

Let's say I generate an exact match using a lexical mapping. My mapping tool gives a confidence of 0.7. So I get SSSOM like

| subject_id | subject_label | predicate_id | object_id | object_label | mapping_justification | confidence | mapping_tool |
|---|---|---|---|---|---|---|---|
| CHEBI:134180 | leucomethylene blue | skos:exactMatch | mesh:C011010 | hydromethylthionine | semapv:LexicalMatching | 0.7 | generate_chebi_mesh_mappings.py |

Then, I review this mapping. I say that it's correct with 0.95 confidence. How do I represent this? Here are some options I thought of:

  1. Add an author_id column with my ORCID, swap the mapping justification to semapv:ManualMappingCuration, and overwrite the confidence from 0.7 to 0.95
  2. Add a reviewer_id column with my ORCID. But then, how do I represent that I have a confidence as a reviewer? Do I throw away the mapping tool's confidence? What if I want to keep track of this?
  3. Some other way? Please also let me know if I've misunderstood how to use author_id/creator_id/reviewer_id
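
To make option 2 concrete, here is a hypothetical sketch of what keeping both confidences might look like. Note that the `reviewer_confidence` column is invented for illustration and is not part of the SSSOM spec, and the ORCID is a placeholder:

```tsv
subject_id	subject_label	predicate_id	object_id	object_label	mapping_justification	confidence	mapping_tool	reviewer_id	reviewer_confidence
CHEBI:134180	leucomethylene blue	skos:exactMatch	mesh:C011010	hydromethylthionine	semapv:LexicalMatching	0.7	generate_chebi_mesh_mappings.py	orcid:0000-0000-0000-0000	0.95
```

The open question is whether something like this is valid SSSOM, or whether the two confidences must live in separate records or sets.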

The use case for this question is Biomappings, since we do lexical predictions and curate them, and want to keep track of this provenance.

Given the answer to this question, it will also be possible to generalize the Biomappings curation interface into a generic SSSOM curation interface.

cthoyt avatar Apr 25 '25 22:04 cthoyt

This issue has been debated before; the last time we discussed it we didn't reach a definite conclusion: https://github.com/mapping-commons/sssom/issues/345

In a nutshell:

  1. Separating mapping processes during the curation life cycle was not a primary concern of the design of SSSOM, so it was all mushed together into one record
  2. The idea is that a single score tells a downstream user "how sure they can be"
  3. If you absolutely want to represent the life cycle, you will have to create intermediate mapping sets, so you say:
    1. Mapping set 1 derived from lexical matching (semapv:LexicalMatching)
    2. Mapping set 2 reviews mapping set 1 (semapv:MappingReview) and sets mapping_set_source to mapping set 1
    3. Mapping set 3 is derived from sets 1 and 2, referring to both, generating a composite score and using semapv:CompositeMatching or some such as a justification. This last set is the only one you publish to the world.
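
A rough sketch of the three sets described above, with invented file names, example URIs, a placeholder ORCID, and an arbitrary composite score of 0.9; only the provenance-relevant slots are shown, and the set-level metadata is abbreviated as commented header lines:

```tsv
# set1-lexical.sssom.tsv
# mapping_set_id: https://example.org/set1
subject_id	predicate_id	object_id	mapping_justification	confidence	mapping_tool
CHEBI:134180	skos:exactMatch	mesh:C011010	semapv:LexicalMatching	0.7	generate_chebi_mesh_mappings.py

# set2-review.sssom.tsv
# mapping_set_id: https://example.org/set2
# mapping_set_source: https://example.org/set1
subject_id	predicate_id	object_id	mapping_justification	confidence	reviewer_id
CHEBI:134180	skos:exactMatch	mesh:C011010	semapv:MappingReview	0.95	orcid:0000-0000-0000-0000

# set3-published.sssom.tsv
# mapping_set_id: https://example.org/set3
# mapping_set_source: https://example.org/set1 | https://example.org/set2
subject_id	predicate_id	object_id	mapping_justification	confidence
CHEBI:134180	skos:exactMatch	mesh:C011010	semapv:CompositeMatching	0.9
```

How the composite score is actually computed from the predictor's 0.7 and the reviewer's 0.95 would be up to the producer of set 3.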

None of this is super awesome. Another option, which would make this a bit cleaner, would be to push for #359 and then add a new slot, source_mapping, that you can use to point specifically to the mappings used for deriving a particular new mapping.
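
Purely for illustration, if such a source_mapping slot existed, the published record might carry pointers to the individual mappings it was derived from. The column name, URI scheme, and multi-valued `|` syntax here are all hypothetical:

```tsv
subject_id	predicate_id	object_id	mapping_justification	confidence	source_mapping
CHEBI:134180	skos:exactMatch	mesh:C011010	semapv:CompositeMatching	0.9	https://example.org/set1#m1 | https://example.org/set2#m1
```

This would keep the full derivation chain discoverable from a single published set, instead of requiring readers to chase mapping_set_source links across files.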

None of this is normative, just spitballing.

matentzn avatar Apr 27 '25 07:04 matentzn