How to capture the lifecycle of a predicted and then later curated mapping?
Let's say I generate an exact match using lexical matching. My mapping tool gives a confidence of 0.7, so I get SSSOM like this:
| subject_id | subject_label | predicate_id | object_id | object_label | mapping_justification | confidence | mapping_tool |
|---|---|---|---|---|---|---|---|
| CHEBI:134180 | leucomethylene blue | skos:exactMatch | mesh:C011010 | hydromethylthionine | semapv:LexicalMatching | 0.7 | generate_chebi_mesh_mappings.py |
Then, I review this mapping. I say that it's correct with 0.95 confidence. How do I represent this? Here are some options I thought of:
- Add an `author_id` column with my ORCID, swap the mapping justification to `semapv:ManualMappingCuration`, and overwrite the confidence from 0.7 to 0.95
- Add a `reviewer_id` column with my ORCID. But then, how do I represent that I have a confidence as a reviewer? Do I throw away the mapping tool's confidence? What if I want to keep track of this?
- Some other way?

(The first two options are sketched below.) Please also let me know if I've misunderstood how to use `author_id`/`creator_id`/`reviewer_id`.
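To make the first two options concrete (the ORCID is just a placeholder, and the label columns are omitted for brevity): option 1 rewrites the record in place, so the original 0.7 and the lexical justification are gone:

| subject_id | predicate_id | object_id | mapping_justification | confidence | author_id |
|---|---|---|---|---|---|
| CHEBI:134180 | skos:exactMatch | mesh:C011010 | semapv:ManualMappingCuration | 0.95 | orcid:0000-0001-2345-6789 |

Option 2 keeps the predicted record and only annotates who reviewed it, leaving my 0.95 with nowhere to go:

| subject_id | predicate_id | object_id | mapping_justification | confidence | mapping_tool | reviewer_id |
|---|---|---|---|---|---|---|
| CHEBI:134180 | skos:exactMatch | mesh:C011010 | semapv:LexicalMatching | 0.7 | generate_chebi_mesh_mappings.py | orcid:0000-0001-2345-6789 |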
The use case for this question is Biomappings, since we do lexical predictions and curate them, and want to keep track of this provenance.
Given the answer to this question, it will also be possible to generalize the Biomappings curation interface into a generic SSSOM curation interface.
This issue is a bit contentious; the last time we tried to settle it, we didn't reach a definite conclusion: https://github.com/mapping-commons/sssom/issues/345
In a nutshell:
- Separating mapping processes during the curation life cycle was not a primary concern of the design of SSSOM, so it was all mushed together into one record
- The idea is that a single score is there to tell a downstream user "how sure they can be"
- If you absolutely want to represent the life cycle, you will have to create intermediate mapping sets (sketched after this list), so you say:
- Mapping set 1 is derived from lexical matching (`semapv:LexicalMatching`)
- Mapping set 2 reviews mapping set 1 (`semapv:MappingReview`), and sets `mapping_set_source` to mapping set 1
- Mapping set 3 is derived from 1 and 2, referring to both, generating a composite score and using `semapv:CompositeMatching` or some such as a justification. This last set is the only one you publish to the world.
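A minimal sketch of what those three sets could look like as embedded SSSOM/TSV, one file per set. The mapping set IDs and the ORCID are invented, `semapv:CompositeMatching` is the hypothetical justification just mentioned, the `#`-prefixed lines are the embedded metadata block, and columns would be tab-separated in a real file:

Mapping set 1 (raw predictions):

```
#mapping_set_id: https://example.org/chebi-mesh/predictions.sssom.tsv
subject_id      predicate_id     object_id     mapping_justification   confidence  mapping_tool
CHEBI:134180    skos:exactMatch  mesh:C011010  semapv:LexicalMatching  0.7         generate_chebi_mesh_mappings.py
```

Mapping set 2 (the review, derived from set 1):

```
#mapping_set_id: https://example.org/chebi-mesh/review.sssom.tsv
#mapping_set_source:
#  - https://example.org/chebi-mesh/predictions.sssom.tsv
subject_id      predicate_id     object_id     mapping_justification  confidence  reviewer_id
CHEBI:134180    skos:exactMatch  mesh:C011010  semapv:MappingReview   0.95        orcid:0000-0001-2345-6789
```

Mapping set 3 (the published composite, derived from sets 1 and 2):

```
#mapping_set_id: https://example.org/chebi-mesh/published.sssom.tsv
#mapping_set_source:
#  - https://example.org/chebi-mesh/predictions.sssom.tsv
#  - https://example.org/chebi-mesh/review.sssom.tsv
subject_id      predicate_id     object_id     mapping_justification     confidence
CHEBI:134180    skos:exactMatch  mesh:C011010  semapv:CompositeMatching  0.95
```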
None of this is super awesome. Another option that would make this a bit cleaner would be to push for #359 and then a new slot `source_mapping` that you can use to point specifically to the mappings used for deriving a particular new mapping.
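Purely illustratively — `source_mapping` does not exist, and the mapping IDs below assume #359 gave individual mappings resolvable identifiers — the published record could then point directly at the prediction and the review in a single flat record (multivalued slots are `|`-separated in SSSOM/TSV):

```
subject_id      predicate_id     object_id     mapping_justification     confidence  source_mapping
CHEBI:134180    skos:exactMatch  mesh:C011010  semapv:CompositeMatching  0.95        https://example.org/mapping/1|https://example.org/mapping/2
```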
None of this is normative, just spitballing.