ontology-access-kit OAK generated mapping sets should be SSSOM conformant

The latest version of OAK is awesome, and things are looking much better. A number of things should be considered to make sure OAK-generated SSSOM files are valid SSSOM. I think these should be implemented for both the mappings and the lexmatch commands.

[ ] An empty mapping set should still have the column headers (at least the required ones, currently subject_id, object_id, predicate_id and mapping_justification). At the moment, the mappings command at least does not include the columns when the result is empty, which makes it impossible to process the mapping set with SSSOM-conformant tools afterwards.
[ ] https://mapping-commons.github.io/sssom/subject_source/ should be a reference, not a string (when validating, we get jsonschema.exceptions.ValidationError: Slot 'object_source' has an incorrect value: UBERON). I would suggest the following recipe for extracting a reference: if there is an ontology_iri (e.g. http://purl.obolibrary.org/obo/mondo.owl), take it and CURIEfy it (obo:mondo.owl). Otherwise, assume https://w3id.org/oak/unknown_prefix/{filename} and CURIEfy it (OAK_UNKNOWN_PREFIX:{filename}, e.g. OAK_UNKNOWN_PREFIX:mondo.owl). Keep ontology_iri general, as we may use other properties in the future to get this information. Due to the "adoption" discussion, it's better not to use the prefix itself to determine the subject_source. This applies at both the mapping and the mapping_set level. Don't forget to add prefixes generated this way to the curie_map.
[ ] Mapping set id should not be temp but a random IRI as per sssom-py preferences.
[ ] There should be exactly 1 (not 2, not 0) empty line at the end of a mapping set.
[ ] Full-URI notation such as http://www.ebi.ac.uk/cmpo/CMPO_0000364 should be avoided for id columns. I think we should first check whether bioregistry can compress the URI and, if not, emit a warning and exclude the record.
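
To make the first two items concrete, here is a minimal sketch in plain Python. It is not OAK or sssom-py code: `source_reference` and `write_mappings` are hypothetical helper names, the fallback prefix and namespace are the ones proposed in the recipe above, and falling back to the filename when no known namespace matches the ontology IRI is my own assumption. The SSSOM metadata block (commented YAML header) is left out for brevity.

```python
import csv
import io

# Required SSSOM columns as listed in the first item above.
REQUIRED_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]

# Fallback prefix and namespace as proposed in the recipe (not an existing OAK constant).
UNKNOWN_PREFIX = "OAK_UNKNOWN_PREFIX"
UNKNOWN_NAMESPACE = "https://w3id.org/oak/unknown_prefix/"


def source_reference(ontology_iri, filename, curie_map):
    """Return a CURIE for a *_source slot; register the fallback prefix if it is used."""
    if ontology_iri:
        # CURIEfy using the longest matching namespace already in the curie_map.
        matches = [(p, ns) for p, ns in curie_map.items() if ontology_iri.startswith(ns)]
        if matches:
            prefix, namespace = max(matches, key=lambda m: len(m[1]))
            return f"{prefix}:{ontology_iri[len(namespace):]}"
    # Assumption: with no ontology IRI (or no matching namespace), fall back to the filename.
    curie_map.setdefault(UNKNOWN_PREFIX, UNKNOWN_NAMESPACE)
    return f"{UNKNOWN_PREFIX}:{filename}"


def write_mappings(rows, out):
    """Write mapping rows as TSV; the header line is written even when rows is empty."""
    writer = csv.DictWriter(
        out,
        fieldnames=REQUIRED_COLUMNS,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently drop non-required columns in this sketch
    )
    writer.writeheader()
    writer.writerows(rows)


curie_map = {"obo": "http://purl.obolibrary.org/obo/"}
print(source_reference("http://purl.obolibrary.org/obo/mondo.owl", "mondo.owl", curie_map))
print(source_reference(None, "mondo.owl", curie_map))

buffer = io.StringIO()
write_mappings([], buffer)
print(repr(buffer.getvalue()))
```

The point of the second helper is only that the header is emitted unconditionally, so downstream SSSOM tooling always sees the required columns.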

Mar 26 '23 11:03 matentzn

[ ] Careful with the %_source suggestion above. There is an option --maps-to-source that needs to be rethought in light of this.

Actually, now I see. Perhaps the only thing that needs to happen, instead of all the complicated fluff I explain above, is to append a colon? For example, UBERON: would mean UBERON. But this does not solve the issue with adoptions.

Alternatively (and I would be good with that) we could simply drop the _source properties from OAK.

Mar 26 '23 11:03 matentzn

It would also be good to update the sssom and sssom-schema versions in the OAK dependencies. I'm getting an incompatibility when using poetry.

Apr 05 '23 11:04 anitacaron

Thanks @anitacaron! @hrshdhgd can this be prioritised?

Apr 05 '23 12:04 matentzn

@anitacaron, are you using the latest version of oaklib? I have the latest version of sssom in my poetry.lock file. With regard to sssom-schema, @matentzn, doesn't sssom-py dictate which version to use?

Apr 05 '23 13:04 hrshdhgd

I don’t think source should include file serialization suffixes.

I don’t think we should use unknown.

What about infores?

Apr 05 '23 14:04 cmungall

@hrshdhgd sssom and sssom-schema need to be independent.

Apr 05 '23 14:04 anitacaron