OBOFoundry.github.io
Hosting an official "OBO context" of all prefixes relevant to biological data integration
In our OBO universe, we mostly care about OBO PURLs and CURIEs, which are among the most important achievements of our community. For example, the PURL http://purl.obolibrary.org/obo/CL_123 (and the corresponding CURIE CL:123) represents an entity in the CL ontology.
The reality is that OBO is merging more and more with a wider, interconnected world of biological and biomedical data efforts, whether they are scientific databases, clinical efforts or ontology standards, and it makes sense to try and organise our relationship with these a bit more.
One universal problem we face is the interpretation and representation of cross-references to biological databases (such as Reactome, UniProt and more) or medical terminologies (such as MeSH, SNOMED or MedDRA). We typically represent cross-references as "CURIE strings", such as OMIM:231200, and link them to the ontology concepts using the oboInOwl:hasDbXref relationship.
Now there are two things we may want to do with this cross reference:
- We may wish to provide a linkout to the related resource, so that a user can "look at additional information about this concept"
- We may want to offer the opportunity for data integration efforts to connect information related to both resources (the ontology and the referenced external resources)
Either use case requires the expansion of the CURIE (e.g. OMIM:231200) to a URL (e.g. https://omim.org/entry/231200 or https://identifiers.org/meddra:10015919). The problem, however, is that not only can the CURIE prefix have 20 alternatives (omim, mim, MIM), but even worse (for us), there can be dozens of valid URI expansions (https://omim.org/MIM:603903, https://www.omim.org/entry/603903 and many, many more). This means that the datasets we make publicly available need to be cumbersomely stitched together using custom ETL pipelines after the fact.
For these kinds of reasons (making integration easier), it makes sense to try and unify the use of CURIE prefixes ("when you provide an omim identifier, you use the prefix OMIM, not mim, not omim, not MIM") and URI prefixes ("when you expand an OMIM CURIE, you use https://omim.org/MIM:603903").
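To make the idea concrete, here is a minimal sketch of what "unifying" means in code, using only the Python standard library. The lookup tables and helper names (`PREFIX_SYNONYMS`, `URI_PREFIXES`, `standardize_curie`, `expand`) are assumptions made up for illustration, not part of any existing OBO tool; the OMIM URI prefix is simply the one quoted above.

```python
# Hypothetical lookup tables; a real setup would load these from a shared prefix map.
PREFIX_SYNONYMS = {"omim": "OMIM", "mim": "OMIM", "MIM": "OMIM", "OMIM": "OMIM"}
URI_PREFIXES = {"OMIM": "https://omim.org/MIM:"}


def standardize_curie(curie: str) -> str:
    """Rewrite a CURIE so that it uses the preferred prefix."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in PREFIX_SYNONYMS:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return f"{PREFIX_SYNONYMS[prefix]}:{local_id}"


def expand(curie: str) -> str:
    """Expand a CURIE to a URI using the single preferred URI prefix."""
    prefix, _, local_id = standardize_curie(curie).partition(":")
    return URI_PREFIXES[prefix] + local_id


print(standardize_curie("mim:603903"))  # OMIM:603903
print(expand("MIM:603903"))  # https://omim.org/MIM:603903
```

Whatever spelling of the prefix comes in, exactly one CURIE and one URI come out, which is the property that would make custom ETL stitching unnecessary.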
@cthoyt has been at the forefront of an effort to bring some order into the current anarchy. One of the concepts he has developed, and which I personally find super awesome, is the idea of "organisational contexts" hosted as part of the Bioregistry. These are powered by an enormous database of CURIE prefix/URI prefix combinations, but allow the flexibility of hardcoding certain preferences, such as "we know that PubMed would like us to use the pubmed prefix for CURIEs, but we really want to use PMID for historical reasons". To this end, @anitacaron and I are maintaining a context we wholly illegally called the "OBO context", which reflects some of our community's preferences: https://bioregistry.io/context/obo. The second cool piece of the system @cthoyt developed is the so-called "Extended Prefix Map" (EPM). Here is an example. The cool thing is that the EPM not only contains a clean prefix map, it also contains all existing synonyms, for example:
    "pattern": "^C?\\d+$",
    "prefix": "Orphanet",
    "prefix_synonyms": [
        "ordo",
        "orphanet.ordo"
    ],
    "uri_prefix": "http://www.orpha.net/ORDO/Orphanet_",
    "uri_prefix_synonyms": [
        "http://bioregistry.io/ordo:",
        "http://bioregistry.io/orphanet.ordo:",
        "http://identifiers.org/orphanet.ordo/",
        ....
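As a rough sketch of how a record like the one above could be used, the following standard-library Python contracts any of the synonym URIs to the preferred CURIE and rewrites them to the preferred URI. The record is truncated to the fields and synonyms visible in the excerpt; the function names (`parse_uri`, `compress`, `standardize_uri`) and the local identifiers in the usage lines are illustrative assumptions, not the API of an existing package.

```python
import re

# The record from the excerpt above, limited to the entries that are visible there.
RECORD = {
    "pattern": "^C?\\d+$",
    "prefix": "Orphanet",
    "prefix_synonyms": ["ordo", "orphanet.ordo"],
    "uri_prefix": "http://www.orpha.net/ORDO/Orphanet_",
    "uri_prefix_synonyms": [
        "http://bioregistry.io/ordo:",
        "http://bioregistry.io/orphanet.ordo:",
        "http://identifiers.org/orphanet.ordo/",
    ],
}


def parse_uri(uri, records):
    """Return (record, local_id) for the longest matching URI prefix, or None."""
    candidates = []
    for record in records:
        for uri_prefix in [record["uri_prefix"], *record["uri_prefix_synonyms"]]:
            if uri.startswith(uri_prefix):
                candidates.append((uri_prefix, record))
    if not candidates:
        return None
    # Prefer the longest match so that nested namespaces resolve to the right record.
    uri_prefix, record = max(candidates, key=lambda pair: len(pair[0]))
    local_id = uri[len(uri_prefix):]
    # Reject local identifiers that do not match the record's pattern.
    if not re.match(record["pattern"], local_id):
        return None
    return record, local_id


def compress(uri, records):
    """Contract a URI (canonical or synonym) to a CURIE with the preferred prefix."""
    parsed = parse_uri(uri, records)
    if parsed is None:
        return None
    record, local_id = parsed
    return f"{record['prefix']}:{local_id}"


def standardize_uri(uri, records):
    """Rewrite a synonym URI to the preferred URI prefix."""
    parsed = parse_uri(uri, records)
    if parsed is None:
        return None
    record, local_id = parsed
    return record["uri_prefix"] + local_id


print(compress("http://identifiers.org/orphanet.ordo/558", [RECORD]))
# Orphanet:558
print(standardize_uri("http://bioregistry.io/ordo:558", [RECORD]))
# http://www.orpha.net/ORDO/Orphanet_558
```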
TLDR:
We need a way to contract and expand CURIEs / URIs to facilitate data integration in the OBO domain beyond our OBO PURLs.
I am proposing to host the OBO context we have been using in some of our software packages on OBOFoundry.github.io. The idea is that once per month, a GitHub action will pull updates to the context from Bioregistry and make a PR here in the repo with the proposed changes. A TWG member will then review the changes, flag controversial ones, and reflect them in the Bioregistry context. Finally, we promote the EPM hosted here on OBOFoundry.github.io as the universal source for prefix compression and expansion for the entire OBO community (not as a law, just as a "SHOULD" type of thing).
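To illustrate the review step, here is a hedged sketch of how a monthly job could summarise the difference between the EPM currently in the repo and a freshly pulled one, so that the reviewer sees at a glance what changed. Nothing like this exists yet as far as this proposal is concerned: the function names, the summary format, and the sample data (including the `example.org` URI) are assumptions for illustration only.

```python
def index_by_prefix(epm):
    """Index a list of EPM records by their preferred prefix."""
    return {record["prefix"]: record for record in epm}


def diff_epm(old, new):
    """Summarise added, removed and re-pointed prefixes between two EPM snapshots."""
    old_index, new_index = index_by_prefix(old), index_by_prefix(new)
    shared = old_index.keys() & new_index.keys()
    return {
        "added": sorted(new_index.keys() - old_index.keys()),
        "removed": sorted(old_index.keys() - new_index.keys()),
        # A changed canonical URI prefix is the kind of change a reviewer may want to flag.
        "uri_prefix_changed": sorted(
            prefix
            for prefix in shared
            if old_index[prefix]["uri_prefix"] != new_index[prefix]["uri_prefix"]
        ),
    }


# Hypothetical snapshots: the version in the repo and the version just pulled.
OLD = [
    {"prefix": "OMIM", "uri_prefix": "https://www.omim.org/entry/"},
    {"prefix": "Orphanet", "uri_prefix": "http://www.orpha.net/ORDO/Orphanet_"},
]
NEW = [
    {"prefix": "OMIM", "uri_prefix": "https://omim.org/MIM:"},
    {"prefix": "Orphanet", "uri_prefix": "http://www.orpha.net/ORDO/Orphanet_"},
    {"prefix": "MESH", "uri_prefix": "https://example.org/mesh/"},  # placeholder URI
]

print(diff_epm(OLD, NEW))
# {'added': ['MESH'], 'removed': [], 'uri_prefix_changed': ['OMIM']}
```

A summary like this could go into the PR description, leaving the full EPM diff for cases where the reviewer wants the detail.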
Let me know what you think!
Sounds great, this is what https://github.com/linkml/prefixmaps intended to do
The hard part here is deciding what the canonical PURL should be. Ten years ago we decided on identifiers.org with http and slashes; that didn't turn out well.
I think a lot of the decisions will be made de-facto by pyobo which is making ontology files for many of the databases we care about https://github.com/biopragmatics/pyobo/issues/333
We still have no mechanism for reconciling this with decisions made by e.g. PR for genes, GO/neo, etc, and issues get distributed confusingly across many repos....
Minor note:
> For these kinds of reasons (making integration easier), it makes sense to try and unify the use of CURIE prefixes ("when you provide an omim identifier, you use the prefix OMIM, not mim, not omim, not MIM")
OMIM has clearly stated that their prefix should be MIM. I think we should modify the documents to reflect this. If you use MIM:#### then it doesn't matter whether the record is a disease/gene or a phenotypic series: the URL https://omim.org/MIM:#### will work. If you use OMIM:#### then you need to differentiate between OMIM:#### and OMIM:PS####, as the URLs have to be different for each (https://www.omim.org/phenotypicSeries/PS168600 vs https://www.omim.org/entry/600116), or strip off the O before making the URL.
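The difference described above can be sketched in a few lines of Python. The function name `omim_url` is hypothetical, and the URL forms are taken directly from this comment rather than checked against OMIM's documentation.

```python
def omim_url(curie: str) -> str:
    """Build an OMIM URL from a MIM: or OMIM: CURIE, following the URL forms above."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix == "MIM":
        # One URL form covers entries and phenotypic series alike.
        return f"https://omim.org/MIM:{local_id}"
    if prefix == "OMIM":
        # With the OMIM prefix, the record type has to be inferred from the local ID.
        if local_id.startswith("PS"):
            return f"https://www.omim.org/phenotypicSeries/{local_id}"
        return f"https://www.omim.org/entry/{local_id}"
    raise ValueError(f"unsupported prefix: {prefix!r}")


print(omim_url("MIM:PS168600"))   # https://omim.org/MIM:PS168600
print(omim_url("OMIM:PS168600"))  # https://www.omim.org/phenotypicSeries/PS168600
print(omim_url("OMIM:600116"))    # https://www.omim.org/entry/600116
```

The extra branch in the OMIM case is exactly the special-casing that a single URI prefix per CURIE prefix cannot express.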
> Sounds great, this is what https://github.com/linkml/prefixmaps intended to do
I think they can exist side by side, as the coverage of prefixmaps is a little less than the entirety of Bioregistry. Are you saying this because you would like to propose LinkML prefixmaps as the source of truth (SOT) for the prefix map?
> OMIM has clearly stated that their prefix should be MIM.
Yeah, sure. There will be disagreements on various prefixes. I would not make resolving them a precondition for the decision to create an initial proposal of prefixes; we can deal with the disagreements separately. (I am OK with changing the OMIM one, of course, but I maintain that it should be dealt with after we agree as a foundry to maintain some sort of list we can debate over.)
Thanks both!
Like Nico mentioned, I am highly motivated to make sure the Bioregistry provides a technical stack for creating and maintaining an OBO context. If there are any new parts to the "context" curation/export workflow that are needed to satisfy the OBO Foundry, I will make sure these get priority.
Chris mentioned:
> I think a lot of the decisions will be made de-facto by pyobo which is making ontology files for many of the databases we care about https://github.com/biopragmatics/pyobo/issues/333
This is not accurate (anymore). All PyOBO exports are now 100% aligned with the Bioregistry, so we can fully center all discussions about what the PURL expansions should be there, and not in the PyOBO repository.
> OMIM has clearly stated that their prefix should be MIM.
WRT specific discussions like MIM/OMIM prefix, there are two ways to address this:
- Create a global bioregistry change (requires more discussion)
- Make whichever one the OBO community decides on part of the OBO context. Ideally, this isn't needed except in special circumstances
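The second option can be pictured as a thin layer of community preferences applied on top of the global registry. The sketch below is a simplification for illustration and not how the Bioregistry implements contexts: `apply_context`, the two dictionaries, and the PubMed URI prefix are all assumptions.

```python
# Hypothetical global registry: canonical registry prefix -> preferred URI prefix.
GLOBAL_REGISTRY = {
    "pubmed": "https://pubmed.ncbi.nlm.nih.gov/",  # assumed for illustration
    "omim": "https://omim.org/MIM:",
}

# Hypothetical community context: only the prefixes the community wants spelled differently.
OBO_PREFIX_REMAPPING = {"pubmed": "PMID", "omim": "OMIM"}


def apply_context(registry, prefix_remapping):
    """Return a prefix map in which context-specific prefixes replace the global ones."""
    prefix_map = {}
    for prefix, uri_prefix in registry.items():
        preferred = prefix_remapping.get(prefix, prefix)
        if preferred in prefix_map:
            raise ValueError(f"duplicate prefix after remapping: {preferred!r}")
        prefix_map[preferred] = uri_prefix
    return prefix_map


print(apply_context(GLOBAL_REGISTRY, OBO_PREFIX_REMAPPING))
# {'PMID': 'https://pubmed.ncbi.nlm.nih.gov/', 'OMIM': 'https://omim.org/MIM:'}
```

Under this picture, switching the community preference from OMIM to MIM would be a one-line change in the context, with no change to the global registry.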
There already was an OMIM discussion, but it was very difficult to moderate, so I wasn't personally able to follow up with any changes. If someone wants to start that discussion again, I will support it.
Due to the lack of response, we have now decided to include the current EPM as an unofficial SOT in the next version of ODK. Hopefully we can handle all prefix mismatches across ontologies even if this remains an OBO community project rather than an OBO Foundry project.