OBOFoundry.github.io

Check for IDSPACE conflicts on new ontology submission

Open jamesaoverton opened this issue 3 years ago • 24 comments

#1703 indicates a larger problem with namespace conflicts.

New projects should be encouraged (required?) to check for conflicts. Currently, https://bioregistry.io/ is the easiest and most effective place to look. The simplest fix would be to update our instructions. Better would be an automated check, maybe in https://github.com/OBOFoundry/obo-nor.github.io.

Our current documentation https://obofoundry.org/id-policy.html#allocating-idspaces points to http://identifiers.org/, but it does not include "EPSO" and would not have helped in this case.

jamesaoverton avatar Dec 16 '21 13:12 jamesaoverton

Thanks for writing this up, James. As we've noted on https://github.com/biopragmatics/bioregistry/issues/273, the Bioregistry does not and likely never will consume the full BioPortal, so we should also consider whether the OBO Foundry wants to fully respect the prefixes minted in BioPortal (e.g., there can be, and in some cases already are, nonsensical overlaps/conflicts with high-quality resources in the OBO Foundry).

cthoyt avatar Dec 16 '21 14:12 cthoyt

@cthoyt Do you have a list of conflicts between BioPortal and OBO by any chance?

matentzn avatar Dec 16 '21 14:12 matentzn

Cross post of https://github.com/biopragmatics/bioregistry/issues/273#issuecomment-995894865:

The conflicts that I've curated manually are all in this file https://github.com/biopragmatics/bioregistry/blob/main/src/bioregistry/data/mismatch.json

The curation sheet for BioPortal (which represents all unaligned prefixes) is https://github.com/biopragmatics/bioregistry/blob/main/src/bioregistry/data/external/bioportal/curation.tsv. Because the BioPortal API does not give direct access to most of the metadata for each entry, the only things represented in this sheet are, unfortunately, the BioPortal prefix and BioPortal name. This makes untargeted curation awfully tricky and time-consuming.

cthoyt avatar Dec 16 '21 14:12 cthoyt

@nataled for now we need to document clearly that a new request must not conflict with anything in Bioregistry or BioPortal. I can

  • [ ] Modify the new request template with a request to manually check Bioregistry & BioPortal

before we:

  • [ ] Implement automated checks as part of the dashboard

matentzn avatar Dec 16 '21 15:12 matentzn

https://obofoundry.org/id-policy.html#allocating-idspaces currently says to check identifiers.org. Please confirm that this should be replaced with Bioregistry and BioPortal. Alternatively, the latter two can be added alongside it.

nataled avatar Dec 16 '21 16:12 nataled

See previous discussion on #1519

cthoyt avatar Dec 16 '21 20:12 cthoyt

Clear as mud! ;)

So... identifiers.org, bioregistry.io, bioportal.org, obofoundry.org, n2t.net ...and...? (I saw several others mentioned, without URLs, like PrefixCommons and BioContext)

I'm looking for a definitive list of resources (names and URLs), specifically for the lists themselves (not some upper-level landing page). In other words, the user should be able to go to the link we provide and see the list of prefixes. Failing that, some page that provides a search function.

nataled avatar Dec 16 '21 20:12 nataled

The Bioregistry imports Identifiers.org, OBO Foundry, and N2T as well as many other resources (see here for a full list), so it can be a one-stop shop for most resources. However, it does not import all of BioPortal, so users should check there too.

Web Access

  • Landing page with list of all prefixes in Bioregistry: https://bioregistry.io/registry/
  • Landing page with list of all prefixes in Bioportal: https://bioportal.bioontology.org/ontologies
  • Home page with search function for Bioregistry: https://bioregistry.io
  • Home page with search function for Bioportal: https://bioportal.bioontology.org/
  • Search API for Bioregistry: https://bioregistry.io/api/search?q=<query goes here>
  • Search API for Bioportal: not sure if this exists
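A minimal sketch of calling the Bioregistry search endpoint listed above, using only the Python standard library. The helper names `build_search_url` and `search_bioregistry` are hypothetical, and the structure of the JSON response is not described in this thread, so the sketch returns it undecoded beyond JSON parsing.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as quoted in the list above.
BIOREGISTRY_SEARCH = "https://bioregistry.io/api/search"


def build_search_url(query: str) -> str:
    """Return the search URL for a candidate prefix, with the query URL-encoded."""
    return f"{BIOREGISTRY_SEARCH}?{urlencode({'q': query})}"


def search_bioregistry(query: str, timeout: float = 10.0):
    """Fetch and JSON-decode the search response (requires network access).

    The response shape is an unknown here, so it is handed back as-is
    for the caller to inspect.
    """
    with urlopen(build_search_url(query), timeout=timeout) as response:
        return json.load(response)


print(build_search_url("EPSO"))
```

Building the URL with `urlencode` keeps queries containing spaces or special characters valid.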

Data Dumps

Bioregistry also has several full dumps for potential contributors who want to access this information programmatically. These are all updated on a nightly basis.

Bioportal doesn't offer any first-party data dumps, but the Bioregistry generates one nightly at https://github.com/biopragmatics/bioregistry/blob/main/src/bioregistry/data/external/bioportal/raw.json

Programmatic Access

Programmatic way to check if something is in the Bioregistry:

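import bioregistry

query = "EPSO"
# normalize_prefix() returns None when the prefix is not registered,
# so None means the prefix is still available.
available_in_bioregistry = bioregistry.normalize_prefix(query) is None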

Programmatic way to check if something is in BioPortal:

from bioregistry.external.bioportal import get_bioportal

query = "EPSO"
bioportal_dict = get_bioportal()
# The membership test checks the dictionary's keys (BioPortal prefixes).
available_in_bioportal = query not in bioportal_dict
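The two checks can be folded into one helper that reports which registries already hold a candidate prefix. This is only a sketch: `find_conflicts` is a hypothetical name, the case-insensitive comparison is an assumption not stated in this thread, and the registry contents below are toy data rather than real prefix lists.

```python
from typing import Dict, Iterable, List


def find_conflicts(candidate: str, registries: Dict[str, Iterable[str]]) -> List[str]:
    """Return the names of the registries that already contain `candidate`.

    Comparison is case-insensitive (an assumption made for this sketch).
    """
    wanted = candidate.casefold()
    return [
        name
        for name, prefixes in registries.items()
        if any(prefix.casefold() == wanted for prefix in prefixes)
    ]


# Toy data for illustration only; real prefix lists would come from
# bioregistry and get_bioportal() as in the snippets above.
toy_registries = {
    "registry_a": ["go", "chebi"],
    "registry_b": ["EXAMPLE", "GO"],
}

print(find_conflicts("Example", toy_registries))
print(find_conflicts("newprefix", toy_registries))
```

An empty list means no conflict was found in any of the registries passed in.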

cthoyt avatar Dec 16 '21 20:12 cthoyt

Perfect, thanks!

nataled avatar Dec 16 '21 20:12 nataled

@nataled I updated my comment above with more information that might be more actionable. Feel free to reuse part or all of it

cthoyt avatar Dec 16 '21 21:12 cthoyt

I have updated the documentation (which is outside the scope of this ticket). Please see this page: https://obofoundry.org/id-policy.html and look for the section Allocating IDSPACEs and the subsection Guidelines for selecting an IDSPACE

nataled avatar Dec 16 '21 21:12 nataled

@matentzn note that my changes should satisfy your first checkbox regarding updating the instructions (the template itself already points to the document I just revised; I see no need to add text to the template itself since that will just duplicate the information).

nataled avatar Dec 16 '21 21:12 nataled

This all looks great. I made an issue at the OBO dashboard to implement @cthoyt's checker: https://github.com/OBOFoundry/OBO-Dashboard/issues/59

Thank you both for dealing with this! Are there any remaining action items here?

matentzn avatar Dec 17 '21 10:12 matentzn

Looks like all aspects of this have either been taken care of, or have a ticket to do so.

nataled avatar Dec 17 '21 13:12 nataled

Great! Thanks, everyone, for your input!

matentzn avatar Dec 17 '21 13:12 matentzn

The only concern I have is with the 'strength' of this requirement, and its scope. Strength refers to whether the dashboard reports a clash as ERROR, WARN, or INFO. I'm certain that a clash with another Foundry ontology would be an ERROR, for example, but not so sure about clashes with non-ontology resources that might be little-known projects. Scope refers to whether or not the ontology needs to be concerned with obsolete resources. I'm not sure these aspects have been discussed.

nataled avatar Dec 17 '21 13:12 nataled

I am happy to publicise this widely, but I think BioPortal and Bioregistry clashes at the very least MUST be avoided moving forward; we owe this to open science. I am happy to leave this ticket open, but I would say that if we don't get any seriously strong argument for permitting namespace clashes with existing resources, used or otherwise, this will be an ERROR. What about this: if we don't see counterarguments on this issue by Friday 24th December, the BioPortal/Bioregistry clash rule goes into OBO Law.
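To make the 'strength' question concrete, here is a rough sketch of how a clash could be mapped to a dashboard level. The source labels and the `clash_level` function are hypothetical, and the WARN fallback is only a placeholder for the cases that are still undecided (little-known or obsolete resources); it is not an agreed rule.

```python
from enum import Enum


class Level(Enum):
    ERROR = "ERROR"
    WARN = "WARN"
    INFO = "INFO"


# Sources whose clashes would be ERROR under the proposal in this comment.
# The lowercase labels are hypothetical names chosen for this sketch.
ERROR_SOURCES = {"obofoundry", "bioportal", "bioregistry"}


def clash_level(source: str, default: Level = Level.WARN) -> Level:
    """Map the registry in which a clash was found to a dashboard level.

    Anything outside ERROR_SOURCES falls back to `default`, which stands in
    for the cases that have not been decided yet.
    """
    return Level.ERROR if source.lower() in ERROR_SOURCES else default


print(clash_level("BioPortal").value)
print(clash_level("some-small-project").value)
```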

matentzn avatar Dec 17 '21 13:12 matentzn

It's basically written that way now, at least by interpretation. I'm not objecting or wavering, really, but I don't recall any discussion of nuances like those I mentioned. Perhaps an Ops call agenda item?

nataled avatar Dec 17 '21 14:12 nataled

Ok.

Remaining action item:

  • [ ] Operations committee to sign off on mandatory non-clash rule with BioPortal and Bioregistry (ERROR status in dashboard).

matentzn avatar Dec 17 '21 14:12 matentzn

@matentzn I'd also propose that this should require a technical check that fails on a PR with problematic content, since it's always possible that people miss what's in the dashboard.
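Such a check could be as small as a function that turns clashes into a non-zero exit code, which is what makes a CI job fail. This is a sketch under assumptions: `check_pr` is a hypothetical name, the set of taken prefixes is passed in rather than loaded from any registry, and matching is case-insensitive by choice.

```python
import sys
from typing import Iterable, Set


def check_pr(new_prefixes: Iterable[str], taken: Set[str]) -> int:
    """Return a process exit code: 1 if any new prefix is already taken, else 0."""
    taken_folded = {prefix.casefold() for prefix in taken}
    clashes = sorted(p for p in new_prefixes if p.casefold() in taken_folded)
    for prefix in clashes:
        # Report each clash on stderr so it shows up in the CI log.
        print(f"ERROR: IDSPACE '{prefix}' is already in use", file=sys.stderr)
    return 1 if clashes else 0


# Toy example: "EXAMPLE" clashes with "example" under case-insensitive matching.
exit_code = check_pr(["EXAMPLE"], taken={"example", "other"})
print(exit_code)
```

In a CI script, the returned value would be passed to `sys.exit()` so that the job fails when a clash is found.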

cthoyt avatar Dec 17 '21 14:12 cthoyt

This is not just for this case here - I think I have a better idea for that which does not require a check. Basically, in order to pass the dashboard, the whole config must be present. Since it's already there, we should just be able to use it instead of having ontology submitters use their own. An even better idea: we require the pull request with the metadata right from the start, even before permission. Then the dashboard can just pull that, which will totally automate the OBO NOR dashboard with no need for me to intervene anymore.

matentzn avatar Dec 17 '21 14:12 matentzn

Remaining action item:

  • [ ] Operations committee to sign off on mandatory non-clash rule with BioPortal and Bioregistry (ERROR status in dashboard).

We should add this to the next OBO Ops call agenda, which will be chaired by @nicolevasilevsky.

nlharris avatar Jan 26 '22 22:01 nlharris

Given that we have put this in our ID Policy here https://obofoundry.org/id-policy.html, and our NTR issue here, I don't think that it needs to be put in front of OBO Ops again. It has been decided. @nataled should check that this is documented accurately enough; I think it is good enough, but some stronger wording could help.

So the only remaining item here is

  • [ ] Make an OBO dashboard ticket that checks no-clash rule.

matentzn avatar Jan 27 '22 18:01 matentzn