Add `github_date_added` field to ontology metadata

Open cthoyt opened this issue 2 years ago • 7 comments

Closes #1967

What: This PR crawls the git history from this repo to figure out when each ontology metadata file was created, then annotates it into the frontmatter of the ontology metadata. This is applicable to inactive, active, orphaned, and obsolete ontologies. This information is purely for technical purposes, and not meant to be displayed on the website. This PR also adds the corresponding field to the metadata JSON schema for validation purposes.
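As a minimal sketch of the annotation step described above (not the actual `util/add_dates.py`), the function below inserts a date line into the YAML frontmatter of a metadata file using plain string handling. The function name `add_date_to_frontmatter`, the quoting of the date, and the placement of the field at the end of the frontmatter are all assumptions made for illustration.

```python
def add_date_to_frontmatter(text: str, date_added: str,
                            key: str = "github_date_added") -> str:
    """Insert `key: 'date'` as the last entry of a YAML frontmatter block.

    Hypothetical helper: assumes the file starts with a `---` line and that
    the frontmatter is closed by a second `---` line.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        raise ValueError("no YAML frontmatter found")
    # Locate the closing delimiter of the frontmatter.
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            end = index
            break
    else:
        raise ValueError("unterminated frontmatter")
    # Leave the file untouched if the field is already present.
    if any(line.startswith(f"{key}:") for line in lines[1:end]):
        return text
    # Quote the value so YAML loads it as a string rather than a date.
    lines.insert(end, f"{key}: '{date_added}'\n")
    return "".join(lines)
```

Because the helper returns the text unchanged when the field already exists, running it twice over the same file should not produce a second diff.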

Why: To enforce stricter standards on new ontologies, it makes sense to have a way to avoid retroactively applying them to old ontologies (which might not be able to update in a timely manner before the new standards are imposed). Therefore, each new OBO Foundry standard can optionally be tagged with the date on which it becomes active, and ontologies added before that date don't necessarily have to conform.

How

  • [x] Add script that can populate the data (see https://github.com/OBOFoundry/OBOFoundry.github.io/blob/63cdd1293606248e62e1cea755168aae2e2a34eb/util/add_dates.py)
  • [x] Update the metadata schema (see https://github.com/cthoyt/OBOFoundry.github.io/blob/63cdd1293606248e62e1cea755168aae2e2a34eb/util/schema/registry_schema.json#L21-L26)
  • [x] Pull the trigger (will create a big diff for this PR)
  • [x] Make explicitly clear that this is not the same thing as the date ontologies were accepted into OBO Foundry (see https://github.com/cthoyt/OBOFoundry.github.io/blob/7fdface2c60757ee680f63264adb35aaff980df5/util/schema/registry_schema.json#L25)

Note: for some reason, this script failed to add the metadata for a few ontologies, so I added it manually for these:

ERROR	AISM added: 'added' is a required property
ERROR	APOLLO_SV added: 'added' is a required property
ERROR	EPIO added: 'added' is a required property
ERROR	FIDEO added: 'added' is a required property
ERROR	NCIT added: 'added' is a required property
ERROR	OMO added: 'added' is a required property
ERROR	OOSTT added: 'added' is a required property
ERROR	XLMOD added: 'added' is a required property

cthoyt avatar Jun 17 '22 15:06 cthoyt

Tentative merge date: Friday 24th June.

matentzn avatar Jun 17 '22 15:06 matentzn

If anyone wants to do some archaeology, I am attaching @selewis's email with all the ontologies in the original OBO, sent March 2003.

gobo.txt

Or if anyone can figure out how to get CVS archives from sourceforge we could mine that too.

cmungall avatar Jun 21 '22 14:06 cmungall

> If anyone wants to do some archaeology, I am attaching @selewis's email with all the ontologies in the original OBO, sent March 2003.
>
> gobo.txt
>
> Or if anyone can figure out how to get CVS archives from sourceforge we could mine that too.

I looked at this for 60 seconds and I was properly scared

cthoyt avatar Jun 21 '22 15:06 cthoyt

To make it even more scary, you could look at https://web.archive.org/web/*/obofoundry.org

matentzn avatar Jun 21 '22 15:06 matentzn

From my understanding, the two requests for changes were either to

  1. Manually figure out the dates that the pre-GitHub OBO Foundry ontologies were added/accepted into OBO as well as resolve all file renames to the added date of their original names
  2. Leave the `added` field either blank or set to some sentinel value that isn't a date for ontologies added on 2015-07-28 (the date of creation of this GitHub repo)

The problem with Option 1 is that the file Chris attached in https://github.com/OBOFoundry/OBOFoundry.github.io/pull/1969#issuecomment-1161836551 is very difficult to read through. After trying unsuccessfully to download the OBO Sourceforge with SVN, I determined that this option was too much effort. The problem with Option 2 is that making this field optional means that there is no way to test its integrity: new ontologies would simply omit the field, and we would be no further along than before.

What I explicitly want to capture is the date on which the file was created on GitHub. This is a good enough proxy for date added to OBO Foundry for the purposes of enabling us to apply potential new metadata standards based on the date the ontology is added, i.e., exert stricter standards on new ontologies while not forcing old ones to make updates.
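The date-based gating described here could look something like the following sketch. It assumes dates are stored as ISO `YYYY-MM-DD` strings; the function name `standard_applies` and the idea of a per-standard "active from" date string are hypothetical and not part of the repository.

```python
from datetime import date


def standard_applies(github_date_added: str, standard_active_from: str) -> bool:
    """Return True if an ontology added on `github_date_added` must conform
    to a standard that became active on `standard_active_from`.

    Both arguments are assumed to be ISO-formatted dates (YYYY-MM-DD).
    """
    added = date.fromisoformat(github_date_added)
    active_from = date.fromisoformat(standard_active_from)
    # Ontologies added before the standard went active are exempt.
    return added >= active_from
```

For example, an ontology added on 2015-07-28 would be exempt from a standard that went active on 2022-01-01, while one added on 2022-06-17 would have to conform.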

Rather than making a technical solution based on options 1 and 2, I opted to update the way this improvement is communicated. In https://github.com/OBOFoundry/OBOFoundry.github.io/pull/1969/commits/9ee81798869e23863316c805b2329ebe696dc291, I renamed the field `added` to `github_date_added` and further updated the entry in the metadata schema to explain that this isn't the same thing as the date added to the OBO Foundry. Therefore, this field explicitly reflects when the file was added to GitHub, and not necessarily the date added to OBO Foundry. This is good enough for what I want to be able to accomplish, and is canonically correct.

Nico mentioned that this could be calculated on-the-fly using the same git command that I encoded in https://github.com/cthoyt/OBOFoundry.github.io/blob/7fdface2c60757ee680f63264adb35aaff980df5/util/add_dates.py#L20, but this would only work if the data from the repository is in a git context (e.g., what if we want to consume the data directly, or what if it gets put into a Python package?), so I think that having this explicit is still important.
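For illustration, an on-the-fly lookup might be sketched as below. This is an assumption about how such a lookup could work, not necessarily the command used in `util/add_dates.py`: it asks `git log` for commits that added the file (`--diff-filter=A`), following renames (`--follow`), and takes the oldest one. The helper names are hypothetical.

```python
import subprocess
from typing import Optional


def parse_first_added(log_output: str) -> Optional[str]:
    """Pick the oldest date from `git log` output, which lists newest first."""
    dates = [line.strip() for line in log_output.splitlines() if line.strip()]
    return dates[-1] if dates else None


def lookup_date_added(path: str, repo: str = ".") -> Optional[str]:
    """Return the YYYY-MM-DD date on which `path` was first added, or None.

    Only works inside a git checkout with full history, which is exactly
    the limitation discussed above.
    """
    result = subprocess.run(
        ["git", "log", "--diff-filter=A", "--follow",
         "--format=%ad", "--date=short", "--", path],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return parse_first_added(result.stdout)
```

Outside a git checkout the `git log` call fails, and for untracked files or shallow clones the result can be empty or wrong, which is the argument for storing the value explicitly in the metadata.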

cthoyt avatar Jun 22 '22 09:06 cthoyt

As long as the documentation goes to great pains to make this clear, so no one is out there assuming it means date ontology added (as opposed to the date some file was added to some repo), then I suppose I am ok with that.

On Wed, Jun 22, 2022 at 5:49 AM Nico Matentzoglu wrote:

I am fine with this, but let's make sure we get at least one or two others to chime in.

Remember all: this information is purely for technical purposes, not for display on the website.

hoganwr avatar Jun 22 '22 12:06 hoganwr

Damion: Being part of OBO involves some work over the years. I like an annotation that indicates when their ontology was added to the OBO Library.
Use an empty value OR a bogus start date, e.g. 1000-01-01 (whatever is best for development). Will make an issue vote.

Allowance for ontologies that can't use GitHub (but Charlie's aim: when was the metadata record added to the OBO Library).

ddooley avatar Jul 12 '22 16:07 ddooley

Want to salvage anything from this PR @cthoyt ?

matentzn avatar Nov 15 '22 14:11 matentzn

> Want to salvage anything from this PR @cthoyt ?

Despite getting #2146 merged in, the only way forward is to have a fully complete field across all ontologies that says when they were added, so that we can progressively add stricter standards for newer ontologies. So the idea in this PR isn't done yet.

cthoyt avatar Nov 15 '22 15:11 cthoyt

As of #2277, there's a simpler way of adding new checks that only apply to new ontologies going forward, so I'm abandoning this PR.

cthoyt avatar Jan 29 '23 21:01 cthoyt