famplex icon indicating copy to clipboard operation
famplex copied to clipboard

Add identifiers to all FamPlex entries

Open cthoyt opened this issue 6 years ago • 2 comments

Closes #86

This PR adds a script, famplex/reorganize_old_resources.py that migrates the old resource TSVs into the famplex/resources/ folder. It moves the code in the exports/ folder to famplex/export/ and refactors code necessary to cope with the new format changes.

Identifiers:

  • Identifiers will match \d{6}
  • With the addition of identifiers, it now no longer makes sense to maintain the entities.csv in a label-sorted order, but rather to use the identifiers to sort.

Format changes:

  • ~All CSV files were switched to TSV. I'm not dogmatic about this, but it makes it much easier to deal with funny characters~ After reading the README one more time, I think it's better just to stick with CSV A descriptions.csv file was introduced for now, but it will be jointly upgraded with entities.csv to famplex/resources/entities.tsv which has the identifier, name, description, and references.
  • grounding_map.csv has been upgraded to famplex/resources/grounding_map.tsv and will no longer be a wide CSV, but a tall and skinny TSV with four columns: text, database, database identifier, and name. This means one text reference may appear on many lines.
  • equivalence.csv has been upgraded to famplex/resources/equivalence.tsv and now has both labels and identifiers.
  • relations.csv has been upgraded to famplex/resources/relations.tsv and now has both identifiers and labels for subject and object

Features:

  • The check_references.py script now checks both that the HGNC identifier is valid and the associated names are up to date using indra.databases.hgnc_client
  • The check_references.py script now checks for text in the grounding map that has multiple grounds to different entities in the same database. After some investigation, I suppose this sort of makes sense because one textual reference could be disambiguated to multiple entities. @steppi and @bgyori you should take a look at this, and consider how the grounding map in this repository fits into your current entity disambiguation work, and whether it even makes sense to store this kind of information here (in the famplex repo) anymore. Here's the current (short) output:
WARNING "B1" has multiple UP groundings: UP:O43157 ! PLXB1_HUMAN, UP:P14635 ! CCNB1_HUMAN
WARNING "cell motility" has multiple GO groundings: GO:GO:0016477 ! cell migration, GO:GO:0048870 ! cell motility
WARNING "MA" has multiple PUBCHEM groundings: PUBCHEM:10836 ! , PUBCHEM:73659 ! 
WARNING "MA" has multiple CHEBI groundings: CHEBI:CHEBI:6809 ! methamphetamine, CHEBI:CHEBI:66682 ! maslinic acid
WARNING "OA" has multiple CHEBI groundings: CHEBI:CHEBI:37659 ! oleanolic acid, CHEBI:CHEBI:44658 ! okadaic acid
WARNING "OA" has multiple PUBCHEM groundings: PUBCHEM:10494 ! , PUBCHEM:446512 ! 
WARNING "p42" has multiple UP groundings: UP:P28482 ! MK01_HUMAN, UP:Q9UQ80 ! PA2G4_HUMAN

Exporter changes:

  • BEL Namespace: none
  • Relations graph: now shows both identifier and label for each entity
  • OBO: fix major problem with OBO export related to locally defined relationships (they need a [Typedef] stanza). Added identifiers where possible and used OBO syntax to keep showing names when available. Added description and references for each entity when available. Renamed root entity to "famplex".

To Do and Questions:

  • @bgyori is export/hgnc_ids.py still necessary?
  • Change all names of DBs that don't use Identifiers.org to use identifiers.org. This might be an issue for INDRA, so we need to discuss.
  • Update importer scripts. Which ones are still relevant?

cthoyt avatar Jul 21 '19 12:07 cthoyt

Wow, that's a serious PR! I like a lot of things about this, a couple of points to discuss:

  • Several downstream dependencies pull FamPlex content from GitHub via URLs and the expected location of files. These would break if the resource files are moved into another folder so we'd need to track down all these places and update them.
  • Obviously the same is true for the breaking changes in the structure of each of the resource files - lots of downstream code, potentially by users we don't know about would need to be examined and updated.
  • While perhaps necessary for making it a proper Python package (though wouldn't a setup.py be able to move files from the top level to famplex/resources upon install?) it is nice to have the main, human-editable resource files at the top level of the repo so it's very easy to e.g, edit it manually on Github.
  • By changing to IDs, we would lose the human-readability of entries in the namespace
  • I assume that it's not recommended to redefine the existing identifiers.org entry for fplx (https://registry.identifiers.org/registry/fplx) which uses standard names. I wonder if we go with using numeric IDs, whether the right course of action is to register another name space like fplx.id in addition to the existing one.
  • I wouldn't want to flatten the grounding_map's structure because the point of having multiple entries per row is that each row corresponds to one "sense" of an entity. With the flattened structure it isn't possible to represent ambiguity among multiple senses.
  • The OBO export was designed (through some trial&error) to be compatible with TRIPS and BioPortal simultaneously (https://bioportal.bioontology.org/ontologies/FPLX). We'd have to make sure that the new format still works with these (I'm almost sure the changes won't work out of the box with TRIPS so again that would require some adaptation).

bgyori avatar Jul 30 '19 03:07 bgyori

Wow, that's a serious PR! I like a lot of things about this, a couple of points to discuss:

  • Several downstream dependencies pull FamPlex content from GitHub via URLs and the expected location of files. These would break if the resource files are moved into another folder so we'd need to track down all these places and update them.

I agree this is a problem, it wouldn't be a bad idea to do an audit on what's using FamPlex and where. For example, it might make sense to import a FamPlex package that already is a "famplex_client" for use in INDRA and Gilda.

  • Obviously the same is true for the breaking changes in the structure of each of the resource files - lots of downstream code, potentially by users we don't know about would need to be examined and updated.

Martin always joked that a good way to determine how much a web service is used is to to turn it off and see how long it takes for someone to email and complain. Maybe that strategy would work here?

We could make a tag/release before making any big changes so users could refer to the repository by that hash in the mean time.

  • While perhaps necessary for making it a proper Python package (though wouldn't a setup.py be able to move files from the top level to famplex/resources upon install?) it is nice to have the main, human-editable resource files at the top level of the repo so it's very easy to e.g, edit it manually on Github.

I've spent an inordinate amount of time (several sessions spanning several hours of digging through the rabbit hole that is setuputils, distutils, etc.) on this and never come up with a solution. I would love to solve the problem how you propose! Maybe you have some ideas, or at least have a fresh pair of eyes.

  • By changing to IDs, we would lose the human-readability of entries in the namespace
  • I assume that it's not recommended to redefine the existing identifiers.org entry for fplx (https://registry.identifiers.org/registry/fplx) which uses standard names. I wonder if we go with using numeric IDs, whether the right course of action is to register another name space like fplx.id in addition to the existing one.

Good point. Again, I wonder how many users are currently depending on resolving through identifiers. Maybe we could get away with switching the current entry to something different

  • I wouldn't want to flatten the grounding_map's structure because the point of having multiple entries per row is that each row corresponds to one "sense" of an entity. With the flattened structure it isn't possible to represent ambiguity among multiple senses.

Okay! Will revert that.

  • The OBO export was designed (through some trial&error) to be compatible with TRIPS and BioPortal simultaneously (https://bioportal.bioontology.org/ontologies/FPLX). We'd have to make sure that the new format still works with these (I'm almost sure the changes won't work out of the box with TRIPS so again that would require some adaptation).

Sounds like this will be an ongoing issue... Currently the OBO isn't compatible with OLS.

Overall, do you think we should split this PR into a couple smaller ones? Which ones do you @bgyori @johnbachman think are the lowest hanging fruits that we can address first?

cthoyt avatar Aug 10 '19 11:08 cthoyt