Provenance for some CbG relationships is lost during condense() operation

Open cthoyt opened this issue 4 years ago • 3 comments

I'm going through metadata in the compound-binds-gene relationships, and taking a specific look at the actions lists. In many examples, there are several actions, such as with drugbank:DB00502 binds ncbigene:1813. In the JSON GZ export, there are two actions listed: ['antagonist', 'inverse agonist']. I made a query to the Neo4j instance to confirm this is also true there:

MATCH p=(s:Compound)-[r:BINDS_CbG]->(t:Gene)
WHERE s.identifier = 'DB00502' and t.identifier = 1813
RETURN p
LIMIT 25

However, on DrugBank I could only find the antagonist label. Is it the case that the DrugBank source data that gets parsed and converted in Hetionet contains extra information that doesn't make it to the web page I linked? If so, do you have any idea on how they pick which of many gets displayed?

cthoyt avatar Feb 10 '20 17:02 cthoyt

Is it the case that the DrugBank source data that gets parsed and converted in Hetionet contains extra information that doesn't make it to the web page I linked?

Hetionet uses DrugBank version 4.2 as processed in dhimmel/drugbank. In the past when https://www.drugbank.ca was displaying data version 4.2, I think what you would see there would be the same as what we extract from the corresponding drugbank.xml.

In the case of Haloperidol-binds-DRD2, I think these actions are coming from ChEMBL not DrugBank. Notice the following edge property:

  • sources: ChEMBL,DrugBank (target),DrugCentral (ChEMBL)
  • urls: https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL1200986, https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL54

If you go to either of the ChEMBL URLs above, you'll see a table that does contain "inverse agonist".

For reference, we combined multiple sources of Compound-binds-Gene relationships in the CbG-binding.ipynb notebook.

dhimmel avatar Feb 14 '20 17:02 dhimmel

Okay, so I will interpret an edge with many actions as all of them being separately true, even if there are some conflicts. I was looking through the Jupyter notebook, and it seems that the actions and sources lists are generated by the following code block:

def condense(df):
    """Combine gene-compound relationships"""
    row = pandas.Series()
    row['sources'] = set(itertools.chain.from_iterable(df.sources))
    row['pubmed_ids'] = set(itertools.chain.from_iterable(df.pubmed_ids))
    row['actions'] = set(itertools.chain.from_iterable(df.actions))
    row['affinity_nM'] = df.affinity_nM.mean(skipna=True)
    row['license'] = get_license(row['sources'])
    row['urls'] = set(itertools.chain.from_iterable(df.urls))
    return row

so the information about which action comes from which source is not maintained. As far as I know, the neo4j schema is a bit limiting to having JSON/dictionary objects as the values, but it would be nice to be able to figure out from the final data what the provenance for each relationship was. Maybe a data structure that would be appropriate would be parallel lists, at the cost of being a bit repetitive.
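To make the parallel-lists idea concrete, here is a minimal sketch. It is not the notebook's code: `condense_with_provenance`, the `action_sources` property name, and the plain list-of-dicts input are all hypothetical, chosen to keep the example dependency-free rather than to mirror the pandas groupby used in `condense()`.

```python
def condense_with_provenance(records):
    """Merge per-source rows for one compound-gene pair into parallel lists.

    `records` is a hypothetical list of dicts, one per source row, each with
    'sources' and 'actions' lists (the same columns condense() reads).
    """
    pairs = set()
    for record in records:
        for source in record["sources"]:
            for action in record["actions"]:
                pairs.add((action, source))
    # Sort so the two lists are deterministic and stay aligned by index.
    ordered = sorted(pairs)
    return {
        "actions": [action for action, _ in ordered],
        "action_sources": [source for _, source in ordered],
    }


# Illustrative rows only; not taken from the real Hetionet build.
rows = [
    {"sources": ["DrugBank (target)"], "actions": ["antagonist"]},
    {"sources": ["ChEMBL"], "actions": ["antagonist", "inverse agonist"]},
]
edge = condense_with_provenance(rows)
```

Here `edge["actions"][i]` is asserted by `edge["action_sources"][i]`, so an action reported by two sources appears twice. That is the repetition mentioned above, but both values are flat lists of strings, which Neo4j can store as array properties.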

cthoyt avatar Feb 23 '20 11:02 cthoyt

the neo4j schema is a bit limiting to having JSON/dictionary objects as the values

Referencing https://stackoverflow.com/a/38026494/4651668. I don't remember whether this limitation influenced how I encoded these properties. "parallel lists" or json-encoded text could be a sufficient workaround.
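For the JSON-encoded text option, a rough sketch could look like the following. The helper names and the action-to-sources layout are assumptions made for illustration, not something Hetionet currently does; the only point is that a nested mapping can be stored as a single string property and decoded by the consumer.

```python
import json


def encode_action_provenance(action_to_sources):
    """Serialize a {action: set of sources} mapping to one JSON string."""
    # Sets are not JSON-serializable, so convert them to sorted lists first.
    normalized = {
        action: sorted(sources)
        for action, sources in action_to_sources.items()
    }
    return json.dumps(normalized, sort_keys=True)


def decode_action_provenance(text):
    """Recover the {action: list of sources} mapping from the stored string."""
    return json.loads(text)


# Illustrative values only; not taken from the real Hetionet build.
encoded = encode_action_provenance({
    "inverse agonist": {"ChEMBL"},
    "antagonist": {"ChEMBL", "DrugBank (target)"},
})
decoded = decode_action_provenance(encoded)
```

The trade-off is that the string is opaque to Cypher, so filtering on a particular source would have to happen client-side after decoding.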

it would be nice to be able to figure out from the final data what the provenance for each relationship was

Yeah. Good lesson for the future. If need be, we could potentially create some sort of mapping from neo4j relationship id to full provenance info for CbG edges. Not as good as having it in the database, but hopefully an acceptable workaround?
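One possible shape for such a side mapping is sketched below. Everything in it is hypothetical: the function name, the row fields, and the lookup key. A (compound, gene) tuple stands in for the neo4j relationship id, which would need to be looked up separately.

```python
from collections import defaultdict


def build_provenance_lookup(source_rows):
    """Group uncondensed per-source rows by the edge they contribute to."""
    lookup = defaultdict(list)
    for row in source_rows:
        # Stand-in key; a real mapping might use the neo4j relationship id.
        key = (row["compound"], row["gene"])
        lookup[key].append({
            "source": row["source"],
            "actions": sorted(row["actions"]),
        })
    return dict(lookup)


# Illustrative rows, loosely based on the example in this thread;
# not real build output.
source_rows = [
    {"compound": "DB00502", "gene": 1813,
     "source": "DrugBank (target)", "actions": ["antagonist"]},
    {"compound": "DB00502", "gene": 1813,
     "source": "ChEMBL", "actions": ["inverse agonist", "antagonist"]},
]
provenance = build_provenance_lookup(source_rows)
```

Looking up `provenance[("DB00502", 1813)]` then returns one entry per source with that source's own actions, which is the information `condense()` merges away.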

dhimmel avatar Mar 02 '20 16:03 dhimmel