
DataTransform not joining some documents that belong together

Open ravila4 opened this issue 2 years ago • 1 comments

I have found several documents from aeolus, unii, and ginas that should be merged with documents from chembl/pubchem under a shared primary key. For example, http://mychem.info/v1/chem/22T8Z09XAK and http://mychem.info/v1/chem/XNCKCDBPEMSUFA-UHFFFAOYSA-N both refer to the same entity and should be joined.

I think that the datatransform graph is missing some important links. This is the current graph of connections provided by MyChem's keylookup module. Note that links are missing for the drugcentral and rxnorm nodes.

[Image: mychem_graph]

In the example above, the two documents could be linked via an edge from aeolus.unii, aeolus.rxnorm, or unii.unii to drugcentral.unii or drugcentral.rxnorm.
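To illustrate why a missing edge prevents the join, here is a minimal sketch of a key-lookup graph searched breadth-first. It does not use the real datatransform API: the node names, the `CURRENT_EDGES` table, the assumed `drugcentral -> inchikey` edge, and the helpers `find_path` and `with_extra_edges` are all hypothetical and only stand in for the graph shown above.

```python
from collections import deque

# Toy key-lookup graph: each key type maps to the key types it can be
# converted to. Nodes and edges are illustrative assumptions, not the
# actual MyChem graph.
CURRENT_EDGES = {
    "chembl": ["inchikey"],
    "pubchem": ["inchikey"],
    "drugcentral": ["inchikey"],  # assumed for the sake of the example
    "unii": [],                   # no outgoing links
    "rxnorm": [],                 # no outgoing links
}


def find_path(edges, source, target):
    """Return the shortest list of key types from source to target, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def with_extra_edges(edges, extra):
    """Return a copy of the graph with additional (source, target) edges."""
    merged = {node: list(targets) for node, targets in edges.items()}
    for src, dst in extra:
        merged.setdefault(src, []).append(dst)
        merged.setdefault(dst, [])
    return merged


# Hypothetical fix: connect unii and rxnorm to the drugcentral node.
PROPOSED_EDGES = with_extra_edges(
    CURRENT_EDGES, [("unii", "drugcentral"), ("rxnorm", "drugcentral")]
)

print(find_path(CURRENT_EDGES, "unii", "inchikey"))
print(find_path(PROPOSED_EDGES, "unii", "inchikey"))
```

In this toy graph, a document keyed by unii cannot reach inchikey until the extra edges are added, which is the situation described for the two example documents.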

Additionally, parsers such as Drugcentral's, which perform id resolution inside the parser, could benefit from offloading this step to the datatransform module. For example, this is the current code that Drugcentral uses to determine the primary id for documents without inchikey: https://github.com/biothings/mychem.info/blob/e7c32479e1263a036c2f8c45fbe92c878b32c500/src/hub/dataload/sources/drugcentral/drugcentral_parser.py#L161-L185

In the code above, the parser is running requests against the live MyChem database. It would be better to deal with resolution without depending on external requests.
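As a rough sketch of what resolution without external requests could look like, the snippet below resolves a primary id from local lookup tables. This is not the Drugcentral parser's code or the datatransform API: `resolve_primary_id`, `LOOKUP_TABLES`, the field names, and the preference order are hypothetical, and it assumes 22T8Z09XAK is the UNII for the entity with the inchikey from the example above.

```python
# Hypothetical local lookup tables, e.g. built from already-loaded sources
# rather than from queries against the live MyChem API.
LOOKUP_TABLES = {
    "unii": {"22T8Z09XAK": "XNCKCDBPEMSUFA-UHFFFAOYSA-N"},
    "rxnorm": {},
}

# Assumed order in which alternative identifiers are tried.
PREFERENCE = ["unii", "rxnorm"]


def resolve_primary_id(doc, lookup_tables, preference):
    """Return an inchikey for doc if one can be found locally, else its own _id."""
    if doc.get("inchikey"):
        return doc["inchikey"]
    for key_type in preference:
        value = doc.get(key_type)
        if value is None:
            continue
        inchikey = lookup_tables.get(key_type, {}).get(value)
        if inchikey:
            return inchikey
    # Nothing matched: keep the document's existing identifier.
    return doc["_id"]


doc = {"_id": "22T8Z09XAK", "unii": "22T8Z09XAK"}
print(resolve_primary_id(doc, LOOKUP_TABLES, PREFERENCE))
```

Because the tables are plain in-memory dictionaries here, the result depends only on the data shipped with the build and not on the state of a running service.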

ravila4, Jan 11 '22 21:01