Import from PubChem using CAS

Open alexsvcl opened this issue 11 months ago • 9 comments

Detailed description of the problem

Importing a compound via a CAS gives the main molecule (without its salt) bearing another CAS number

Expected Behavior

We expect to have the compounds corresponding to the requested CAS

Steps to reproduce the behavior

  1. Tools > Compounds > Import from Pubchem
  2. CAS number: 56392-17-7 (=Metoprolol Tartrate)
  3. Search
  4. CAS found = 51384-51-1 (=Metoprolol)

Image

What eLabFTW version are you using? Visible in bottom right of a page.

5.2.1

Do you have any idea what may have caused this?

No response

Do you have an idea how to solve the issue?

No response

Additional information

No response

alexsvcl avatar May 20 '25 13:05 alexsvcl

Hello,

Apparently PubChem links a CAS number to the parent compound. That's just how their system works. So in your case, you need to use CID 441308 to import the salt instead of using the CAS, because the CAS redirects to the parent compound.

Debug info:

We get the CID (pubchem id) from the CAS by going here:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/xref/rn/56392-17-7/json

As you can see it gives CID 4171:

Image

Instead of 441308.
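
For anyone who wants to reproduce this lookup outside eLabFTW, here is a minimal Python sketch. The function names are made up for illustration, and the `PC_Compounds[].id.id.cid` layout is an assumption about the JSON that PUG REST returns for this URL, so check it against a real response. Parsing is kept separate from the HTTP call so it can be tried on a saved response.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def xref_rn_url(cas: str) -> str:
    """Build the xref/rn URL shown above for a given CAS registry number."""
    return f"{PUG_BASE}/compound/xref/rn/{quote(cas, safe='')}/json"


def cids_from_records(payload: dict) -> list:
    """Collect the CID of each record (assumed layout: PC_Compounds[].id.id.cid)."""
    cids = []
    for record in payload.get("PC_Compounds", []):
        cid = record.get("id", {}).get("id", {}).get("cid")
        if cid is not None:
            cids.append(cid)
    return cids


def lookup_cids(cas: str) -> list:
    """Network call: fetch the records for a CAS number and return their CIDs."""
    with urlopen(xref_rn_url(cas), timeout=30) as response:
        return cids_from_records(json.load(response))
```

Calling `lookup_cids("56392-17-7")` performs the same request as the link above.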

NicolasCARPi avatar May 20 '25 13:05 NicolasCARPi

Seconding this issue. The CAS search is unreliable, which is making it difficult to implement. Previously, I had written my own PubChem search using the pubchempy library, which had fewer issues. I'm not sure exactly what the pcp.get_compounds() function does differently, and it still had some trouble, but it produced fewer errors. I'll leave a short script below:

def get_compound(CAS:str) -> pcp.Compound:
    compound_list: list[pcp.Compound] = pcp.get_compounds(CAS, "name")
    if len(compound_list) > 1:
        raise ValueError(
            "Multiple compounds with this name have been found, please input a more specific name or CAS number"
        )
    elif len(compound_list) == 0:
        raise ValueError("No compound with this name has been found")
    compound: pcp.Compound = compound_list[0]
    return compound

def pull_values(searchquery: str) -> dict:
    compound: pcp.Compound = get_compound(searchquery)
    values:dict = {
        "Title_0": compound.synonyms[0],
        "Full name": compound.iupac_name,
        "SMILES": canonicalize_smiles(compound.isomeric_smiles),
        "Molecular Weight": compound.molecular_weight,
        "Pubchem Link": f"https://pubchem.ncbi.nlm.nih.gov/compound/{compound.cid}",
        "Hazards Link": f"https://pubchem.ncbi.nlm.nih.gov/compound/{compound.cid}#section=Hazards-Identification",
    }
    if not check_if_cas(searchquery):
        values.update({"CAS": find_cas(compound.synonyms)})
    return values
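
The script above calls `check_if_cas` and `find_cas` without showing them. What follows is a guess at what such helpers could look like, not the original code. It relies on the published CAS format: 2 to 7 digits, then 2 digits, then one check digit equal to the weighted sum of the other digits (weights 1, 2, 3, ... counted from the right) modulo 10.

```python
import re
from typing import Optional

# Format of a CAS registry number: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^([0-9]{2,7})-([0-9]{2})-([0-9])$")


def check_if_cas(text: str) -> bool:
    """Return True if text is a well-formed CAS number with a valid check digit."""
    match = CAS_PATTERN.match(text.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Rightmost digit gets weight 1, the next one weight 2, and so on.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def find_cas(synonyms: list) -> Optional[str]:
    """Return the first synonym that is a valid CAS number, or None."""
    for synonym in synonyms:
        if check_if_cas(synonym):
            return synonym.strip()
    return None
```

Note that a valid checksum only says the string is a plausible CAS number, not that it belongs to the compound you searched for.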

12buntu avatar May 20 '25 19:05 12buntu

what is pcp?

NicolasCARPi avatar May 21 '25 08:05 NicolasCARPi

Sorry, forgot to mention that, it's from the PubChemPy module: https://pubchempy.readthedocs.io/en/latest/

import pubchempy as pcp

This may be a slow solution, but since the search by CAS function pulls multiple related compound CIDs, in addition to the correct one (in the case of https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/xref/rn/56392-17-7/json, 2 is the desired compound), could you iterate through all the matches to find an exact CAS match?

12buntu avatar May 21 '25 15:05 12buntu

Notably, the pcp.get_compounds(CAS: str, "name") call searches by the synonyms field in PubChem. I called the variable CAS because ideally the user would input a CAS, but it also searches by name.

12buntu avatar May 21 '25 15:05 12buntu

I've found that adding cids to the URL gives us a list of CIDs for a given CAS:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/xref/rn/56392-17-7/cids/json

I've made a modification to import all the returned CIDs as compounds and link them all to the resource (during CSV import). I don't see another way around this problem, and I feel it's better to have the correct link plus others that can later be removed than a single link to an incorrect compound.

I've found that CID 165339218 ((2R,3R)-2,3-dihydroxybutanedioate;[2-hydroxy-3-[4-(2-methoxyethyl)phenoxy]propyl]-propan-2-ylazanium) has the same CAS (56392-17-7) as Metoprolol Tartrate, so Metoprolol Tartrate isn't imported as a compound because the CAS has a UNIQUE property. But they are the same molecule, so I guess it's fine.

So the logic would be:

Get a CAS, find one or several CIDs from it, import each of them (skipping the ones with the same CAS), and link them all to the created resource (during the CSV import process).
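
To make that logic concrete, here is a rough Python sketch (not the actual eLabFTW code). `Candidate`, `ImportPlan` and `plan_import` are hypothetical names, and the `IdentifierList.CID` layout is my reading of what the `/cids/json` endpoint returns. Each candidate's CAS is assumed to have been looked up already.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Candidate:
    cid: int
    cas: Optional[str]  # None when no CAS is known for this CID


@dataclass
class ImportPlan:
    to_import: list = field(default_factory=list)  # also the compounds to link
    skipped: list = field(default_factory=list)    # CAS already taken


def cids_from_identifier_list(payload: dict) -> list:
    """Read the CID list (assumed layout: IdentifierList.CID)."""
    return list(payload.get("IdentifierList", {}).get("CID", []))


def plan_import(candidates: list, existing_cas: set) -> ImportPlan:
    """Keep each candidate unless its CAS is already stored or already planned."""
    plan = ImportPlan()
    seen = set(existing_cas)
    for candidate in candidates:
        if candidate.cas is not None and candidate.cas in seen:
            plan.skipped.append(candidate)
            continue
        if candidate.cas is not None:
            seen.add(candidate.cas)
        plan.to_import.append(candidate)
    return plan
```

With CID 165339218 listed before CID 441308 and both carrying 56392-17-7, the first is imported and the second is skipped, which is the situation described above.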

Thoughts?

NicolasCARPi avatar Jun 22 '25 11:06 NicolasCARPi

Linking multiple compounds to a resource sounds like a good solution (especially if there is some tag/list to flag these for manual review), but I am a little confused about the logic you specified at the bottom, specifically the "skip the ones with same CAS" part. If I am understanding you correctly, it would find several CIDs associated with a given CAS, import all of the CIDs with unique CAS numbers, and associate all of them with that resource, so that the resource has multiple CIDs associated with it, each with a unique CAS. I don't understand why you would skip the ones with the same CAS. If I enter a CAS for a resource, I only want CIDs with that exact CAS associated with the resource, so I don't understand the benefit of skipping entries with the same CAS, and associating multiple entries with different CAS numbers to a resource.

12buntu avatar Jun 23 '25 15:06 12buntu

> it would find several CIDs associated with a given CAS

Yes, and then give you a choice to import them all or just the one you want. Work in progress screenshot (imagine buttons next to each to import it, or checkboxes):

Image

> I don't understand why you would skip the ones with the same CAS

Because the compounds table has the cas column marked as unique, which makes it impossible to import two compounds with the same CAS.

What I don't understand is why the CAS number, supposedly unique, can point to several things. Maybe I should remove the UNIQUE constraint on the CAS number in the compounds table then, because it seems I misunderstood what a CAS number is supposed to represent. Yet all the documentation on this number mentions that it is unique.
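
To illustrate what the constraint does, here is a small stand-alone example using SQLite from Python. The table is a simplified stand-in, not the actual eLabFTW schema, and `insert_compounds` is a made-up helper.

```python
import sqlite3


def insert_compounds(rows: list) -> list:
    """Insert (cid, cas) rows one by one; return the CIDs the UNIQUE constraint rejects."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE compounds (cid INTEGER PRIMARY KEY, cas TEXT UNIQUE)")
    rejected = []
    try:
        for cid, cas in rows:
            try:
                conn.execute("INSERT INTO compounds (cid, cas) VALUES (?, ?)", (cid, cas))
            except sqlite3.IntegrityError:
                # Raised when another row already holds the same non-NULL cas.
                rejected.append(cid)
    finally:
        conn.close()
    return rejected


rejected = insert_compounds([
    (165339218, "56392-17-7"),
    (441308, "56392-17-7"),  # same CAS as the row above
    (4171, "51384-51-1"),
])
```

Here the second row is the one that gets rejected. Rows with no CAS (NULL) do not collide with each other under a UNIQUE constraint in SQLite.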

> If I enter a CAS for a resource, I only want CIDs with that exact CAS associated with the resource

As you can see in the above screenshot, some of them have the same CAS but are still different compounds, and some of them have no CAS. The thing is that I'm also bound by the results sent by the PubChem API. Leaving the choice of what to import to the end user seems a suitable solution.

If someone has a better idea, please let me know.

NicolasCARPi avatar Jun 23 '25 20:06 NicolasCARPi

I agree, it seems to be the best solution.

But ultimately a CAS number corresponds to a unique "substance", which can also be a drug or a commercial mixture/formulation, while CIDs correspond to "pure" compounds. The substance metoprolol tartrate is a mixture of metoprolol and a tartrate salt, and that is exactly what elab can do.

I don't know if it is possible to:

  • Search a CAS
  • Find all related CIDs
  • (do not import these CIDs)
  • Request to find all components present in all CIDs entry
  • Import only components

Here it will give:

  • metoprolol
  • tartrate (D, L, racemic)
  • tartaric acid (D..)

And not all the other substances, which could be drugs or commercial compounds.

The end user may still remove pure components if needed. 😅
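
One rough way to get at the components, as a sketch only: in SMILES notation a `.` separates disconnected components, so the SMILES of a salt entry can be split into its parts. The SMILES strings below are hand-written for illustration rather than copied from PubChem, and `split_components` is a made-up helper. It assumes no ring-closure bond spans a dot.

```python
def split_components(smiles: str) -> list:
    """Return the distinct dot-separated components of a SMILES string, in order."""
    seen = set()
    components = []
    for part in smiles.split("."):
        if part and part not in seen:
            seen.add(part)
            components.append(part)
    return components


metoprolol = "CC(C)NCC(O)COc1ccc(CCOC)cc1"
tartaric_acid = "OC(C(O)C(=O)O)C(=O)O"
salt = f"{metoprolol}.{metoprolol}.{tartaric_acid}"

components = split_components(salt)
```

This only removes parts that are written identically. Two different spellings of the same molecule would need a cheminformatics toolkit to be recognised as duplicates.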

alexsvcl avatar Jun 23 '25 21:06 alexsvcl

I am not sure I understand how it goes wrong. Some time ago I wrote a double search tool in Python for PubChem and put it on PyPI as pubchemTools. For this example it gives:

>>> from pubchemTools import Pubchem
>>> a = Pubchem('56392-17-7')
>>> a.cas
['56392-17-7']
>>> a.name
'Metoprolol Tartrate'
>>> a.cid
441308

As far as I remember, it was as straightforward as it gets.

tomio13 avatar Aug 14 '25 14:08 tomio13

See my comment about switching the xref value: https://github.com/elabftw/elabftw/discussions/5853#discussioncomment-14105387

NicolasCARPi avatar Aug 14 '25 15:08 NicolasCARPi