boltz icon indicating copy to clipboard operation
boltz copied to clipboard

CCD ligands have wrong isoforms

Open ChenfuShi opened this issue 2 months ago • 3 comments

Hello, As with others before me, I have found that ligand prediction can sometimes have inconsistent isoform depending on the ligand being defined with a SMILES or a CCD id. I think I have found a possible root cause to the issue in at least some instances, which seems to be related to a dependency used during the generation of the ligand database. This could affect both the prediction and training of models.

Basically, the CCD database is generated using the CIF files using the script https://github.com/jwohlwend/boltz/blob/main/scripts/process/ccd.py This script makes use of the package pdbeccdutils, which loads the molecule in 3D, and then creates the chiral centers by calling the function AssignStereochemistryFrom3D. However, this function is not working correctly, possibly when chiral centers have electron pairs that don't form a bond, but I'm not sure if that's the reason.

Let's take as an example SAM. The sulfur group has 3 bonds, but it's a chiral center and it assumes a tetrahedral conformation. If we look at the PDB page for SAM we find that the conformation is the following:

Image

With the sulfur having the (S) form. This is well known in the literature as the (R) form is not functional and eliminated.

However, when we load this exact molecule into pdbeccdutils from the CIF file we find that this reads the center incorrectly and it assigns it a (R) conformation:

Image

This is even though the CIF file is definitely loaded correctly.

I'm going to open an issue on the pdbeccdutils repository as well, as I imagine other tools are also using this dependency, and it might affect a lot of structures.

ChenfuShi avatar Oct 06 '25 13:10 ChenfuShi

I think you should run AssignCIPLabels first on your component.component.mol. It seems like the read_pdb_cif_file does its own sanitization, and maybe didn't call that properly? However, the underlying SMILES are all the same, so all that seems wrong here are the _CIPCode, which I don't think is relied upon in the modeling.

from rdkit import Chem
from rdkit.Chem.rdCIPLabeler import AssignCIPLabels

from pdbeccdutils.core import ccd_reader

mol = Chem.MolFromMolFile("SAM_ideal.sdf")
AssignCIPLabels(mol)
print(Chem.MolToSmiles(mol))
for atom in mol.GetAtoms():
    if atom.GetSymbol() != "S":
        continue
    assert atom.GetProp("_CIPCode") == "S"

mol = Chem.MolFromMolFile("SAM_ideal.sdf")
Chem.rdmolops.AssignStereochemistryFrom3D(mol)
AssignCIPLabels(mol)
print(Chem.MolToSmiles(mol))
for atom in mol.GetAtoms():
    if atom.GetSymbol() != "S":
        continue
    assert atom.GetProp("_CIPCode") == "S"

cif = ccd_reader.read_pdb_cif_file("SAM.cif")
mol = cif.component.mol
before_assign = Chem.MolToSmiles(mol)
for atom in mol.GetAtoms():
    if atom.GetSymbol() != "S":
        continue
    assert atom.GetProp("_CIPCode") == "R"  # what you pointed out

AssignCIPLabels(mol)
after_assign = Chem.MolToSmiles(mol)
assert before_assign == after_assign  # but note, smiles never changed
print(after_assign)
for atom in mol.GetAtoms():
    if atom.GetSymbol() != "S":
        continue
    assert atom.GetProp("_CIPCode") == "S"

BTW chemdraw recognizes it as S too

Image

pechersky avatar Oct 06 '25 16:10 pechersky

Not sure then, basically if I try to generate a structure with boltz by giving the CCD code it does model it wrong, and uses the R chirality rather than the S chirality. Do you have any other guesses as to why the chirality is predicted wrong then?

ChenfuShi avatar Oct 07 '25 12:10 ChenfuShi

@pechersky Hi, I have a followup on this, So digging deeper through the boltz code I found that when a molecule is loaded in schema.py the chiral centers are determined in the computer_chiral_atom_constraints here: https://github.com/jwohlwend/boltz/blob/cb04aeccdd480fd4db707f0bbafde538397fa2ac/src/boltz/data/parse/schema.py#L345

So let's take the SAM molecule again, we can take from the pre-computed database:

import pickle
database = pickle.load(open("/home/chenfushi/boltz/.boltz/ccd.pkl", "rb"))

mol = database["SAM"]

If we run this function as used in boltz we get the wrong chiral form (R) in the Sulfur, which is atom 7

from rdkit.Chem import AllChem as Chem

print(Chem.FindMolChiralCenters(mol, includeUnassigned=False))

[(1, 'S'), (7, 'R'), (10, 'S'), (12, 'S'), (14, 'R'), (16, 'R')]

Notably this happens even if we call AssignCIPLabels. This is because by default force is set to True, and this will overwrite previously assigned labels.

So the way I found to make this function return the correct chiral center si to call AssignCIPLabels, then set the force to false:

from rdkit.Chem import AllChem as Chem
from rdkit.Chem.rdCIPLabeler import AssignCIPLabels
AssignCIPLabels(mol)

print(Chem.FindMolChiralCenters(mol, includeUnassigned=False, force=False))

[(1, 'S'), (7, 'S'), (10, 'S'), (12, 'S'), (14, 'R'), (16, 'R')]

Making these two changes in the boltz code, finally allows it to predict the right isoform. Personally I've added the AssignCIPLabels(mol) call within the get_mol function.

ChenfuShi avatar Oct 09 '25 09:10 ChenfuShi

@pechersky Hi, I’ve encountered a related issue. When generating a DNA chain containing an engineered residue (CCD code GF2), boltz outputs an isoform of GF2. Surprisingly, the problem is not just incorrect chirality—the topology itself is wrong.

Here is the GF2 structure predicted by boltz (F atom highlighted): Image

And here is the correct topology of GF2 (F atom highlighted): Image

I suspect the CCD code might be mapped to a different structure incorrectly. I can attach a reproduction config if needed.

wxx07 avatar Dec 06 '25 07:12 wxx07