CCD ligands have wrong isoforms
Hello, As with others before me, I have found that ligand prediction can sometimes have inconsistent isoform depending on the ligand being defined with a SMILES or a CCD id. I think I have found a possible root cause to the issue in at least some instances, which seems to be related to a dependency used during the generation of the ligand database. This could affect both the prediction and training of models.
Basically, the CCD database is generated using the CIF files using the script https://github.com/jwohlwend/boltz/blob/main/scripts/process/ccd.py This script makes use of the package pdbeccdutils, which loads the molecule in 3D, and then creates the chiral centers by calling the function AssignStereochemistryFrom3D. However, this function is not working correctly, possibly when chiral centers have electron pairs that don't form a bond, but I'm not sure if that's the reason.
Let's take as an example SAM. The sulfur group has 3 bonds, but it's a chiral center and it assumes a tetrahedral conformation. If we look at the PDB page for SAM we find that the conformation is the following:
With the sulfur having the (S) form. This is well known in the literature as the (R) form is not functional and eliminated.
However, when we load this exact molecule into pdbeccdutils from the CIF file we find that this reads the center incorrectly and it assigns it a (R) conformation:
This is even though the CIF file is definitely loaded correctly.
I'm going to open an issue on the pdbeccdutils repository as well, as I imagine other tools are also using this dependency, and it might affect a lot of structures.
I think you should run AssignCIPLabels first on your component.component.mol. It seems like the read_pdb_cif_file does its own sanitization, and maybe didn't call that properly? However, the underlying SMILES are all the same, so all that seems wrong here are the _CIPCode, which I don't think is relied upon in the modeling.
from rdkit import Chem
from rdkit.Chem.rdCIPLabeler import AssignCIPLabels
from pdbeccdutils.core import ccd_reader
mol = Chem.MolFromMolFile("SAM_ideal.sdf")
AssignCIPLabels(mol)
print(Chem.MolToSmiles(mol))
for atom in mol.GetAtoms():
if atom.GetSymbol() != "S":
continue
assert atom.GetProp("_CIPCode") == "S"
mol = Chem.MolFromMolFile("SAM_ideal.sdf")
Chem.rdmolops.AssignStereochemistryFrom3D(mol)
AssignCIPLabels(mol)
print(Chem.MolToSmiles(mol))
for atom in mol.GetAtoms():
if atom.GetSymbol() != "S":
continue
assert atom.GetProp("_CIPCode") == "S"
cif = ccd_reader.read_pdb_cif_file("SAM.cif")
mol = cif.component.mol
before_assign = Chem.MolToSmiles(mol)
for atom in mol.GetAtoms():
if atom.GetSymbol() != "S":
continue
assert atom.GetProp("_CIPCode") == "R" # what you pointed out
AssignCIPLabels(mol)
after_assign = Chem.MolToSmiles(mol)
assert before_assign == after_assign # but note, smiles never changed
print(after_assign)
for atom in mol.GetAtoms():
if atom.GetSymbol() != "S":
continue
assert atom.GetProp("_CIPCode") == "S"
BTW chemdraw recognizes it as S too
Not sure then, basically if I try to generate a structure with boltz by giving the CCD code it does model it wrong, and uses the R chirality rather than the S chirality. Do you have any other guesses as to why the chirality is predicted wrong then?
@pechersky Hi, I have a followup on this, So digging deeper through the boltz code I found that when a molecule is loaded in schema.py the chiral centers are determined in the computer_chiral_atom_constraints here: https://github.com/jwohlwend/boltz/blob/cb04aeccdd480fd4db707f0bbafde538397fa2ac/src/boltz/data/parse/schema.py#L345
So let's take the SAM molecule again, we can take from the pre-computed database:
import pickle
database = pickle.load(open("/home/chenfushi/boltz/.boltz/ccd.pkl", "rb"))
mol = database["SAM"]
If we run this function as used in boltz we get the wrong chiral form (R) in the Sulfur, which is atom 7
from rdkit.Chem import AllChem as Chem
print(Chem.FindMolChiralCenters(mol, includeUnassigned=False))
[(1, 'S'), (7, 'R'), (10, 'S'), (12, 'S'), (14, 'R'), (16, 'R')]
Notably this happens even if we call AssignCIPLabels. This is because by default force is set to True, and this will overwrite previously assigned labels.
So the way I found to make this function return the correct chiral center si to call AssignCIPLabels, then set the force to false:
from rdkit.Chem import AllChem as Chem
from rdkit.Chem.rdCIPLabeler import AssignCIPLabels
AssignCIPLabels(mol)
print(Chem.FindMolChiralCenters(mol, includeUnassigned=False, force=False))
[(1, 'S'), (7, 'S'), (10, 'S'), (12, 'S'), (14, 'R'), (16, 'R')]
Making these two changes in the boltz code, finally allows it to predict the right isoform. Personally I've added the AssignCIPLabels(mol) call within the get_mol function.
@pechersky Hi, I’ve encountered a related issue. When generating a DNA chain containing an engineered residue (CCD code GF2), boltz outputs an isoform of GF2. Surprisingly, the problem is not just incorrect chirality—the topology itself is wrong.
Here is the GF2 structure predicted by boltz (F atom highlighted):
And here is the correct topology of GF2 (F atom highlighted):
I suspect the CCD code might be mapped to a different structure incorrectly. I can attach a reproduction config if needed.