rdkit-scripts icon indicating copy to clipboard operation
rdkit-scripts copied to clipboard

Error parsing PDBQT to Mol: Element 'G' not found

Open linminhtoo opened this issue 2 years ago • 4 comments

hello Dr Pavel, I chanced across your repo and found it useful to parse docked pdbqt files back to RDKit mol for further analysis. However, sometimes the pdbqt2mol() function fails as RDKit is not happy with the "G" atom type. https://github.com/DrrDom/rdkit-scripts/blob/72dbd7368a09d625ba01132d4119488a676380cf/vina_dock.py#L200

giving:

****
Post-condition Violation
Element 'G' not found
Violation occurred on line 93 in file /project/build/temp.linux-x86_64-cpython-310/rdkit/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****

This "G" atom type is generated by meeko at macrocycles during ligand preparation for docking with Autodock Vina, see: https://github.com/forlilab/Meeko/issues/10

I saw in your in-line code comments about this issue, but I didn't manage to modify the fix_pdbqt() function successfully: https://github.com/DrrDom/rdkit-scripts/blob/72dbd7368a09d625ba01132d4119488a676380cf/vina_dock.py#L169

Please note that I didn't use your script to run docking; it was prepared slightly differently. I am just using your code to parse back the docked .pdbqt files I also saw your comment on build_macrocycle=: https://github.com/DrrDom/rdkit-scripts/blob/72dbd7368a09d625ba01132d4119488a676380cf/vina_dock.py#L151

Do you have any idea how to fix this issue? I've attached a sample offending .pdbqt file: https://gist.github.com/linminhtoo/5949437ae066fdd136709971dcc36220#file-bad-pdbqt-L26-L27 As you can see, some lines have either "G" (macrocycle), and also "CG0" (not sure if this will also cause problems). I tried brute forcing by replacing "G" with "C" but the template bond order assignment step failed (seems RDKit fails to parse the ring correctly)

Thanks, Min Htoo

linminhtoo avatar Sep 07 '22 07:09 linminhtoo

Hello!

try this:

def fix_pdbqt(pdbqt_block):
    pdbqt_fixed = []
    for line in pdbqt_block.split('\n'):
        if not line.startswith('HETATM') and not line.startswith('ATOM'):
            pdbqt_fixed.append(line)
            continue
        atom_type = line[12:16].strip()
        if atom_type == 'G':
            atom_type = 'C'
        # autodock vina types
        if 'CA' in line[77:79]: #Calcium is exception
            atom_pdbqt_type = 'CA'
        else:
            atom_pdbqt_type = re.sub('D|A', '', line[77:80]).strip() # can add meeko macrocycle types (G and \d (CG0 etc) in the sub expression if will be going to use it
            if 'G' in atom_pdbqt_type: # in ['G','CG0','G0']:
                atom_pdbqt_type = 'A'

        if re.search('\d', atom_type[0]) or len(atom_pdbqt_type) == 2: #1HG or two-letter atom names such as CL,FE starts with 13
            atom_format_type = '{:<4s}'.format(atom_type)
        else: # starts with 14
            atom_format_type = ' {:<3s}'.format(atom_type)
        line = line[:12] + atom_format_type + line[16:]
        pdbqt_fixed.append(line)

    return '\n'.join(pdbqt_fixed)
    
pdbqt_fixed = fix_pdbqt(pdbqt_block)
mol = Chem.MolFromPDBBlock('\n'.join([i[:66] for i in pdbqt_fixed.split('MODEL')[1].split('\n')]), removeHs=False)

To parse PDBQT block by RDKit you actually don't need atom_pdbqt_type (symbols which start from 77 position) since we only need up to 66 symbols to read PDBQT as PDB block ( Chem.MolFromPDBBlock('\n'.join([i[:66] for i in pdbqt_fixed.split('MODEL')[1].split('\n')]), removeHs=False, sanitize=False)). So in your case you can easily just ignore the symbols from 77 position and change G to C in the atom type column ([12:16]). In our code we use atom_pdbqt_type only to properly recognize atom symbol (atom name can contain different symbols or numbers) to get the correct formatting because atom type of elements such as Fe, Cu starts from 13 position and others does from 14 but for some reason OpenBabel ignores sometimes such rule (by the way it might be that authors of meeko have already fixed this issue https://github.com/forlilab/Meeko/issues/3 but when we tried it the above mentioned solution was just easy to use).

I hope it will help you!

avnikonenko avatar Sep 08 '22 14:09 avnikonenko

Hello @avnikonenko

thanks for your prompt response.

I tried what you suggested, and now RDKit can parse the molecule from PDBBlock. Still, it fails when I try to assign the bond orders (see below for the error message). I found out that in the parsed molecule, a 7-membered ring broke incorrectly.

This is the parsed molecule before assigning bond orders, note the 7-membered ring on the left has broken: image Here's the template mol, parsed from the SMILES, note the 7-membered ring on the right. image

error message:

Traceback (most recent call last):
  File "/tmp/nix-shell.D8HiWB/ipykernel_75125/101573197.py", line 22, in <cell line: 11>
    mol = AllChem.AssignBondOrdersFromTemplate(template_mol, mol)
  File "/nix/store/rcsd28zlf6mwfw5yx5a921id2hzmj4ab-python3.10-rdkit-pypi-2022.3.5/lib/python3.10/site-packages/rdkit/Chem/AllChem.py", line 438, in AssignBondOrdersFromTemplate
    raise ValueError("No matching found")
ValueError: No matching found

linminhtoo avatar Sep 09 '22 07:09 linminhtoo

I also took a look at the meeko issue you linked, and actually, their method is much more robust, I think. It actually worked for me :D

from meeko import PDBQTMolecule
pdbqt_mol = PDBQTMolecule(top_pose_pdbqt, is_dlg=False, skip_typing=True, poses_to_read=1)
rdkit_mol = pdbqt_mol[0].export_rdkit_mol()

image as you can see, the molecule matches the template mol.

So, I would suggest we use meeko's way moving forward. They only import 3d atom coordinates from the docked pose .pdbqt file, and map those 3d coordinates onto the original rdkit molecule parsed from the SMILES. This way, we avoid dealing with bond orders which are problematic.

I am happy to help open a PR if you think this is an acceptable solution.

linminhtoo avatar Sep 09 '22 07:09 linminhtoo

This would be a valuable fix for the script. However, this vina_dock.py script is obsolete and it will be removed in some future revisions of the repository. Now I suggest to use this repo https://github.com/ci-lab-cz/docking-scripts, where we implemented support not only VIna2 but also gnina and smina (smina is invoked from gnina code). We continually support this repo.

If you will be able to fix issue there it will be helpful (https://github.com/ci-lab-cz/docking-scripts/issues/8). However, it will require more changes, because meeko changed interface of some of its functions we use.

DrrDom avatar Sep 20 '22 06:09 DrrDom