parser/mmcif failed to parse .cif template

Open data2code opened this issue 5 months ago • 9 comments

Boltz2 currently seems unable to support .cif files generated by BioPython (because they do not contain an entity.id section).

This can be reproduced with the following code.

First we generate a 1crn_saved.cif file using BioPython.

from Bio.PDB import PDBList, MMCIFIO, MMCIFParser
import gemmi

# Step 1: Fetch the structure for 1CRN from the PDB
pdbl = PDBList()
pdb_id = "1crn"
pdb_file = pdbl.retrieve_pdb_file(pdb_id, pdir='.', file_format='mmCif')

# Step 2: Load the structure using Biopython
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure(pdb_id, pdb_file)

# Step 3: Save the structure to a new .cif file using Biopython
output_cif = "1crn_saved.cif"
io = MMCIFIO()
io.set_structure(structure)
io.save(output_cif)

We then verify that 1crn_saved.cif can be read by other packages, such as gemmi.

# Step 4: Load the saved .cif file using Gemmi
gemmi_structure = gemmi.read_structure(output_cif)

# Step 5: Print basic info
print(f"Structure name: {gemmi_structure.name}")
print(f"Number of models: {len(gemmi_structure)}")
print("Chains in first model:", [chain.name for chain in gemmi_structure[0]])

The output is

Structure name: 1crn
Number of models: 1
Chains in first model: ['A']

The code below shows that the 1crn_saved.cif file cannot be processed by Boltz2:

import sys
sys.path.insert(0, 'src/boltz/data/parse')
from mmcif import parse_mmcif
parse_mmcif("1crn_saved.cif")

The error message is:

File "src/boltz/data/parse/mmcif.py", line 890, in parse_mmcif
    entity: gemmi.Entity = entities[subchain_id]
                           ~~~~~~~~^^^^^^^^^^^^^
KeyError: 'A'

I think this is because Boltz2 code relies on entities, which is an empty list in this example. Could the boltz code be improved to acquire data more robustly from .cif?
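A quick way to see whether a given .cif file carries the entity records at all is to list the mmCIF categories it contains. The sketch below uses only the standard library and a naive line scan. The helper names (`cif_categories`, `missing_categories`) are made up for illustration, and the list of "required" categories is an assumption based on this thread, not a statement of what Boltz actually checks.

```python
import re

# Assumption: these are the categories relevant to the KeyError above.
# _entity and _entity_poly are mentioned in this thread; _entity_poly_seq
# is where the per-entity sequence normally lives in mmCIF.
REQUIRED = ("_entity", "_entity_poly", "_entity_poly_seq")

# A tag looks like "_category.item" at the start of a line.
TAG_RE = re.compile(r"^\s*(_[^\s.]+)\.\S+")


def cif_categories(text):
    """Return the set of category names (e.g. '_atom_site') found in the text.

    Naive scan: a line inside a multi-line text field that happens to start
    with '_name.' would be counted too.
    """
    found = set()
    for line in text.splitlines():
        match = TAG_RE.match(line)
        if match:
            found.add(match.group(1).lower())
    return found


def missing_categories(text, required=REQUIRED):
    """List the required categories that do not appear in the text."""
    present = cif_categories(text)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    import sys

    # Usage: python check_cif.py 1crn_saved.cif
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, "missing:", missing_categories(handle.read()))
```

Running this on a file from the RCSB and on the BioPython output side by side should make the difference visible without going through the Boltz parser.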

Thanks!

data2code avatar Jul 08 '25 18:07 data2code

Yeah, I believe I can fix that. I'll take a look, thanks for flagging.

jwohlwend avatar Jul 08 '25 19:07 jwohlwend

Just an FYI, in my hands I observe the same behavior. A cif file from the RCSB works though.

metma99 avatar Jul 09 '25 13:07 metma99

I also had trouble with files created using BioPython or PyMol.

I found that Maxit worked in my case (I downloaded source code for maxit-v11.300)

shiakim avatar Jul 18 '25 02:07 shiakim

Chiming in with my experience in case it may be useful. Using OpenBabel to convert PDBs to mmCIFs didn't work for me, but using gemmi does, as long as I make sure the SEQRES is correct or otherwise manually add it using gemmi.

seankhl avatar Jul 24 '25 02:07 seankhl

I tried to generate a .cif file from a .pdb using gemmi, pymol and chimeraX to no avail.

jodypacalon avatar Jul 24 '25 08:07 jodypacalon

> I tried to generate a .cif file from a .pdb using gemmi, pymol and chimeraX to no avail.

Does your PDB have a SEQRES in it?

seankhl avatar Jul 24 '25 17:07 seankhl

@seankhl does your converted CIF have _entity_poly? I am using gemmi=0.6.5 and converting from a cropped PDB (after manually adding back the SEQRES, because PyMol deletes this information when I crop). If I input this as a template, however, I get "ValueError: No chains parsed!". I was able to resolve this by looking at the original cif from RCSB and adding in the _entity_poly loop, but I am wondering if there is a less janky approach.

jwendlan avatar Jul 25 '25 15:07 jwendlan

> I also had trouble with files created using BioPython or PyMol.
>
> I found that Maxit worked in my case (I downloaded source code for maxit-v11.300)

Maxit works for me as well (nothing else worked).

ErikHartman avatar Aug 01 '25 12:08 ErikHartman

You can also use this service which runs Maxit under the hood: https://mmcif.pdbj.org/converter/index.php?l=en

arneschneuing avatar Oct 30 '25 11:10 arneschneuing