biopandas

mmCIF -> PDB conversion method

Open rasbt opened this issue 3 years ago • 3 comments

The mmCIF parser and class now open the way for an mmCIF -> PDB conversion method :)

rasbt avatar Apr 07 '22 02:04 rasbt

May I know the progress on this? This would be super useful, as many traditional methods and ML models that work with protein 3D structures still rely on PDB files, but I understand people are moving towards storing them as mmCIF files (e.g. AlphaFold's training data consisted entirely of 180,000 mmCIF files).

I also understand that BioPython's conversion is not exactly foolproof (especially for large structures?)
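
Independent of any particular library, one reason large structures are awkward to convert is that the legacy PDB format is fixed-width: the atom serial number has 5 columns, the residue number 4, and the chain ID a single column. The following is a minimal sketch of a pre-conversion check under those limits; the function name `pdb_limit_violations` and its arguments are hypothetical and not part of biopandas or BioPython.

```python
# Fixed-width limits of the legacy PDB format (ATOM/HETATM records).
MAX_ATOM_SERIAL = 99_999     # atom serial number: columns 7-11
MAX_RESIDUE_NUMBER = 9_999   # residue sequence number: columns 23-26
MAX_CHAIN_ID_LENGTH = 1      # chain identifier: column 22


def pdb_limit_violations(n_atoms, chain_ids, max_residue_number):
    """Hypothetical helper: list the reasons a structure does not fit
    into a single legacy PDB file (empty list if it fits)."""
    problems = []
    if n_atoms > MAX_ATOM_SERIAL:
        problems.append(
            f"{n_atoms} atoms exceed the serial number limit of {MAX_ATOM_SERIAL}"
        )
    long_ids = sorted({c for c in chain_ids if len(c) > MAX_CHAIN_ID_LENGTH})
    if long_ids:
        problems.append("multi-character chain IDs: " + ", ".join(long_ids))
    if max_residue_number > MAX_RESIDUE_NUMBER:
        problems.append(
            f"residue number {max_residue_number} exceeds {MAX_RESIDUE_NUMBER}"
        )
    return problems


# Example: a large assembly with a two-letter chain ID
print(pdb_limit_violations(150_000, ["A", "B", "AA"], 350))
```

With a PandasMmcif object, the inputs could presumably be derived from the `id`, `auth_asym_id`, and `auth_seq_id` columns of the ATOM dataframe.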

linminhtoo avatar May 13 '22 03:05 linminhtoo

I needed to process some structures that were only available as .cif files. Here's something I hacked together to convert the ATOM and HETATM dataframes of PandasMmcif() to PandasPdb() format, enjoy :)


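import numpy as np
from biopandas.pdb import PandasPdb

# Column order of the PandasPdb ATOM/HETATM dataframes
pdb_order = [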
    "record_name",
    "atom_number",
    "blank_1",
    "atom_name",
    "alt_loc",
    "residue_name",
    "blank_2",
    "chain_id",
    "residue_number",
    "insertion",
    "blank_3",
    "x_coord",
    "y_coord",
    "z_coord",
    "occupancy",
    "b_factor",
    "blank_4",
    "segment_id",
    "element_symbol",
    "charge",
    "line_idx",
]
mmcif_read = {
    "group_PDB": "record_name",
    "id": "atom_number",
    "auth_atom_id": "atom_name",
    "auth_comp_id": "residue_name",
    "auth_asym_id": "chain_id",
    "auth_seq_id": "residue_number",
    "Cartn_x": "x_coord",
    "Cartn_y": "y_coord",
    "Cartn_z": "z_coord",
    "occupancy": "occupancy",
    "B_iso_or_equiv": "b_factor",
    "type_symbol": "element_symbol",
}

# PandasPdb columns with no counterpart in mmcif_read; filled with empty values
nonefields = [
    "blank_1",
    "alt_loc",
    "blank_2",
    "insertion",
    "blank_3",
    "blank_4",
    "segment_id",
    "charge",
    "line_idx",
]


def biopandas_mmcif2pdb(pandasmmcif):
    """
    Converts the ATOM and HETATM dataframes of PandasMmcif() to PandasPdb() format.
    """
    pandaspdb = PandasPdb()
    for a in ["ATOM", "HETATM"]:
        dfa = pandasmmcif.df[a]
        # keep only those fields found in pdb (pass an explicit list of column names)
        dfa = dfa[list(mmcif_read)]
        # rename fields
        dfa = dfa.rename(columns=mmcif_read)
        # add empty fields
        for i in nonefields:
            dfa[i] = ""
        dfa["charge"] = np.nan
        # reorder columns to PandasPdb order
        dfa = dfa[pdb_order]
        pandaspdb.df[a] = dfa

    # update line_idx
    pandaspdb.df["ATOM"]["line_idx"] = pandaspdb.df["ATOM"].index.values
    pandaspdb.df["HETATM"]["line_idx"] = pandaspdb.df["HETATM"].index

    return pandaspdb
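
To make the column order above concrete, here is a minimal sketch of how one row in that layout maps onto a fixed-width PDB ATOM/HETATM line. The function `format_pdb_atom_line` is hypothetical (biopandas has its own writer, `PandasPdb.to_pdb`), and the atom-name alignment is deliberately simplified.

```python
def format_pdb_atom_line(atom):
    """Hypothetical helper: format one atom (a dict keyed by the
    PandasPdb-style column names) as an 80-column PDB line."""
    name = atom["atom_name"]
    # Simplification: names shorter than 4 characters start in column 14.
    # Real files start two-letter elements (e.g. "FE") in column 13.
    if len(name) < 4:
        name = " " + name
    return (
        f"{atom['record_name']:<6}"        # cols 1-6
        f"{atom['atom_number']:>5}"        # cols 7-11
        " "                                # col 12 (blank_1)
        f"{name:<4}"                       # cols 13-16
        f"{atom.get('alt_loc', ''):1}"     # col 17
        f"{atom['residue_name']:>3}"       # cols 18-20
        " "                                # col 21 (blank_2)
        f"{atom['chain_id']:1}"            # col 22
        f"{atom['residue_number']:>4}"     # cols 23-26
        f"{atom.get('insertion', ''):1}"   # col 27
        "   "                              # cols 28-30 (blank_3)
        f"{atom['x_coord']:8.3f}"          # cols 31-38
        f"{atom['y_coord']:8.3f}"          # cols 39-46
        f"{atom['z_coord']:8.3f}"          # cols 47-54
        f"{atom['occupancy']:6.2f}"        # cols 55-60
        f"{atom['b_factor']:6.2f}"         # cols 61-66
        "          "                       # cols 67-76 (blank_4, segment_id left empty)
        f"{atom['element_symbol']:>2}"     # cols 77-78
        f"{atom.get('charge', ''):>2}"     # cols 79-80
    )


# Made-up example values for a single backbone nitrogen
atom = {
    "record_name": "ATOM", "atom_number": 1, "atom_name": "N",
    "residue_name": "MET", "chain_id": "A", "residue_number": 1,
    "x_coord": 27.340, "y_coord": 24.430, "z_coord": 2.614,
    "occupancy": 1.0, "b_factor": 9.67, "element_symbol": "N",
}
line = format_pdb_atom_line(atom)
print(line)
```

The single-column chain ID and 5-column serial number are where values from larger mmCIF structures would overflow, so a real converter needs to validate or truncate them.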

mrauha avatar Aug 05 '22 12:08 mrauha

Thanks for sharing, that's very helpful. It may be worth adding it as a utility function to biopandas some time!

rasbt avatar Aug 05 '22 14:08 rasbt