Residue numbering sensitive to `db_align_beg`?
This is probably very naive, but when one is working with a structure that represents a fragment, should there be an option to number the residues with respect to the "parent" (or whatever it would be called) sequence? I'm looking in particular at this section of 1bpi.cif:
_struct_ref_seq.pdbx_db_accession P00974
_struct_ref_seq.db_align_beg 36
_struct_ref_seq.pdbx_db_align_beg_ins_code ?
_struct_ref_seq.db_align_end 93
I take that to mean that the 58 residues in the file correspond to sequence positions 36:93 in P00974.
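To make the arithmetic concrete, here is a minimal sketch of that reading in plain Julia. It assumes the alignment starts at `label_seq_id` 1 (the value `_struct_ref_seq.seq_align_beg` would hold) and that there are no gaps or insertion codes. The helper `to_db_position` is a hypothetical name, not something BioStructures.jl provides.

```julia
# Values copied from the _struct_ref_seq block of 1bpi.cif quoted above
db_align_beg = 36
db_align_end = 93

# Assumption: the aligned region starts at label_seq_id 1
# (this is what _struct_ref_seq.seq_align_beg would record)
seq_align_beg = 1

# Number of database positions covered by the alignment
n_aligned = db_align_end - db_align_beg + 1

# Hypothetical helper: map a label_seq_id to a position in the database sequence
to_db_position(label_seq_id::Integer) = db_align_beg + (label_seq_id - seq_align_beg)

println(n_aligned)           # 58
println(to_db_position(1))   # 36
println(to_db_position(58))  # 93
```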
We parse the _atom_site.auth_seq_id, which is the numbering given by the authors, rather than the automated _atom_site.label_seq_id to get the residue number. The authors' numbering often, but not always, corresponds to something meaningful such as the UniProt sequence numbering.
I'm not sure exactly what those mmCIF entries mean and how they relate to the _atom_site.auth_seq_id. We can think about adding it if it's useful to people, but I haven't heard anyone request it.
Just a couple of links in case they are useful:
- explanation of fields: https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Categories/struct_ref_seq.html
- UniProt's structure page for this P00974 entry: https://www.uniprot.org/uniprotkb/P00974/entry#structure (1bpi is listed as spanning the range 36-93)
- UniProt's sequence page for this P00974 entry: https://www.uniprot.org/uniprotkb/P00974/entry#sequences, which is consistent with the span given and 1bpi's sequence
I'm not wedded to this idea, since it's easy to extract this info from the MMCIFDict. Am I right, though, in thinking that if you want both these extra bits of data and the structure, you have to parse the file twice?
> Am I right, though, in thinking that if you want both these extra bits of data and the structure, you have to parse the file twice?
Yes, though structure parsing skips lines until it finds the relevant fields (https://github.com/BioJulia/BioStructures.jl/blob/master/src/mmcif.jl#L178).
We could have a wrapper around read that returns the MMCIFDict alongside the structure and only parses the file once.
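A minimal sketch of what such a wrapper could look like. The name `read_with_dict` is hypothetical, and the sketch assumes that a `MolecularStructure` can be constructed directly from an `MMCIFDict`, as I believe is the case in recent BioStructures.jl versions (older versions used `ProteinStructure` instead).

```julia
using BioStructures

# Hypothetical helper (not part of BioStructures.jl): parse the file once into
# an MMCIFDict, then build the structure from that dictionary.
function read_with_dict(filepath::AbstractString)
    mmcif_dict = MMCIFDict(filepath)
    # Assumes the MolecularStructure(::MMCIFDict) constructor is available
    struc = MolecularStructure(mmcif_dict)
    return struc, mmcif_dict
end

# Usage, assuming 1bpi.cif is in the working directory:
# struc, d = read_with_dict("1bpi.cif")
# d["_struct_ref_seq.db_align_beg"]  # values are stored as vectors of strings
```

The file is read from disk only once this way, at the cost of keeping the whole dictionary in memory.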
Another option might be to store extra fields in the MolecularStructure, in a Dict that stays empty unless the user requests certain fields, for example:
struc = read("AAAA.cif", MMCIFFormat; headerfields = ["_struct_ref_seq"])
struc.header["_struct_ref_seq.pdbx_db_accession"]
That's a nice idea!