BioStructures.jl

Residue numbering sensitive to `db_align_beg`?

Open timholy opened this issue 7 months ago • 5 comments

This is probably very naive, but when one is working with a structure that represents a fragment, should there be an option to number the residues with respect to the "parent" (or whatever it would be called) sequence? I'm looking in particular at this section of 1bpi.cif:

_struct_ref_seq.pdbx_db_accession             P00974 
_struct_ref_seq.db_align_beg                  36 
_struct_ref_seq.pdbx_db_align_beg_ins_code    ? 
_struct_ref_seq.db_align_end                  93 

I take that to mean that the 58 residues in the file correspond to sequence positions 36:93 in P00974.
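For illustration, that reading amounts to a constant offset between the position in the fragment and the position in the reference sequence. This is a minimal sketch, assuming a single contiguous alignment block with no insertions or gaps; `parent_position` is a hypothetical helper, not part of BioStructures.jl.

```julia
# Hypothetical helper (not a BioStructures.jl function): map a 1-based position
# within the fragment to the matching position in the reference sequence.
# Assumes one contiguous alignment block with no insertions or gaps.
parent_position(i::Integer, db_align_beg::Integer) = db_align_beg + i - 1

# Values quoted from the 1bpi.cif excerpt above.
db_align_beg, db_align_end = 36, 93

# Number of residues covered by the alignment range.
nres = db_align_end - db_align_beg + 1

# Reference-sequence position for every residue in the fragment.
parent_numbers = parent_position.(1:nres, db_align_beg)
```

Under these assumptions the first residue maps to position 36 and the last to position 93.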

timholy avatar May 24 '25 11:05 timholy

We parse _atom_site.auth_seq_id, the numbering given by the authors, rather than the automatically assigned _atom_site.label_seq_id, to get the residue number. The author numbering often, but not always, corresponds to something meaningful such as the UniProt sequence.

I'm not sure exactly what those mmCIF entries mean and how they relate to the _atom_site.auth_seq_id. We can think about adding it if it's useful to people, but I haven't heard anyone request it.

jgreener64 avatar May 26 '25 13:05 jgreener64

Just a couple of links in case they are useful:

  • explanation of fields: https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Categories/struct_ref_seq.html
  • UniProt's structure page for this P00974 entry: https://www.uniprot.org/uniprotkb/P00974/entry#structure (1bpi is listed as spanning the range 36-93)
  • UniProt's sequence page for this P00974 entry: https://www.uniprot.org/uniprotkb/P00974/entry#sequences, which is consistent with the span given and 1bpi's sequence

I'm not wedded to this idea, since it's easy to extract this info from the MMCIFDict. Am I right, though, in thinking that if you both want these extra bits of data and the structure, you have to parse the file twice?
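As a rough sketch of that extraction: the helper below reads the alignment range from anything that behaves like a dictionary with `Vector{String}` values, which is how `MMCIFDict` stores data items. The name `db_align_range` and its `row` keyword are hypothetical, and a plain `Dict` stands in for the parsed file so the example runs without downloading 1bpi.cif.

```julia
# Hypothetical helper: read the reference-sequence range from a dictionary-like
# object whose values are Vector{String}. `row` selects which alignment block
# to use when the file lists more than one.
function db_align_range(d; row::Integer = 1)
    beg = parse(Int, d["_struct_ref_seq.db_align_beg"][row])
    fin = parse(Int, d["_struct_ref_seq.db_align_end"][row])
    return beg:fin
end

# Stand-in for MMCIFDict("1bpi.cif"), holding only the values quoted above.
mock = Dict(
    "_struct_ref_seq.pdbx_db_accession" => ["P00974"],
    "_struct_ref_seq.db_align_beg" => ["36"],
    "_struct_ref_seq.db_align_end" => ["93"],
)

r = db_align_range(mock)
```

With a real file, the same call should work on `MMCIFDict("1bpi.cif")` in place of `mock`.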

timholy avatar May 27 '25 08:05 timholy

> Am I right, though, in thinking that if you both want these extra bits of data and the structure, you have to parse the file twice?

Yes, though structure parsing skips lines until it finds the relevant fields (https://github.com/BioJulia/BioStructures.jl/blob/master/src/mmcif.jl#L178).

We could have a wrapper around read that returns the MMCIFDict alongside the structure and only parses the file once.
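A minimal sketch of such a wrapper, assuming the `MolecularStructure(::MMCIFDict)` constructor is used to build the structure from an already-parsed dictionary. `read_with_dict` is a hypothetical name, not an existing BioStructures.jl function.

```julia
using BioStructures

# Hypothetical wrapper: parse the mmCIF file once into an MMCIFDict, then build
# the structure from that in-memory dictionary instead of re-reading the file.
function read_with_dict(path::AbstractString; kwargs...)
    mmcif_dict = MMCIFDict(path)                      # the only pass over the file
    struc = MolecularStructure(mmcif_dict; kwargs...) # assumed constructor, see lead-in
    return struc, mmcif_dict
end

# Usage, assuming 1bpi.cif is in the working directory:
# struc, d = read_with_dict("1bpi.cif")
# d["_struct_ref_seq.db_align_beg"]
```

One trade-off: this parses the whole dictionary, so it does not benefit from the line-skipping that the structure-only read performs.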

jgreener64 avatar May 27 '25 14:05 jgreener64

Another option might be to store extra fields in the MolecularStructure, but the Dict is empty unless the user requests certain fields:

struc = read("AAAA.cif", MMCIFFormat; headerfields = ["_struct_ref_seq"])
struc.header["_struct_ref_seq.pdbx_db_accession"]

timholy avatar May 27 '25 16:05 timholy

That's a nice idea!

jgreener64 avatar May 27 '25 16:05 jgreener64