BioStructures.jl

Secondary Structure Information

Open emrekuecuek opened this issue 2 years ago • 2 comments

Hello. In my project, I am using secondary structures. Without going into detail, I need to sample CA atoms from different secondary structures for my algorithm. I know this can easily be done by parsing strings, but I think it would be a nice feature to be able to obtain secondary structure information like any other feature in a PDB file.

Expected Behavior

One solution might be to create a function such as helix(struc::ProteinStructure) or sheet(struc::ProteinStructure) that returns the lines containing the starting and ending residues of each secondary structure element. We could then get the list of atoms/residues with a combination like this:

collectatoms(struc::ProteinStructure, calphaselector)[helix(struc::ProteinStructure)[1]] # to obtain the atoms belonging to the first alpha helix

Current Behavior

I could not find anything related to this suggestion in the documentation. If it already exists, I apologize for missing it.

Possible Solution / Implementation

Though I have some experience with the source code in the spatial.jl file, I do not have any experience with parsing PDB files. My suggestion would be to write a string parser as a function, but I am not sure how we could connect it to a ProteinStructure object.
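As a rough illustration of the string-parser idea, here is a minimal sketch that reads HELIX records using the fixed column positions of the PDB format (chain ID in column 20, start residue number in columns 22-25, end residue number in columns 34-37). The names `HelixRange` and `parsehelices` are hypothetical and not part of BioStructures.jl, and insertion codes and the end chain ID are ignored for simplicity.

```julia
# Hypothetical helper types/functions, not BioStructures.jl API.
struct HelixRange
    chainid::String
    startres::Int
    endres::Int
end

# Collect the residue range of every HELIX record in an iterable of lines.
function parsehelices(lines)
    helices = HelixRange[]
    for line in lines
        startswith(line, "HELIX") || continue
        length(line) >= 37 || continue              # too short to hold the end residue number
        chainid  = string(line[20])                 # column 20: chain ID of the first residue
        startres = parse(Int, strip(line[22:25]))   # columns 22-25: first residue number
        endres   = parse(Int, strip(line[34:37]))   # columns 34-37: last residue number
        push!(helices, HelixRange(chainid, startres, endres))
    end
    return helices
end
```

It could be called on a downloaded file with something like `parsehelices(eachline("1AKE.pdb"))`. The returned ranges would still need to be matched against residue numbers in the parsed structure, which is the part this sketch does not address.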

Context

My project is related to secondary structures. I think it would be nice to make this information available to those who need it.

emrekuecuek, Sep 20 '21 16:09

You are right, it would be useful. On the todo list is wrapping DSSP to calculate secondary structure from the structure itself, rather than reading it from the PDB/mmCIF header. This approach fits the philosophy of BioStructures better than reading the header, since it would work on custom PDB files without a header too.

No promises about that being implemented soon though, sorry. In the meantime you could write the mmCIF header parsing functions you need using the mmCIF dictionary, for example for helices:

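using BioStructures
# Download the mmCIF file for PDB entry 1AKE into the current directory
downloadpdb("1AKE", format=MMCIF)
# Read the file into a dictionary of mmCIF data items
d = MMCIFDict("1AKE.cif")
# Start and end residue numbers of each _struct_conf entry, as strings
hs, he = d["_struct_conf.beg_auth_seq_id"], d["_struct_conf.end_auth_seq_id"]
# Convert to (start, end) integer pairs, one per entry
helices = [(parse(Int, s), parse(Int, e)) for (s, e) in zip(hs, he)]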

Note chain IDs etc. are neglected for simplicity in this example.
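If chain IDs are needed, the same idea can be extended along these lines. This is only a sketch: `helixranges` is a hypothetical helper, not part of BioStructures.jl, and it assumes the entry provides the `_struct_conf.beg_auth_asym_id` and `_struct_conf.conf_type_id` items and that helix entries have a type starting with "HELX". It accepts any dictionary-like object that maps data names to vectors of strings, such as the MMCIFDict `d` above.

```julia
# Hypothetical helper, not BioStructures.jl API.
# Returns one named tuple per helix entry, keeping the chain ID of the first residue.
function helixranges(d)
    chains = d["_struct_conf.beg_auth_asym_id"]
    starts = d["_struct_conf.beg_auth_seq_id"]
    stops  = d["_struct_conf.end_auth_seq_id"]
    types  = d["_struct_conf.conf_type_id"]
    return [(chain = c, startres = parse(Int, s), endres = parse(Int, e))
            for (c, s, e, t) in zip(chains, starts, stops, types)
            if startswith(t, "HELX")]   # assumption: skip non-helix entries such as turns
end
```

Calling `helixranges(d)` would then give entries that can be compared against chain IDs and residue numbers when selecting atoms. Insertion codes are still ignored here.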

jgreener64, Sep 20 '21 18:09

Thank you for your very fast response. I will think about using mmCIF files, or maybe I can parse the strings from PDB files as well. I haven't decided on that yet. After my graduation, I would like to contribute to this great project in my free time. Perhaps even on this issue? :sweat_smile:

emrekuecuek, Sep 22 '21 10:09

There is a new package, ProteinSecondaryStructures.jl. Would it be possible to add support for secondary structure information based on it?

shuuul, Sep 17 '23 15:09

ProteinSecondaryStructures.jl is based on PDBTools.jl. Maybe we can use DSSP_jll or STRIDE_jll directly and add an ss property to Residue.

shuuul, Sep 17 '23 16:09

There is discussion about running ProteinSecondaryStructures.jl on BioStructures.jl types at https://github.com/m3g/ProteinSecondaryStructures.jl/issues/4.

It would also be cool to use DSSP_jll or STRIDE_jll from within BioStructures.jl, but I am unlikely to have the time to add this. Feel free to work on a PR.

jgreener64, Sep 18 '23 13:09

@jgreener64 I made a PR: https://github.com/BioJulia/BioStructures.jl/pull/43. Please check it and let me know what can be improved. I am new to Julia development.

shuuul, Sep 18 '23 17:09

Secondary structure support has now been added, thanks to @shuuul.

jgreener64, Sep 23 '23 22:09