
Make it possible for RNAStructure to get initialized with a file

Open valentin994 opened this issue 2 years ago • 3 comments

Hi, I'm working on an API where I need to get the sequence out of a .pdb file. The way the code base is set up now, you need a local copy of the file, which in an API environment means handling the upload twice: first the user uploads the file to the server, and then I have to save it locally so I can pass a path to the RNAStructure class. I took out the parsing part of the code and implemented it myself, but if you would like, I could create a PR to support this functionality and maybe do a bit of cleanup.
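For illustration, the save-to-disk workaround described above might look like the sketch below. It uses only the Python standard library; `bytes_as_temp_path` and the `uploaded` payload are hypothetical names invented for this example, not part of rna-tools, and the place where a path-based loader would be called is only marked with a comment.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def bytes_as_temp_path(data: bytes, suffix: str = ".pdb"):
    """Write in-memory bytes to a temporary file and yield its path.

    Hypothetical helper for this example; the file is removed on exit.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        os.remove(path)


# Made-up payload; in an API it would come from the request body.
uploaded = b"HEADER    RNA\nEND\n"

with bytes_as_temp_path(uploaded) as pdb_path:
    # A path-based loader would be called here with pdb_path.
    with open(pdb_path, "rb") as handle:
        round_trip = handle.read()
```

The extra write and read on every request is the cost that initializing directly from bytes or a file-like object would avoid.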

valentin994 • Jun 26 '22 15:06

Dear @valentin994, it's great that you found the code useful. Let me know if you need any help.

And yeah, let me see what you have so we can improve the package here.

mmagnus • Jul 08 '22 08:07

In the end, I wrote my own parser. It, or Biopython, might be useful here too, so let me know what you think.

The problem I stumbled upon when using RNAStructure().get_sequence() is that I don't always get the sequence I expect. (I can't really explain it in biological terms, as I'm a developer, but I'll run through some examples that might give you insight into this.)

For example, for the PDB files "2l8h", "6b14" and "6las", the output of get_sequence() is:

  • GGCAGAUCUGAGCCUGGGAGCUCUCUGCCRh
  • GACGCGACCGAAAUGGUGAAGGACGGGUCCAGUGCGAAACACGCACUGUUGAGUAGAGUGUGAGCUCCGUAACUGGUCGCGUCghhhhhhhhhhhhhhhhhhhhhhhhh
  • GUUGAUAUGGAUUUACUCCGAGGAGACGAACUACCACGAACAGGGGAAACUCUACCCGUGGCGUCUCCGUUUGACGAGUAAGUCCUAAGUCAACAggggooooghhhhhhhh
  • GGCAUUGUGCCUCGCAUUGCACUCCGCGGGGCGAUAAGUCCUGAAAAGGGAUGUCmhhhhhhhhh GGCAUUGUGCCUCGCAUUGCACUCCGCGGGGCGAUAAGUCCUGAAAAGGGAUGUChh RPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVKRSLKeRGQAFVIFKEVSSATNALRSeQGFPFYDKPeRIQYAKTDSDIIAKehh TRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVKRSLKeRGQAFVIFKEVSSATNALRSeQGFPFYDKPeRIQYAKTDSDIIAKeAhhhhh RPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVKRSLKeRGQAFVIFKEVSSATNALRSeQGFPFYDKPeRIQYAKTDSDIhhhh

While I would expect:

  • GGCAGAUCUGAGCCUGGGAGCUCUCUGCC
  • GACGCGACCGAAAUGGUGAAGGACGGGUCCAGUGCGAAACACGCACUGUUGAGUAGAGUGUGAGCUCCGUAACUGGUCGCGUC
  • GUUGAUAUGGAUUUACUCCGAGGAGACGAACUACCACGAACAGGGGAAACUCUACCCGUGGCGUCUCCGUUUGACGAGUAAGUCCUAAGUCAACA
  • GGCAUUGUGCCUCGCAUUGCACUCCGCGGGGCGAUAAGUCCUGAAAAGGGAUGUC

The expected sequences are what you get if you download the FASTA version of the files mentioned and pull out the sequence with Biopython's SeqIO parser. I'm not sure whether the sequences I get from this package are the expected behaviour, or whether there is any need to parse them out the way I do now. If you find it useful, I can create a PR or show you in more depth how it would look. I hope I managed to explain it well enough 😓
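As a rough sketch of reading such a FASTA file from memory rather than from disk, the standard-library example below joins the sequence lines under each header. `read_fasta` and the sample payload are assumptions made up for this illustration (the payload reuses the first expected sequence above); with Biopython installed, `SeqIO.parse(io.StringIO(text), "fasta")` fills the same role.

```python
import io


def read_fasta(data: bytes) -> dict:
    """Return {header: sequence} from in-memory FASTA bytes.

    Hypothetical helper for this example, not part of rna-tools.
    """
    records = {}
    header = None
    for line in io.StringIO(data.decode("ascii")):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the header is the text after ">".
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header] += line
    return records


# Made-up payload: the first expected sequence, wrapped over two lines.
fasta = b">example\nGGCAGAUCUGAGCC\nUGGGAGCUCUCUGCC\n"
sequences = read_fasta(fasta)
```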

valentin994 • Jul 08 '22 11:07

Oh, I got sidetracked from the original question. But yes, initializing with bytes can also be done.

valentin994 • Jul 08 '22 11:07