rna-tools
Make it possible for RNAStructure to get initialized with a file
Hi, I'm working on an API where I need to get the sequence out of a .pdb file. As the code base is set up now, you need a local copy of the file, which in an API environment means the file effectively gets uploaded twice: first the user uploads it to the server, and then I have to save it locally so I can pass a path to the RNAStructure class. I extracted the parsing part of the code and implemented it myself, but if you'd like, I could create a PR to support this functionality and maybe do a bit of cleanup.
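To make the request concrete, here is a minimal sketch of the kind of input handling I have in mind. The function name `read_pdb_text` is hypothetical and not part of rna-tools; it only shows how a constructor could accept a path, raw bytes, or an already-open file object and normalize all three to lines of text before parsing.

```python
import io
import os


def read_pdb_text(source):
    """Return the PDB content as a list of text lines.

    `source` may be raw bytes (e.g. an upload body), a file-like object
    with a .read() method, or a filesystem path. Hypothetical helper,
    not an rna-tools function.
    """
    if isinstance(source, (bytes, bytearray)):
        # Uploaded payload held in memory: decode it directly.
        text = bytes(source).decode("utf-8", errors="replace")
    elif hasattr(source, "read"):
        # File-like object: may yield either bytes or str.
        data = source.read()
        text = data.decode("utf-8", errors="replace") if isinstance(data, bytes) else data
    else:
        # Anything else is treated as a path on disk.
        with open(os.fspath(source), encoding="utf-8", errors="replace") as handle:
            text = handle.read()
    return text.splitlines(keepends=True)
```

With something like this in place, the server could pass the uploaded bytes straight through, and existing callers that pass a path would keep working.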
Dear @valentin994, it's great that you found the code useful. Let me know if you need any help.
And yeah, let me see what you have so we can improve the package here.
In the end, I wrote my own parser. It might be useful here too, or in Biopython, so let me know what you think.
So the problem I stumbled upon when using `RNAStructure().get_sequence` is that I wouldn't always get the sequence I expected (I can't really explain it in biological terms as I'm a developer, but I'll run through examples that might give you some insight into this).
For example, for the PDB files "2l8h", "6b14", and "6las", the output from `get_sequence()` would be:
- GGCAGAUCUGAGCCUGGGAGCUCUCUGCCRh
- GACGCGACCGAAAUGGUGAAGGACGGGUCCAGUGCGAAACACGCACUGUUGAGUAGAGUGUGAGCUCCGUAACUGGUCGCGUCghhhhhhhhhhhhhhhhhhhhhhhhh
- GUUGAUAUGGAUUUACUCCGAGGAGACGAACUACCACGAACAGGGGAAACUCUACCCGUGGCGUCUCCGUUUGACGAGUAAGUCCUAAGUCAACAggggooooghhhhhhhh
- GGCAUUGUGCCUCGCAUUGCACUCCGCGGGGCGAUAAGUCCUGAAAAGGGAUGUCmhhhhhhhhh GGCAUUGUGCCUCGCAUUGCACUCCGCGGGGCGAUAAGUCCUGAAAAGGGAUGUChh RPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVKRSLKeRGQAFVIFKEVSSATNALRSeQGFPFYDKPeRIQYAKTDSDIIAKehh TRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVKRSLKeRGQAFVIFKEVSSATNALRSeQGFPFYDKPeRIQYAKTDSDIIAKeAhhhhh RPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVKRSLKeRGQAFVIFKEVSSATNALRSeQGFPFYDKPeRIQYAKTDSDIhhhh
While I would expect:
- GGCAGAUCUGAGCCUGGGAGCUCUCUGCC
- GACGCGACCGAAAUGGUGAAGGACGGGUCCAGUGCGAAACACGCACUGUUGAGUAGAGUGUGAGCUCCGUAACUGGUCGCGUC
- GUUGAUAUGGAUUUACUCCGAGGAGACGAACUACCACGAACAGGGGAAACUCUACCCGUGGCGUCUCCGUUUGACGAGUAAGUCCUAAGUCAACA
- GGCAUUGUGCCUCGCAUUGCACUCCGCGGGGCGAUAAGUCCUGAAAAGGGAUGUC
The expected sequences are what you get if you take the FASTA version of the files mentioned and pull out the sequence with Biopython's SeqIO parser. Now, I'm not sure whether the sequences I get from this package are the expected behaviour, or whether there is any need to be able to parse them out like I do now. If you find it useful, I can create a PR or show you in more depth how it would look. I hope I managed to explain it well enough 😓
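For illustration, here is a rough sketch of the filtering I mean. It is not the rna-tools implementation, and `rna_sequences` is a hypothetical name. It assumes the trailing lowercase characters come from non-nucleotide records such as waters, ions, or ligands (that is my guess, I have not confirmed it), so it keeps only `ATOM` records whose residue name is a standard RNA base, reads only the first model, and groups residues by chain using the fixed PDB column layout.

```python
RNA_RESIDUES = {"A", "C", "G", "U"}


def rna_sequences(pdb_text):
    """Return {chain_id: sequence} built from standard RNA residues only.

    Simplified sketch: HETATM records (waters, ligands, modified
    nucleotides) and protein residues are skipped entirely.
    """
    chains = {}
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # stop after the first model (e.g. NMR ensembles)
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        res_name = line[17:20].strip()  # columns 18-20: residue name
        if res_name not in RNA_RESIDUES:
            continue
        chain_id = line[21]             # column 22: chain identifier
        residue_key = (chain_id, line[22:27])  # residue number + insertion code
        if residue_key in seen:
            continue  # another atom of a residue already counted
        seen.add(residue_key)
        chains[chain_id] = chains.get(chain_id, "") + res_name
    return chains
```

One limitation worth noting: modified nucleotides that are stored as `HETATM` records would be dropped by this approach, so it may under-report compared to the FASTA sequence for some entries.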
Oh, I got sidetracked from the original question. But yeah, initializing with bytes can also be done.