bcbb
bcbb copied to clipboard
GFF module can write unreadable GFF3 sequences.
Discovered that GFF module supports writing out sequences with IDs that don't parse correctly on reading. If you submit a record to GFF for writing which has an ID which includes a space (e.g. gi|564292986| some protein
) then it fails to parse that upon trying to read the same file, because the code is looking for an integer there. Alternatively if the record ID contained numbers (gi|564292986| 324 5348
) then you could trick it into finding an incorrect sequence start/end value.
Traceback (most recent call last):
File "example.py", line 18, in <module>
for rec in GFF.parse(data):
File "/home/users/cpt/cpt/esr/Projects/cpt/dev/software/python-galaxy-tools/.venv/local/lib/python2.7/site-packages/BCBio/GFF/GFFParser.py", line 738, in parse
target_lines):
File "/home/users/cpt/cpt/esr/Projects/cpt/dev/software/python-galaxy-tools/.venv/local/lib/python2.7/site-packages/BCBio/GFF/GFFParser.py", line 328, in parse_in_parts
cur_dict = self._results_to_features(cur_dict, results)
File "/home/users/cpt/cpt/esr/Projects/cpt/dev/software/python-galaxy-tools/.venv/local/lib/python2.7/site-packages/BCBio/GFF/GFFParser.py", line 370, in _results_to_features
base = self._add_directives(base, results.get('directive', []))
File "/home/users/cpt/cpt/esr/Projects/cpt/dev/software/python-galaxy-tools/.venv/local/lib/python2.7/site-packages/BCBio/GFF/GFFParser.py", line 389, in _add_directives
val = (val[0], int(val[1]) - 1, int(val[2]))
ValueError: invalid literal for int() with base 10: 'conserved'
Code to reproduce the problem:
import StringIO
from BCBio import GFF
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
data = StringIO.StringIO()
record = SeqRecord(Seq("ACTG"), id="gi|564292986|ref|YP_008873682.1| conserved hypothetical protein [Staphylococcus phage Sb-1]")
GFF.write([record], data)
print "Wrote: ", data.getvalue()
# Open the GFF3 file
data.seek(0)
for rec in GFF.parse(data):
print rec.id
IMO, it would be best if the GFF module stripped everything after a space in record IDs when writing.
I am using the script glimmergff_to_proteins.py but I am getting this error. Can you suggest me how to troubleshoot since you are familiar with the code.
Traceback (most recent call last):
File "1.py", line 60, in