PhiSpy
Error when sequence ID is too long
There is a small issue: one of the Biopython functions has a character length limit on sequence IDs, so a more informative error message might be useful. A FASTA ID such as
>SEQID_TOO_LONG_BIOPY_HAS_CHAR_LIMIT
results in a GenBank file that causes PhiSpy to fail with the following traceback:
[USERID]$ PhiSpy.py testgenome.gb -o phispyTest
Traceback (most recent call last):
File "$PATH/anaconda3/bin/PhiSpy.py", line 125, in <module>
main(sys.argv)
File "$PATH/anaconda3/bin/PhiSpy.py", line 48, in main
args_parser.record = PhiSpyModules.SeqioFilter(filter(lambda x: len(x.seq) > args_parser.min_contig_size, SeqIO.parse(handle, "genbank")))
File "$PATH/anaconda3/lib/python3.8/site-packages/PhiSpyModules/seqio_filter.py", line 33, in __init__
for n, item in enumerate(content):
File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 73, in __next__
return next(self.records)
File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 516, in parse_records
record = self.parse(handle, do_features)
File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 499, in parse
if self.feed(handle, consumer, do_features):
File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 465, in feed
self._feed_first_line(consumer, self.line)
File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 1572, in _feed_first_line
raise ValueError("Did not recognise the LOCUS line layout:\n" + line)
ValueError: Did not recognise the LOCUS line layout:
LOCUS SEQID_TOO_LONG_BIOPY_HAS_CHAR_LIMIT bp DNA linear
Changing the ID to
>SEQID_SHORT
resolves the problem.
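For anyone who wants to check their input before running PhiSpy, here is a minimal sketch that lists FASTA IDs longer than a chosen limit. The function name `find_long_ids` and the constant `MAX_ID_LEN` are hypothetical, and the default of 16 characters is an assumption (the traditional fixed-column width of the GenBank LOCUS name); this thread does not state the actual limit, so adjust it to whatever your setup accepts.

```python
MAX_ID_LEN = 16  # assumed limit, not taken from Biopython; adjust as needed


def find_long_ids(lines, max_len=MAX_ID_LEN):
    """Return the FASTA IDs (first token of each header) longer than max_len."""
    long_ids = []
    for line in lines:
        if line.startswith(">"):
            # The ID is the text between '>' and the first whitespace.
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            if len(seq_id) > max_len:
                long_ids.append(seq_id)
    return long_ids
```

The function accepts any iterable of lines, so an open file handle for your FASTA file can be passed in directly. An empty result means no ID exceeds the assumed limit.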
Traceback (most recent call last):
File "/home/liu/miniconda3/envs/component/bin/PhiSpy.py", line 10, in
I have also encountered this issue, but I have hundreds of gbk files to process. Is there any way to batch-shorten the IDs in the files?
I met the same issue. Any clues on this?
Can you point me to a file where this issue occurs so that I can fix it?
Hi, I also had this issue. I initially tried to add the whitespace manually but that didn't work. My genbank files were annotated in PROKKA. Re-annotating using the --compliant flag for PROKKA fixed the issue for me as it parses the locus line in a different way.
@linsalrob @qianxin-kxy @jcmckerral thank you. The easy way would be to shorten the FASTA headers before running:
# replace every space in each file with a pipe
for i in *.fasta; do sed -i -e "s/ /|/g" "${i}"; done
# keep only the text before the first pipe (raise -f to keep more fields) and write it back to the file
for i in *.fasta; do cut -f 1 -d "|" "${i}" > "${i}.tmp" && mv "${i}.tmp" "${i}"; done
# all headers are now shortened.
Thank you
Gaurav
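If you would rather do the same thing in Python, a rough equivalent is sketched below. It keeps only the first whitespace-delimited token of each FASTA header and also truncates it to a fixed length. The names `shorten_header` and `shorten_fasta_file` are hypothetical, and the 16-character default is an assumption rather than a limit confirmed in this thread.

```python
from pathlib import Path

MAX_ID_LEN = 16  # assumed limit, not taken from Biopython; adjust as needed


def shorten_header(line, max_len=MAX_ID_LEN):
    """Reduce a FASTA header to its first token, cut to max_len characters."""
    if not line.startswith(">"):
        return line  # sequence lines pass through unchanged
    tokens = line[1:].split()
    seq_id = tokens[0] if tokens else ""
    return ">" + seq_id[:max_len] + "\n"


def shorten_fasta_file(path, max_len=MAX_ID_LEN):
    """Rewrite one FASTA file in place with shortened headers."""
    path = Path(path)
    lines = path.read_text().splitlines(keepends=True)
    path.write_text("".join(shorten_header(line, max_len) for line in lines))


if __name__ == "__main__":
    # Process every *.fasta file in the current directory, like the shell loops.
    for fasta in sorted(Path(".").glob("*.fasta")):
        shorten_fasta_file(fasta)
```

Note that truncation can make two different IDs identical, so it is worth checking that the shortened IDs are still unique within each file, and keeping a backup of the originals since the files are overwritten.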
@ShanlinKe @TSZUoE see my response in this thread above.
If you have the C++ code or a pointer declaration snippet, paste it here and I will do the conversion.