anvio
anvio copied to clipboard
anvi-gen-contigs-database: gzip'ed fasta input
The need
It appears that anvi-gen-contigs-database
cannot load gzip'ed fasta files:
File/Path Error: Someone is not happy with your FASTA file
'/path/to/file/assembly.fasta.gz' (this is
what the lib says: 'Fasta Lib Error: File
'/path/to/file/assembly.fasta.gz' does not
seem to be a FASTA file.'
The solution
This should be a simple feature addition, depending on the fasta parsing code
Beneficiaries
Most people gzip/bzip2 compress their genome fasta files, given the space savings
Notes
I'm using anvio 7.1
The same goes for anvi-script-reformat-fasta
:
$ anvi-script-reformat-fasta -o assembly.fasta assembly.fasta.gz
Traceback (most recent call last):
File "/tmp/global2/nyoungblut/code/Lreuteri_comp-genomics/conda_envs/Lreuteri-anvio/bin/anvi-script-reformat-fasta", line 216, in <module>
reformat_FASTA(args)
File "/tmp/global2/nyoungblut/code/Lreuteri_comp-genomics/conda_envs/Lreuteri-anvio/bin/anvi-script-reformat-fasta", line 75, in reformat_FASTA
fasta = u.SequenceSource(args.contigs_fasta)
File "/tmp/global2/nyoungblut/code/Lreuteri_comp-genomics/conda_envs/Lreuteri-anvio/lib/python3.6/site-packages/anvio/fastalib.py", line 104, in __init__
raise FastaLibError("File '%s' does not seem to be a FASTA file." % self.fasta_file_path)
anvio.fastalib.FastaLibError: Fasta Lib Error: File 'assembly.fasta.gz' does not seem to be a FASTA file.
It appears that you are not including a decode for if not self.file_pointer.read(1) == '>':
in fastalib.py
.
For example:
def decode(x):
try:
return x.decode()
except AttributeError:
return x
[...]
if not decode(self.file_pointer.read(1)) == '>':
[...]
Great idea. I will take care of this soon.
@meren would it be okay if I attempt a PR for this one?
Hi @vinisalazar, thanks a lot for looking into this. Of course you are more than welcome to send a PR.
I remember looking at his and realizing that addressing this properly will require much more than a single change as I had initially suspected, but if there is a PR we can perhaps start figuring out the rest of the problems!
Thanks @meren, the fix seemed to have been simple enough, I'm running some additional tests to make sure I didn't blow up anything.
Also, cc'ing @FlorianTrigodet as we were talking today about the possibility of compressing sequences in the contigs DB. Happy to try working on that in the future if it should become a feature.
I remember looking at his and realizing that addressing this properly will require much more than a single change as I had initially suspected
To elaborate, I was inclined to make some additional changes to the SequenceSource
class to e.g. programmatically infer whether the file is compressed rather than looking at the extension, but just thought I would keep it simple for a first PR.
Thank you, @vinisalazar. Please feel free to make additional changes.
Given that now both anvi-script-reformat-fasta
and anvi-gen-contigs-database
work with gzipped input files properly, I think another quick improvement could be to gzip the output file generated by anvi-script-reformat-fasta
if the output file name ends with .gz
:) That way from reformatting to the generation of contigs-db files, the storage would never have to see an uncompressed FASTA file.
But I think this issue is now addressed via #1997 and I will close it.