anvio icon indicating copy to clipboard operation
anvio copied to clipboard

anvi-gen-contigs-database: gzip'ed fasta input

Open nick-youngblut opened this issue 2 years ago • 2 comments

The need

It appears that anvi-gen-contigs-database cannot load gzip'ed fasta files:

File/Path Error: Someone is not happy with your FASTA file
                 '/path/to/file/assembly.fasta.gz' (this is
                 what the lib says: 'Fasta Lib Error: File
                 '/path/to/file/assembly.fasta.gz' does not
                 seem to be a FASTA file.'

The solution

This should be a simple feature addition, depending on the fasta parsing code

Beneficiaries

Most people gzip/bzip2 compress their genome fasta files, given the space savings

Notes

I'm using anvio 7.1

nick-youngblut avatar Apr 13 '22 13:04 nick-youngblut

The same goes for anvi-script-reformat-fasta:

$ anvi-script-reformat-fasta -o assembly.fasta assembly.fasta.gz
Traceback (most recent call last):
  File "/tmp/global2/nyoungblut/code/Lreuteri_comp-genomics/conda_envs/Lreuteri-anvio/bin/anvi-script-reformat-fasta", line 216, in <module>
    reformat_FASTA(args)
  File "/tmp/global2/nyoungblut/code/Lreuteri_comp-genomics/conda_envs/Lreuteri-anvio/bin/anvi-script-reformat-fasta", line 75, in reformat_FASTA
    fasta = u.SequenceSource(args.contigs_fasta)
  File "/tmp/global2/nyoungblut/code/Lreuteri_comp-genomics/conda_envs/Lreuteri-anvio/lib/python3.6/site-packages/anvio/fastalib.py", line 104, in __init__
    raise FastaLibError("File '%s' does not seem to be a FASTA file." % self.fasta_file_path)
anvio.fastalib.FastaLibError: Fasta Lib Error: File 'assembly.fasta.gz' does not seem to be a FASTA file.

It appears that you are not including a decode for if not self.file_pointer.read(1) == '>': in fastalib.py.

For example:

def decode(x):
    try:
        return x.decode()
    except AttributeError:
        return x

[...]

if not decode(self.file_pointer.read(1)) == '>':
   [...]

nick-youngblut avatar Apr 13 '22 13:04 nick-youngblut

Great idea. I will take care of this soon.

meren avatar Apr 14 '22 13:04 meren

@meren would it be okay if I attempt a PR for this one?

vinisalazar avatar Oct 25 '22 07:10 vinisalazar

Hi @vinisalazar, thanks a lot for looking into this. Of course you are more than welcome to send a PR.

I remember looking at his and realizing that addressing this properly will require much more than a single change as I had initially suspected, but if there is a PR we can perhaps start figuring out the rest of the problems!

meren avatar Oct 25 '22 08:10 meren

Thanks @meren, the fix seemed to have been simple enough, I'm running some additional tests to make sure I didn't blow up anything.

Also, cc'ing @FlorianTrigodet as we were talking today about the possibility of compressing sequences in the contigs DB. Happy to try working on that in the future if it should become a feature.

vinisalazar avatar Oct 25 '22 08:10 vinisalazar

I remember looking at his and realizing that addressing this properly will require much more than a single change as I had initially suspected

To elaborate, I was inclined to make some additional changes to the SequenceSource class to e.g. programmatically infer whether the file is compressed rather than looking at the extension, but just thought I would keep it simple for a first PR.

vinisalazar avatar Oct 25 '22 09:10 vinisalazar

Thank you, @vinisalazar. Please feel free to make additional changes.

Given that now both anvi-script-reformat-fasta and anvi-gen-contigs-database work with gzipped input files properly, I think another quick improvement could be to gzip the output file generated by anvi-script-reformat-fasta if the output file name ends with .gz :) That way from reformatting to the generation of contigs-db files, the storage would never have to see an uncompressed FASTA file.

But I think this issue is now addressed via #1997 and I will close it.

meren avatar Oct 25 '22 11:10 meren