genomepy icon indicating copy to clipboard operation
genomepy copied to clipboard

storing bgzip-compressed genomes

Open golobor opened this issue 5 years ago • 12 comments

hi! Many thanks for writing this amazing library!! One issue that personally prevents my lab from switching to genomepy is that it seems to store genomes in uncompressed .fa files. Our concern is that, with potentially many genomes that we have to deal with, the library of genomes will take a lot of space; moreover, given that we use network storage, storing data uncompressed will reduce the I/O performance.

Have you considered allowing optional compression of genomes with bgzip? bgzip plays well with faidx/pyfaidx and does not have any downsides, at least as much as we're concerned.

Thank you! Anton.

golobor avatar Jun 11 '19 19:06 golobor

This sounds like a great suggestion, thanks! I vaguely remember there was I reason I chose to use uncompressed fasta files, but this might be completely unnecessary. I will look into this!

simonvh avatar Jun 12 '19 07:06 simonvh

awesome! Please, let me know if you need any help or would like to bounce ideas! Re: potential issues - agreed, in my experience, most if not all tools accept bzgip-compressed files.

On Wed, 12 Jun 2019 at 03:02, Simon van Heeringen [email protected] wrote:

This sounds like a great suggestion, thanks! I vaguely remember there was I reason I chose to use uncompressed fasta files, but this might be completely unnecessary. I will look into this!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/simonvh/genomepy/issues/41?email_source=notifications&email_token=AAG64CTRKUW3RB4ECLCGHN3P2CNQ3A5CNFSM4HXCZ7HKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXPOM5A#issuecomment-501147252, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG64CSVMTHYKLJH4EBME7LP2CNQ3ANCNFSM4HXCZ7HA .

golobor avatar Jun 12 '19 19:06 golobor

@simonvh Please note that the pyfaidx bgzip routines are not as efficient as samtools yet. See https://github.com/mdshw5/pyfaidx/issues/126#issuecomment-432654422. I'd love to fully implement bgzip sequence region retrieval in the next pyfaidx release, and will update this issue at that time.

The work required is to store the list of BGZF blocks, along with their uncompressed offsets in the file (see the initial work I started on this here: https://github.com/mdshw5/pyfaidx/commit/db7f140ce97905d22c2280601f5234dc67711669#diff-9ca6d0b185ffae472a824fcdf50f0f9eR285) and then do a binary search to find the first block left of the required uncompressed offset to start reading from, read and decompress the required sequence, and then trim the resulting string so that it's the correct length before returning.

BGZF retrieval will always have a slight overhead compared to uncompressed FASTA, but I believe the tradeoff is well worth it.

mdshw5 avatar Jun 14 '19 18:06 mdshw5

Thanks for pitching in @mdshw5! At the moment I'm leaning towards bgzip-compressed genomes as a configurable option. In that case, I think a slight performance penalty would not be a problem. I'll be sure to document it.

There are currently still some tools that don't accept bgzipped genomes:

  • bedtools getfasta- this can be replaced by faidx, so not a showstopper, however, some people might prefer to be able to use bedtools.
  • Some aligners don't work on compressed FASTA files. This is also not a big deal, it just means that the index plugins need to do some extra work, depend

simonvh avatar Jun 24 '19 09:06 simonvh

I've almost completed full implementation of BGZF indexing, and when all test are passing I'll update this thread with a new release of pyfaidx that uses the samtools/tabix .gzi block index.

mdshw5 avatar Jun 24 '19 14:06 mdshw5

Sounds fantastic!

simonvh avatar Jun 25 '19 07:06 simonvh

@golobor release 0.6.0 of genomepy now has this functionality. Please let me know if there are bugs or if it doesn't work as expected.

I'm leaving this issue open so that @mdshw5 can update when pyfaidx is updated.

simonvh avatar Sep 11 '19 08:09 simonvh

wow, thanks a lot! It's going to take some time for me to test it, but I 100% will. Thank you for your work!!

golobor avatar Sep 11 '19 09:09 golobor

I've been busy with some other work, but the "correct" bgzf access code is about 90% complete. I'll update here when it's ready.

mdshw5 avatar Sep 11 '19 14:09 mdshw5

I noticed the GRCh38.p13.fa.sizes and gaps files are empty when using compression. Not sure if the two are linked but removing compression produced the expected sizes file.

peterch405 avatar Nov 13 '20 02:11 peterch405

Hi @peterch405, I've tried to test this with the latest version (0.9.1) and cannot reproduce this. Did you spot the .gz extension in the filename?

genomepy install GRCh38.p13 -b
head ~/.local/share/genomes/GRCh38.p13/GRCh38.p13.fa.gz.sizes
1       248956422
10      133797422
11      135086622
12      133275309
13      114364328
14      107043718
15      101991189
16      90338345
17      83257441
18      80373285

If this problem persists we can look at it in a separate issue!

siebrenf avatar Nov 13 '20 15:11 siebrenf

I tried it again on my local machine and it works. Must have been something about the cluster environment. I had a separate issue, but I think unrelated. I will post a new issue.

peterch405 avatar Nov 17 '20 06:11 peterch405