genomepy
genomepy copied to clipboard
storing bgzip-compressed genomes
hi! Many thanks for writing this amazing library!! One issue that personally prevents my lab from switching to genomepy is that it seems to store genomes in uncompressed .fa files. Our concern is that, with potentially many genomes that we have to deal with, the library of genomes will take a lot of space; moreover, given that we use network storage, storing data uncompressed will reduce the I/O performance.
Have you considered allowing optional compression of genomes with bgzip? bgzip plays well with faidx/pyfaidx and does not have any downsides, at least as much as we're concerned.
Thank you! Anton.
This sounds like a great suggestion, thanks! I vaguely remember there was I reason I chose to use uncompressed fasta files, but this might be completely unnecessary. I will look into this!
awesome! Please, let me know if you need any help or would like to bounce ideas! Re: potential issues - agreed, in my experience, most if not all tools accept bzgip-compressed files.
On Wed, 12 Jun 2019 at 03:02, Simon van Heeringen [email protected] wrote:
This sounds like a great suggestion, thanks! I vaguely remember there was I reason I chose to use uncompressed fasta files, but this might be completely unnecessary. I will look into this!
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/simonvh/genomepy/issues/41?email_source=notifications&email_token=AAG64CTRKUW3RB4ECLCGHN3P2CNQ3A5CNFSM4HXCZ7HKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXPOM5A#issuecomment-501147252, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG64CSVMTHYKLJH4EBME7LP2CNQ3ANCNFSM4HXCZ7HA .
@simonvh Please note that the pyfaidx bgzip routines are not as efficient as samtools yet. See https://github.com/mdshw5/pyfaidx/issues/126#issuecomment-432654422. I'd love to fully implement bgzip sequence region retrieval in the next pyfaidx release, and will update this issue at that time.
The work required is to store the list of BGZF blocks, along with their uncompressed offsets in the file (see the initial work I started on this here: https://github.com/mdshw5/pyfaidx/commit/db7f140ce97905d22c2280601f5234dc67711669#diff-9ca6d0b185ffae472a824fcdf50f0f9eR285) and then do a binary search to find the first block left of the required uncompressed offset to start reading from, read and decompress the required sequence, and then trim the resulting string so that it's the correct length before returning.
BGZF retrieval will always have a slight overhead compared to uncompressed FASTA, but I believe the tradeoff is well worth it.
Thanks for pitching in @mdshw5! At the moment I'm leaning towards bgzip-compressed genomes as a configurable option. In that case, I think a slight performance penalty would not be a problem. I'll be sure to document it.
There are currently still some tools that don't accept bgzipped genomes:
-
bedtools getfasta
- this can be replaced byfaidx
, so not a showstopper, however, some people might prefer to be able to usebedtools
. - Some aligners don't work on compressed FASTA files. This is also not a big deal, it just means that the index plugins need to do some extra work, depend
I've almost completed full implementation of BGZF indexing, and when all test are passing I'll update this thread with a new release of pyfaidx
that uses the samtools/tabix .gzi
block index.
Sounds fantastic!
@golobor release 0.6.0 of genomepy now has this functionality. Please let me know if there are bugs or if it doesn't work as expected.
I'm leaving this issue open so that @mdshw5 can update when pyfaidx
is updated.
wow, thanks a lot! It's going to take some time for me to test it, but I 100% will. Thank you for your work!!
I've been busy with some other work, but the "correct" bgzf access code is about 90% complete. I'll update here when it's ready.
I noticed the GRCh38.p13.fa.sizes and gaps files are empty when using compression. Not sure if the two are linked but removing compression produced the expected sizes file.
Hi @peterch405, I've tried to test this with the latest version (0.9.1) and cannot reproduce this.
Did you spot the .gz
extension in the filename?
genomepy install GRCh38.p13 -b
head ~/.local/share/genomes/GRCh38.p13/GRCh38.p13.fa.gz.sizes
1 248956422
10 133797422
11 135086622
12 133275309
13 114364328
14 107043718
15 101991189
16 90338345
17 83257441
18 80373285
If this problem persists we can look at it in a separate issue!
I tried it again on my local machine and it works. Must have been something about the cluster environment. I had a separate issue, but I think unrelated. I will post a new issue.