raredisease
generate compressed vcf outputs
Description of feature
Some of the annotation programs used in the pipeline only generate vcf outputs. It would be good to make changes to those modules so that they can also generate compressed vcf outputs.
I ran this test
nextflow run main.nf -profile test,docker --outdir results
The only uncompressed vcf-file I could find in the outdir was results/annotate_mt/justhusky_vep_vcfanno_hmtnote_mt_annotated.vcf.
It turns out that currently both the uncompressed and compressed versions of the above-mentioned vcf-file are getting published. I'll just disable the publishing of the uncompressed vcf-file.
Are there any other uncompressed vcf-files being published?
I found these files which are a bit large and can be compressed:
results/qc_bam/*.d4
results/qc_bam/*.wig
results/qc_bam/*.bw
Should I try to have them compressed?
As far as I know, compressed versions of these files cannot be used by downstream tools, so if users are actively using them, they'd want them uncompressed. These files can always be compressed outside of our pipeline for archiving, so I'd leave this as it is.
Okay, well, then I can't find any output-files from the raredisease-pipeline that need to be compressed. Do you know of any?
Not really. I haven't checked, but do you know if tools like vcfanno and svdb query are capable of producing compressed vcf files as outputs? If the tools can't, perhaps we can update the modules with an option to run bgzip on the output so they can produce compressed files? I am thinking a boolean flag like this. What do you think?
I think that neither vcfanno nor svdb-query can output compressed VCF-files.
Brent of vcfanno suggested just piping to a compressor tool: https://github.com/brentp/vcfanno/issues/66
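For illustration, the piping approach could look something like the sketch below. It is only a sketch: `pipe_to_compressor` is a hypothetical helper, not something from vcfanno or the pipeline, and the commands in the docstring (`conf.toml`, `input.vcf`) are placeholder names. In a module script this would normally just be a shell pipe; the Python version only makes the two-process structure explicit.

```python
import subprocess


def pipe_to_compressor(producer_cmd, compressor_cmd, out_path):
    """Stream the stdout of producer_cmd through compressor_cmd into out_path.

    Hypothetical example arguments (file names are placeholders):
        producer_cmd   = ["vcfanno", "conf.toml", "input.vcf"]
        compressor_cmd = ["bgzip", "-c"]
    """
    with open(out_path, "wb") as out:
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        compressor = subprocess.Popen(
            compressor_cmd, stdin=producer.stdout, stdout=out
        )
        # Close our copy of the pipe so the producer is notified
        # if the compressor exits early.
        producer.stdout.close()
        compressor_rc = compressor.wait()
        producer_rc = producer.wait()
    if producer_rc != 0 or compressor_rc != 0:
        raise RuntimeError(
            f"pipeline failed (producer={producer_rc}, compressor={compressor_rc})"
        )
    return out_path
```

Since the uncompressed VCF is never written to disk this way, it would also keep the work directory smaller.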
Nice! Perhaps we can modify vcfanno in nf-core/modules (so it has the option to generate compressed output) and then update the pipeline to use that version?
I'm not sure that is the right way to go. (I get the impression that nf-core likes modules to do just one thing, but I could be wrong.)
As far as I can tell, what you are doing now is fine:
https://github.com/nf-core/raredisease/blob/fdfb4a7c169b4ff3a0c5e76f85420ae6e84ec6d9/subworkflows/local/mitochondria/merge_annotate_MT.nf#L109-L113
No VCF-file is published from VCFANNO_MT; instead, its output is sent to HMTNOTE_ANNOTATE for annotation, and the annotated VCF-file is then sent to ZIP_TABIX_HMTNOTE, where it gets bgzipped and a corresponding TBI-file is created. Both the bgzipped annotated VCF-file and the TBI-file then get published.
Hmmm.. I am not certain which way the community swings when it comes to adding functionality like generating compressed outputs in a module. Perhaps we should bring this up on slack 😄
That's true, but I was thinking that the work directory will get bloated with the uncompressed vcf. I do not have experience with cloud services, but maybe this will result in increased costs for the user? These files can easily take up a couple of gigabytes, and that can add up over time.
I got the impression that the idea is to delete the work-folder after the successful completion of the pipeline. Still, I guess one wouldn't want the work-folder to be unnecessarily large. Let's see what @maxulysse has to say about this 😊
I'm happy with adding gzip in the module for compression. We are trying to set up gold standards, and I believe that reducing data footprint is a good idea in any case
Doing some experiments on this. It seems that bgzip isn't available in the container that is used for hmtnote, but gzip is. Is it okay to use gzip or does it have to be bgzip?
If bgzip isn't available in the container, I would leave it as is. It needs to be bgzipped in order to be indexed and potentially merged back with the SNV vcf.
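To make the gzip/bgzip distinction concrete: a bgzipped file is a series of gzip blocks, each carrying a `BC` extra subfield in its header, and that block structure is what tabix indexing relies on. Plain gzip output lacks it. A minimal sketch of a header check follows; `is_bgzf` is a hypothetical helper name, and it only inspects the first block.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12:
            return False
        # gzip header: ID1, ID2, CM, FLG, MTIME, XFL, OS, XLEN (little-endian)
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack(
            "<BBBBIBBH", header
        )
        # Must be gzip/deflate with the FEXTRA flag (bit 2) set.
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 0x04:
            return False
        extra = fh.read(xlen)

    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

A file produced by plain `gzip` would return `False` here, which is why it could not be indexed with tabix.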
What about adding tabix as a dependency in the container?
But yeah, if you need to merge it back, you might not want to compress it
Do you mean we create a mulled container?
Yeah, that's what I meant, if we want to add this functionality.
I second that idea 👍🏻
Aren't mulled containers causing problems and frustration from time to time? Not sure it is worthwhile.
One could look into adding it to the standard conda recipe of vcfanno to get it into the biocontainer; however, it feels a little like we would be hijacking that conda recipe for our own needs.