raredisease

generate compressed vcf outputs

Open ramprasadn opened this issue 3 years ago • 19 comments

Description of feature

Some of the annotation programs used in the pipeline only generate vcf outputs. It would be good to make changes to those modules so that they can also generate compressed vcf outputs.

ramprasadn avatar Nov 04 '22 07:11 ramprasadn

I ran this test

nextflow run main.nf -profile test,docker --outdir results

The only uncompressed vcf-file I could find in the outdir was results/annotate_mt/justhusky_vep_vcfanno_hmtnote_mt_annotated.vcf.

It turns out that currently both the uncompressed and the compressed versions of the abovementioned vcf-file are getting published. I'll just disable the publishing of the uncompressed vcf-file.

Are there any other uncompressed vcf-files being published?

asp8200 avatar Jun 29 '23 11:06 asp8200

I found these files which are a bit large and can be compressed:

results/qc_bam/*.d4
results/qc_bam/*.wig
results/qc_bam/*.bw

Should I try to have them compressed?

asp8200 avatar Jun 29 '23 13:06 asp8200

As far as I know, compressed versions of these files cannot be used by downstream tools, so if users are actively using them, they'd want them uncompressed. These files can always be compressed outside of our pipeline for archiving, so I'd leave this as it is.

ramprasadn avatar Jun 30 '23 11:06 ramprasadn

Okay, well, then I can't find any output-files from the raredisease-pipeline that need to be compressed. Do you know of any?

asp8200 avatar Jun 30 '23 11:06 asp8200

Not really. I haven't checked, but do you know if tools like vcfanno and svdb query are capable of producing compressed vcf files as outputs? If the tools can't, perhaps we can update the modules with an option to run bgzip on the output so they can produce compressed files? I am thinking a boolean flag like this. What do you think?

ramprasadn avatar Jun 30 '23 11:06 ramprasadn

I think that neither vcfanno nor svdb-query can output compressed VCF-files.

Brent of vcfanno suggested just piping to a compressor tool: https://github.com/brentp/vcfanno/issues/66

asp8200 avatar Jun 30 '23 11:06 asp8200
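
The piping suggestion above could look roughly like the sketch below. It assumes that the annotator writes VCF to stdout (as vcfanno does) and that `bgzip` from htslib is on the `PATH`. The function name `stream_compress` and the `COMPRESSOR` variable are hypothetical, introduced only for illustration; they are not part of any existing module.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Compressor to pipe into. bgzip (htslib) is the default because tabix can
# only index BGZF files. The override is a hypothetical knob for illustration.
COMPRESSOR="${COMPRESSOR:-bgzip}"

# Usage: stream_compress OUT COMMAND [ARGS...]
# Runs COMMAND, which must write its VCF to stdout, and compresses the stream
# straight into OUT, so no uncompressed VCF is written to disk.
stream_compress() {
    local out="$1"
    shift
    "$@" | "$COMPRESSOR" -c > "$out"
}
```

With placeholder file names, a call might be `stream_compress annotated.vcf.gz vcfanno conf.toml input.vcf`, followed by `tabix -p vcf annotated.vcf.gz` to create the index. Because of `pipefail`, a failure in the annotator is not masked by a successful compressor.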

Nice! Perhaps we can modify vcfanno in nf-core/modules (so it has the option to generate compressed output) and then update the pipeline to use that version?

ramprasadn avatar Jul 04 '23 16:07 ramprasadn

I'm not sure that is the right way to go. (I get the impression that nf-core likes modules to do just one thing, but I could be wrong.)

As far as I can tell, what you are doing now is fine:

https://github.com/nf-core/raredisease/blob/fdfb4a7c169b4ff3a0c5e76f85420ae6e84ec6d9/subworkflows/local/mitochondria/merge_annotate_MT.nf#L109-L113

No VCF-file is published from VCFANNO_MT. Instead, it is sent to HMTNOTE_ANNOTATE for annotation, and then the annotated VCF-file is sent to ZIP_TABIX_HMTNOTE, where it gets bgzipped and a corresponding TBI-file is created. Both the bgzipped annotated VCF-file and the TBI-file then get published.

asp8200 avatar Jul 04 '23 16:07 asp8200

Hmmm.. I am not certain which way the community swings when it comes to adding functionalities like generating compressed outputs in a module. Perhaps we should bring this up on slack 😄

That's true, but I was thinking that the work directory will get bloated with the uncompressed vcf. I do not have experience with cloud services, but maybe this will result in increased costs for the user? These files can easily take up a couple of gigabytes, and that can add up over time.

ramprasadn avatar Jul 06 '23 07:07 ramprasadn

I got the impression that the idea is to delete the work-folder after the successful completion of the pipeline. Still, I guess one wouldn't want the work-folder to be unnecessarily large. Let's see what @maxulysse has to say about this 😊

asp8200 avatar Jul 06 '23 08:07 asp8200

I'm happy with adding gzip in the module for compression. We are trying to set up gold standards, and I believe that reducing the data footprint is a good idea in any case.

maxulysse avatar Jul 06 '23 08:07 maxulysse

Doing some experiments on this. It seems that bgzip isn't available in the container that is used for hmtnote, but gzip is. Is it okay to use gzip or does it have to be bgzip?

asp8200 avatar Jul 06 '23 11:07 asp8200

If bgzip isn't available in the container, I would leave it as is. It needs to be bgzipped in order to be indexed and potentially merged back with the SNV vcf.

jemten avatar Jul 06 '23 11:07 jemten
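
As background to the gzip-versus-bgzip point above: bgzip writes BGZF, a series of small gzip blocks whose headers carry an extra `BC` subfield, and tabix can only index that layout. A plain `gzip` file decompresses fine but cannot be indexed. The sketch below is a rough header check, assuming the `BC` subfield is the first extra field (as it is in files written by htslib). The function name `is_bgzf` is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Usage: is_bgzf FILE
# Succeeds if FILE starts with a BGZF block header:
#   bytes 1-4   = 1f 8b 08 04  (gzip magic, deflate, FEXTRA flag set)
#   bytes 13-14 = 42 43        (extra subfield identifier "BC")
is_bgzf() {
    local magic subfield
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    subfield=$(od -An -tx1 -j12 -N2 "$1" | tr -d ' \n')
    [ "$magic" = "1f8b0804" ] && [ "$subfield" = "4243" ]
}
```

A file produced by `gzip` normally fails this check because its header does not set the FEXTRA flag, which is why a gzipped VCF would have to be decompressed and bgzipped again before it could be indexed.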

What about adding tabix as a dependency in the container?

But yeah, if you need to merge it back, you might not want to compress it

maxulysse avatar Jul 06 '23 12:07 maxulysse

Do you mean we create a mulled container?

ramprasadn avatar Jul 06 '23 12:07 ramprasadn

yeah, that's what I meant if we want to add this functionality

maxulysse avatar Jul 06 '23 12:07 maxulysse

I second that idea 👍🏻

ramprasadn avatar Jul 06 '23 12:07 ramprasadn

Aren't mulled containers causing problems and frustration from time to time? Not sure it is worthwhile.

asp8200 avatar Jul 06 '23 13:07 asp8200

One could look into adding it to the standard conda recipe of vcfanno to get it into the biocontainer. However, it feels a little like we would be hijacking that conda recipe for our own needs.

jemten avatar Jul 06 '23 13:07 jemten