
Efficiently perform simple merge/append on many VCFs?

Open bbimber opened this issue 8 months ago • 4 comments

Hello,

This isn't exactly a GATK question, but I think many GATK users face this problem. Our current practice is to run many GATK-based jobs as scatter/gather, meaning hundreds or thousands of individual jobs, each operating on its own set of coordinates. The result is N VCFs that we need to append to one another and bgzip to make a final VCF. This process is painfully slow when dealing with many jobs and large VCFs. Are there simple Linux tricks to make this faster?

Our current pattern is something like this:

#!/bin/bash

# First, write the VCF header to a file by itself, named header.vcf.

# Then zcat each VCF in a block, dropping its header lines, and pipe everything to bgzip:
{
cat header.vcf;
zcat vcf1.vcf.gz | grep -v '^#';
zcat vcf2.vcf.gz | grep -v '^#';
zcat vcf3.vcf.gz | grep -v '^#';
zcat vcf4.vcf.gz | grep -v '^#';
# etc....
} | bgzip -f --threads XX > finalVcf.vcf.gz
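
To make the pattern above concrete for an arbitrary number of shards, here is a minimal sketch of the same approach written as a loop. It is illustrative only: the function name `concat_vcfs` and the `COMPRESSOR` override are hypothetical, and it assumes the shards are passed in genomic order and that header.vcf already exists. It is no faster than the block above, since every shard is still decompressed and recompressed.

```shell
#!/bin/bash

# Sketch of the append pattern as a loop (names here are hypothetical).
# Usage: concat_vcfs header.vcf out.vcf.gz shard1.vcf.gz shard2.vcf.gz ...
# Assumes shards are given in the order they should appear in the output.
concat_vcfs() {
    local header=$1 out=$2 vcf
    shift 2
    {
        # Emit the shared header exactly once.
        cat "$header"
        # Emit only the record lines of each shard.
        for vcf in "$@"; do
            zcat "$vcf" | grep -v '^#'
        done
    } | ${COMPRESSOR:-bgzip} -c > "$out"   # bgzip unless COMPRESSOR is set
}
```

It would be called as `concat_vcfs header.vcf finalVcf.vcf.gz vcf1.vcf.gz vcf2.vcf.gz`, with the shard list typically built from a sorted file list.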

I hope this slightly off-topic question is alright here. I'd appreciate any suggestions people might have.

bbimber • Feb 18 '25 14:02