bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

bcftools annotate using large annotation VCF is slow regardless of input VCF size

Open oleraj opened this issue 4 years ago • 11 comments

Hi,

I'm trying to annotate a VCF with only 248 variants using the gnomad genomes VCF as annotation and it takes about 2 hours to run. Both the input VCF and the annotation VCF are indexed with tabix so I'm not sure why it would need to take such a long time. Oddly enough, when I annotate a VCF with 24K variants using the same gnomad VCF as annotation it also takes about 2 hours to run. And the same if I start with a VCF with only 8 variants. If each variant is looked up in the annotation VCF using the index, it seems like the speed would be proportional to the size of the input VCF. Is the entire annotation VCF stored in memory first before the annotation is done? Is there any way to speed up the annotation?

Here's my command:

bcftools annotate -a /data/gnomad.genomes.r2.0.1.sites.vt.vcf.gz --collapse 'none' -c "aaf_gnomad_gen:=AF,gnomad_gen_popmax_aaf:=AF_POPMAX,gnomad_gen_popmax:=POPMAX" -O z -o test.gnomad.vcf.gz  test.vcf.gz

I'm using bcftools 1.10. FYI, the gnomad genomes VCF is about 84GB and the tabix index is 2.7MB.

Thanks,

Andrew

oleraj avatar Apr 13 '20 16:04 oleraj

One idea to try might be to use a CSI index instead of a TBI index and use a smaller minimum interval size. You could generate a new index with something like bcftools index --min-shift 9 /data/gnomad.genomes.r2.0.1.sites.vt.vcf.gz, then rename/remove the old tabix index before running annotate again so it doesn't accidentally get picked up.

This idea is based mostly on this: https://github.com/brentp/vcfanno/issues/44#issuecomment-351510778, which helped me a while ago when I was trying to annotate with CADD (another large annotation source).

garrettjstevens avatar Apr 13 '20 17:04 garrettjstevens

I don't think csi index will make any difference, the problem is most likely related to these issues https://github.com/samtools/bcftools/issues/943, https://github.com/samtools/bcftools/issues/1047. Can you try with the --single-overlaps option?

pd3 avatar Apr 14 '20 08:04 pd3

Thanks for the idea @garrettjstevens. I tried this and it still took ~2 hours to run.
Thanks @pd3, I'll try that next.

oleraj avatar Apr 14 '20 18:04 oleraj

@pd3 I tested this option and it still took ~2 hours.

To see if it might be related to other features that have been added in the past I tested bcftools versions 1.2, 1.3, 1.4.1, 1.6 and 1.9. Versions 1.2 and 1.3 didn't work due to the --collapse option missing but the other versions also took ~2 hours (using the VCF with 8 variants), so it seems to be a long-standing issue.

oleraj avatar Apr 15 '20 19:04 oleraj

Sorry for the delay in responding. I looked at this again and realized what the problem is: when annotate transfers annotations from one file into another, it uses htslib's synchronized reading API which streams through both files simultaneously and throws away sites that are not found in both files. This is of course not great for your application.

A quick and easy way around this is to create a region file and restrict to these regions:

bcftools query -f'%CHROM\t%POS\n' small.vcf.gz > sites.txt
bcftools annotate -R sites.txt ....

In addition, the gnomad VCF is big and therefore very slow to parse. The performance can be further improved by converting to a BCF first. In this case it will also reduce the size of the file by about half.

bcftools view gnomad.vcf.gz -Ob -o gnomad.bcf

pd3 avatar May 06 '20 15:05 pd3

Thanks @pd3, this is very helpful. I'll try out your suggestions.

oleraj avatar May 06 '20 15:05 oleraj

Sorry for the delay in responding. I looked at this again and realized what the problem is: when annotate transfers annotations from one file into another, it uses htslib's synchronized reading API which streams through both files simultaneously and throws away sites that are not found in both files. This is of course not great for your application.

A quick and easy way around this is to create a region file and restrict to these regions:

bcftools query -f'%CHROM\t%POS\n' small.vcf.gz > sites.txt
bcftools annotate -R sites.txt ....

In addition, the gnomad VCF is big and therefore very slow to parse. The performance can be further improved by converting to a BCF first. In this case it will also reduce the size of the file by about half.

bcftools view gnomad.vcf.gz -Ob -o gnomad.bcf

It appears that this method doesn't always speed up and depends on the position of variants.

For example I have:

  • 1kg_af.txt.gz: bgzipped txt with columns (chromosome, start, end, alt, af) which contains all variants and AF from 1000G, ~0.5GB in size
  • input.vcf.gz: a VCF with around 3200 variants on chromosome 1-22

Without region filter, time taken is around 32 seconds:

bcftools annotate -a 1kg_af.txt.gz -h header.txt -c CHROM,FROM,TO,ALT,AF -Oz -o input.ann.vcf.gz input.vcf.gz

When filtered by all sites, time taken is still around 32 seconds:

bcftools query -f'%CHROM\t%POS\n' input.vcf.gz > sites.txt
bcftools annotate -R sites.txt ....

If filtered by first half of sites, time taken is reduced to around 21s:

head -n 1600 sites.txt > sites.half.txt
bcftools annotate -R sites.half.txt ....

Similarly, with first quarter of sites, time taken is further reduced to around 11s:

head -n 800 sites.txt > sites.quarter.txt
bcftools annotate -R sites.quarter.txt ....

However, if random quarter of sites are used as filter, time taken is still around 32s:

shuf sites.txt | head -n 800 | sort -k1,1 -k2,2n > sites.shuf.quarter.sort.txt
bcftools annotate -R sites.shuf.quarter.sort.txt ....

When only one site is considered per chromosome, time taken is just around 4s:

for i in {1..22}; do
    grep -P "^$i\t" sites.txt | awk 'NR==1' >> sites.chrom1.txt;
done
bcftools annotate -R sites.chrom1.txt ....

However, if both first and last sites are considered per chromosome, then time taken remains at around 32s:

for i in {1..22}; do
    grep -P "^$i\t" sites.txt | awk 'NR==1; END{print}' >> sites.chrom2.txt;
done
bcftools annotate -R sites.chrom2.txt ....

lacek avatar Jul 09 '20 09:07 lacek

I am also have this same exact issue and it seems to me that there are some unexplained issues. I am using a ~9.2GB annotation file available here and using the following to generate a test VCF to be annotated:

(echo "##fileformat=VCFv4.2";
echo "##contig=<ID=chrX,length=156040895>";
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO";
echo -e "chrX\t77361852\t.\tAATAT\tA\t.\t.\t.";
echo -e "chrX\t79395749\t.\tC\tT\t.\t.\t.";
echo -e "chrX\t80726216\t.\tTAACATATAACAC\tT\t.\t.\t.";
echo -e "chrX\t85220714\t.\tC\tA\t.\t.\t.";
echo -e "chrX\t86895704\t.\tC\tT\t.\t.\t.";
echo -e "chrX\t96540352\t.\tT\tG\t.\t.\t.";
echo -e "chrX\t103549770\t.\tT\tC\t.\t.\t.";
echo -e "chrX\t105539532\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t109217387\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t114616228\t.\tG\tGT\t.\t.\t.";
echo -e "chrX\t115690491\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t117568750\t.\tG\tA\t.\t.\t.";
echo -e "chrX\t119754099\t.\tA\tAC\t.\t.\t.";
echo -e "chrX\t121549098\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t124339104\t.\tT\tA\t.\t.\t.";
echo -e "chrX\t129769496\t.\tG\tT\t.\t.\t.";
echo -e "chrX\t132161909\t.\tC\tG\t.\t.\t.";
echo -e "chrX\t133786911\t.\tA\tT\t.\t.\t.";
echo -e "chrX\t135877733\t.\tG\tA\t.\t.\t.";
echo -e "chrX\t150353604\t.\tC\tG\t.\t.\t.";
echo -e "chrX\t154603140\t.\tC\tCTTAT\t.\t.\t.") | \
  bcftools view --no-version -Ob -o sites.bcf && \
  bcftools index --force sites.bcf
bcftools query -f'%CHROM\t%POS\n' sites.bcf > sites.txt

Then using the sites.txt file makes things very quick:

$ time bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf -R sites.txt -o /dev/null

real	0m0.063s
user	0m0.043s
sys	0m0.020s

Restricting to chrX as a region is much slower, but still acceptable:

$ time bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf -r chrX -o /dev/null

real	0m9.039s
user	0m8.957s
sys	0m0.081s

However, if I try without selecting a region, I get:

$ time bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf -o /dev/null

real	3m44.256s
user	3m40.175s
sys	0m3.945s

Most notably, when I run the command:

$ bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf

After approximately ten seconds I get the (almost) full output:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chrX,length=156040895>
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chrX	77361852	.	AATAT	A	.	.	RS=202118283
chrX	79395749	.	C	T	.	.	RS=12559773
chrX	80726216	.	TAACATATAACAC	T	.	.	RS=754378013
chrX	85220714	.	C	A	.	.	RS=2188732
chrX	86895704	.	C	T	.	.	RS=756526961
chrX	96540352	.	T	G	.	.	RS=57343194
chrX	103549770	.	T	C	.	.	RS=1997648
chrX	105539532	.	A	G	.	.	RS=5962543
chrX	109217387	.	A	G	.	.	RS=200106209
chrX	114616228	.	G	GT	.	.	RS=781786026
chrX	115690491	.	A	G	.	.	RS=12836051
chrX	117568750	.	G	A	.	.	RS=149942029
chrX	119754099	.	A	AC	.	.	RS=75284556
chrX	121549098	.	A	G	.	.	RS=191376079
chrX	124339104	.	T	A	.	.	RS=200787584
chrX	129769496	.	G	T	.	.	RS=2281133
chrX	132161909	.	C	G	.	.	RS=149590924
chrX	133786911	.	A	T	.	.	RS=144670363
chrX	135877733	.	G	A	.	.	RS=1218635101
chrX	150353604	.	C	G	.	.	RS=5925454
chrX	154603140	.	C	CTTAT	.	.	RS=20165898

(one character is still missing at the end as the output is buffered). And yet, the process will take several minutes to complete and bcftools is using 100% of the CPU meanwhile. What is HTSlib/BCFtools doing? It seems like streaming chromosome X through both files simultaneously completes in ten seconds but after that something else that is not necessary and time consuming is happening. My guess is that HTSlib/BCFtools did not realize the first file reached the end and it is streaming through the whole of the second file. Is this the expected behavior? And when -r/-R is not used why does bcftools annotate require the VCFs to be indexed if it is going to parse through all of them anyway?

The problem was reproducible on separate Linux machines using bcftools 1.14 and bcftools 1.15.1

freeseek avatar Aug 09 '22 18:08 freeseek