htslib icon indicating copy to clipboard operation
htslib copied to clipboard

tabix returns row from VCF file multiple times

Open jwimberl opened this issue 1 year ago • 4 comments

Using tabix version 1.9 on CentOS 7

I am querying a public indexed dataset (gnomad annotations from the Broad institute) using tabix together with a regions file in BED format, and found that that in some cases a matching row may be returned multiple time. Two cases in which no duplicates of a particular annotation at position 1007743 are returned are when the BED file contains a single region (region.bed)

chr1	1007743	1007748

or a disjoint region on the left (region_flL.bed):

chr1	1007736	1007739
chr1	1007743	1007748

However, adding a disjoint region on the right leads to the issue (region_flR.bed):

chr1	1007743	1007748
chr1	1007760	1007761

The following script reproduces the issue (assuming AWS credentials have been loaded into the appropriate environment variables):

#!/bin/bash
tabix -hR region.bed s3://gnomad-public-us-east-1/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz > annotation.vcf
tabix -hR region_flL.bed s3://gnomad-public-us-east-1/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz > annotation_flL.vcf
tabix -hR region_flR.bed s3://gnomad-public-us-east-1/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz > annotation_flR.vcf
echo "Number of matches for single-line bed:"
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation.vcf | wc -l
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation.vcf
echo "Number of matches including left-flanking disjoint region: "
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation_flL.vcf | wc -l
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation_flL.vcf
echo "Number of matches including right-flanking disjoint region -- one extra: "
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation_flR.vcf | wc -l
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation_flR.vcf

Output:

[jwimberley@ip-172-30-1-233 tabix_repro]$ ./test.sh
Number of matches for single-line bed:
1
chr1	1007743	rs1489042997	ATTCTTTTTTTTTTTTTTT	A	160718.00	PASS	AC=246;AN=122076;AF=2.01514e-03;variant_type=indel;n_alt_alleles=1;ReadPosRankSum=2.97000e-01;MQRankSum=2.65000e-01;RAW_MQ=3.59036e+07;DP=9561;MQ_DP=11066;VarDP=9561;MQ=5.69605e+01;QD=1.68097e+01;FS=9.16326e-01;SB=22926,20387,2389,2450;InbreedingCoeff=2.83648e-02;AS_VQSLOD=8.79900e-01;NEGATIVE_TRAIN_SITE;culprit=AS_FS;SOR=6.11000e-01;AC_asj_female=0;AN_asj_female=1630;AF_asj_female=0.00000e+00;nhomalt_asj_female=0;AC_eas_female=0;AN_eas_female=1304;AF_eas_female=0.00000e+00;nhomalt_eas_female=0;AC_afr_male=123;AN_afr_male=15038;AF_afr_male=8.17928e-03;nhomalt_afr_male=1;AC_female=120;AN_female=64330;AF_female=1.86538e-03;nhomalt_female=1;AC_fin_male=0;AN_fin_male=5398;AF_fin_male=0.00000e+00;nhomalt_fin_male=0;AC_oth_female=0;AN_oth_female=972;AF_oth_female=0.00000e+00;nhomalt_oth_female=0;AC_ami=0;AN_ami=846;AF_ami=0.00000e+00;nhomalt_ami=0;AC_oth=0;AN_oth=1824;AF_oth=0.00000e+00;nhomalt_oth=0;AC_male=126;AN_male=57746;AF_male=2.18197e-03;nhomalt_male=1;AC_ami_female=0;AN_ami_female=442;AF_ami_female=0.00000e+00;nhomalt_ami_female=0;AC_afr=240;AN_afr=32686;AF_afr=7.34259e-03;nhomalt_afr=2;AC_eas_male=0;AN_eas_male=1464;AF_eas_male=0.00000e+00;nhomalt_eas_male=0;AC_sas=0;AN_sas=2656;AF_sas=0.00000e+00;nhomalt_sas=0;AC_nfe_female=0;AN_nfe_female=35286;AF_nfe_female=0.00000e+00;nhomalt_nfe_female=0;AC_asj_male=0;AN_asj_male=1468;AF_asj_male=0.00000e+00;nhomalt_asj_male=0;AC_raw=264;AN_raw=132340;AF_raw=1.99486e-03;nhomalt_raw=4;AC_oth_male=0;AN_oth_male=852;AF_oth_male=0.00000e+00;nhomalt_oth_male=0;AC_nfe_male=0;AN_nfe_male=25084;AF_nfe_male=0.00000e+00;nhomalt_nfe_male=0;AC_asj=0;AN_asj=3098;AF_asj=0.00000e+00;nhomalt_asj=0;AC_amr_male=3;AN_amr_male=5862;AF_amr_male=5.11771e-04;nhomalt_amr_male=0;nhomalt=2;AC_amr_female=3;AN_amr_female=4822;AF_amr_female=6.22148e-04;nhomalt_amr_female=0;AC_sas_female=0;AN_sas_female=480;AF_sas_female=0.00000e+00;nhomalt_sas_female=0;AC_fin=0;AN_fin=7144;AF_fin=0.00000e+00;nhomalt_fin=0;AC_afr_female=117;AN_afr_female=17648;AF_afr_female=6.62965e-03;nhomalt_afr_female=1;AC_sas_male=0;AN_sas_male=2176;AF_sas_male=0.00000e+00;nhomalt_sas_male=0;AC_amr=6;AN_amr=10684;AF_amr=5.61587e-04;nhomalt_amr=0;AC_nfe=0;AN_nfe=60370;AF_nfe=0.00000e+00;nhomalt_nfe=0;AC_eas=0;AN_eas=2768;AF_eas=0.00000e+00;nhomalt_eas=0;AC_ami_male=0;AN_ami_male=404;AF_ami_male=0.00000e+00;nhomalt_ami_male=0;AC_fin_female=0;AN_fin_female=1746;AF_fin_female=0.00000e+00;nhomalt_fin_female=0;faf95_afr=6.58052e-03;faf99_afr=6.58026e-03;faf95_sas=0.00000e+00;faf99_sas=0.00000e+00;faf95_adj=1.80839e-03;faf99_adj=1.80769e-03;faf95_amr=2.43960e-04;faf99_amr=2.43860e-04;faf95_nfe=0.00000e+00;faf99_nfe=0.00000e+00;faf95_eas=0.00000e+00;faf99_eas=0.00000e+00;vep=-|downstream_gene_variant|MODIFIER|AL645608.3|ENSG00000231702|Transcript|ENST00000423619|processed_pseudogene|||,-|downstream_gene_variant|MODIFIER|AL645608.1|ENSG00000224969|Transcript|ENST00000458555|antisense|||,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding||1/2|,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding||1/2|
Number of matches including left-flanking disjoint region: 
1
chr1	1007743	rs1489042997	ATTCTTTTTTTTTTTTTTT	A	160718.00	PASS	AC=246;AN=122076;AF=2.01514e-03;variant_type=indel;n_alt_alleles=1;ReadPosRankSum=2.97000e-01;MQRankSum=2.65000e-01;RAW_MQ=3.59036e+07;DP=9561;MQ_DP=11066;VarDP=9561;MQ=5.69605e+01;QD=1.68097e+01;FS=9.16326e-01;SB=22926,20387,2389,2450;InbreedingCoeff=2.83648e-02;AS_VQSLOD=8.79900e-01;NEGATIVE_TRAIN_SITE;culprit=AS_FS;SOR=6.11000e-01;AC_asj_female=0;AN_asj_female=1630;AF_asj_female=0.00000e+00;nhomalt_asj_female=0;AC_eas_female=0;AN_eas_female=1304;AF_eas_female=0.00000e+00;nhomalt_eas_female=0;AC_afr_male=123;AN_afr_male=15038;AF_afr_male=8.17928e-03;nhomalt_afr_male=1;AC_female=120;AN_female=64330;AF_female=1.86538e-03;nhomalt_female=1;AC_fin_male=0;AN_fin_male=5398;AF_fin_male=0.00000e+00;nhomalt_fin_male=0;AC_oth_female=0;AN_oth_female=972;AF_oth_female=0.00000e+00;nhomalt_oth_female=0;AC_ami=0;AN_ami=846;AF_ami=0.00000e+00;nhomalt_ami=0;AC_oth=0;AN_oth=1824;AF_oth=0.00000e+00;nhomalt_oth=0;AC_male=126;AN_male=57746;AF_male=2.18197e-03;nhomalt_male=1;AC_ami_female=0;AN_ami_female=442;AF_ami_female=0.00000e+00;nhomalt_ami_female=0;AC_afr=240;AN_afr=32686;AF_afr=7.34259e-03;nhomalt_afr=2;AC_eas_male=0;AN_eas_male=1464;AF_eas_male=0.00000e+00;nhomalt_eas_male=0;AC_sas=0;AN_sas=2656;AF_sas=0.00000e+00;nhomalt_sas=0;AC_nfe_female=0;AN_nfe_female=35286;AF_nfe_female=0.00000e+00;nhomalt_nfe_female=0;AC_asj_male=0;AN_asj_male=1468;AF_asj_male=0.00000e+00;nhomalt_asj_male=0;AC_raw=264;AN_raw=132340;AF_raw=1.99486e-03;nhomalt_raw=4;AC_oth_male=0;AN_oth_male=852;AF_oth_male=0.00000e+00;nhomalt_oth_male=0;AC_nfe_male=0;AN_nfe_male=25084;AF_nfe_male=0.00000e+00;nhomalt_nfe_male=0;AC_asj=0;AN_asj=3098;AF_asj=0.00000e+00;nhomalt_asj=0;AC_amr_male=3;AN_amr_male=5862;AF_amr_male=5.11771e-04;nhomalt_amr_male=0;nhomalt=2;AC_amr_female=3;AN_amr_female=4822;AF_amr_female=6.22148e-04;nhomalt_amr_female=0;AC_sas_female=0;AN_sas_female=480;AF_sas_female=0.00000e+00;nhomalt_sas_female=0;AC_fin=0;AN_fin=7144;AF_fin=0.00000e+00;nhomalt_fin=0;AC_afr_female=117;AN_afr_female=17648;AF_afr_female=6.62965e-03;nhomalt_afr_female=1;AC_sas_male=0;AN_sas_male=2176;AF_sas_male=0.00000e+00;nhomalt_sas_male=0;AC_amr=6;AN_amr=10684;AF_amr=5.61587e-04;nhomalt_amr=0;AC_nfe=0;AN_nfe=60370;AF_nfe=0.00000e+00;nhomalt_nfe=0;AC_eas=0;AN_eas=2768;AF_eas=0.00000e+00;nhomalt_eas=0;AC_ami_male=0;AN_ami_male=404;AF_ami_male=0.00000e+00;nhomalt_ami_male=0;AC_fin_female=0;AN_fin_female=1746;AF_fin_female=0.00000e+00;nhomalt_fin_female=0;faf95_afr=6.58052e-03;faf99_afr=6.58026e-03;faf95_sas=0.00000e+00;faf99_sas=0.00000e+00;faf95_adj=1.80839e-03;faf99_adj=1.80769e-03;faf95_amr=2.43960e-04;faf99_amr=2.43860e-04;faf95_nfe=0.00000e+00;faf99_nfe=0.00000e+00;faf95_eas=0.00000e+00;faf99_eas=0.00000e+00;vep=-|downstream_gene_variant|MODIFIER|AL645608.3|ENSG00000231702|Transcript|ENST00000423619|processed_pseudogene|||,-|downstream_gene_variant|MODIFIER|AL645608.1|ENSG00000224969|Transcript|ENST00000458555|antisense|||,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding||1/2|,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding||1/2|
Number of matches including right-flanking disjoint region -- one extra: 
2
chr1	1007743	rs1489042997	ATTCTTTTTTTTTTTTTTT	A	160718.00	PASS	AC=246;AN=122076;AF=2.01514e-03;variant_type=indel;n_alt_alleles=1;ReadPosRankSum=2.97000e-01;MQRankSum=2.65000e-01;RAW_MQ=3.59036e+07;DP=9561;MQ_DP=11066;VarDP=9561;MQ=5.69605e+01;QD=1.68097e+01;FS=9.16326e-01;SB=22926,20387,2389,2450;InbreedingCoeff=2.83648e-02;AS_VQSLOD=8.79900e-01;NEGATIVE_TRAIN_SITE;culprit=AS_FS;SOR=6.11000e-01;AC_asj_female=0;AN_asj_female=1630;AF_asj_female=0.00000e+00;nhomalt_asj_female=0;AC_eas_female=0;AN_eas_female=1304;AF_eas_female=0.00000e+00;nhomalt_eas_female=0;AC_afr_male=123;AN_afr_male=15038;AF_afr_male=8.17928e-03;nhomalt_afr_male=1;AC_female=120;AN_female=64330;AF_female=1.86538e-03;nhomalt_female=1;AC_fin_male=0;AN_fin_male=5398;AF_fin_male=0.00000e+00;nhomalt_fin_male=0;AC_oth_female=0;AN_oth_female=972;AF_oth_female=0.00000e+00;nhomalt_oth_female=0;AC_ami=0;AN_ami=846;AF_ami=0.00000e+00;nhomalt_ami=0;AC_oth=0;AN_oth=1824;AF_oth=0.00000e+00;nhomalt_oth=0;AC_male=126;AN_male=57746;AF_male=2.18197e-03;nhomalt_male=1;AC_ami_female=0;AN_ami_female=442;AF_ami_female=0.00000e+00;nhomalt_ami_female=0;AC_afr=240;AN_afr=32686;AF_afr=7.34259e-03;nhomalt_afr=2;AC_eas_male=0;AN_eas_male=1464;AF_eas_male=0.00000e+00;nhomalt_eas_male=0;AC_sas=0;AN_sas=2656;AF_sas=0.00000e+00;nhomalt_sas=0;AC_nfe_female=0;AN_nfe_female=35286;AF_nfe_female=0.00000e+00;nhomalt_nfe_female=0;AC_asj_male=0;AN_asj_male=1468;AF_asj_male=0.00000e+00;nhomalt_asj_male=0;AC_raw=264;AN_raw=132340;AF_raw=1.99486e-03;nhomalt_raw=4;AC_oth_male=0;AN_oth_male=852;AF_oth_male=0.00000e+00;nhomalt_oth_male=0;AC_nfe_male=0;AN_nfe_male=25084;AF_nfe_male=0.00000e+00;nhomalt_nfe_male=0;AC_asj=0;AN_asj=3098;AF_asj=0.00000e+00;nhomalt_asj=0;AC_amr_male=3;AN_amr_male=5862;AF_amr_male=5.11771e-04;nhomalt_amr_male=0;nhomalt=2;AC_amr_female=3;AN_amr_female=4822;AF_amr_female=6.22148e-04;nhomalt_amr_female=0;AC_sas_female=0;AN_sas_female=480;AF_sas_female=0.00000e+00;nhomalt_sas_female=0;AC_fin=0;AN_fin=7144;AF_fin=0.00000e+00;nhomalt_fin=0;AC_afr_female=117;AN_afr_female=17648;AF_afr_female=6.62965e-03;nhomalt_afr_female=1;AC_sas_male=0;AN_sas_male=2176;AF_sas_male=0.00000e+00;nhomalt_sas_male=0;AC_amr=6;AN_amr=10684;AF_amr=5.61587e-04;nhomalt_amr=0;AC_nfe=0;AN_nfe=60370;AF_nfe=0.00000e+00;nhomalt_nfe=0;AC_eas=0;AN_eas=2768;AF_eas=0.00000e+00;nhomalt_eas=0;AC_ami_male=0;AN_ami_male=404;AF_ami_male=0.00000e+00;nhomalt_ami_male=0;AC_fin_female=0;AN_fin_female=1746;AF_fin_female=0.00000e+00;nhomalt_fin_female=0;faf95_afr=6.58052e-03;faf99_afr=6.58026e-03;faf95_sas=0.00000e+00;faf99_sas=0.00000e+00;faf95_adj=1.80839e-03;faf99_adj=1.80769e-03;faf95_amr=2.43960e-04;faf99_amr=2.43860e-04;faf95_nfe=0.00000e+00;faf99_nfe=0.00000e+00;faf95_eas=0.00000e+00;faf99_eas=0.00000e+00;vep=-|downstream_gene_variant|MODIFIER|AL645608.3|ENSG00000231702|Transcript|ENST00000423619|processed_pseudogene|||,-|downstream_gene_variant|MODIFIER|AL645608.1|ENSG00000224969|Transcript|ENST00000458555|antisense|||,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding||1/2|,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding||1/2|
chr1	1007743	rs1489042997	ATTCTTTTTTTTTTTTTTT	A	160718.00	PASS	AC=246;AN=122076;AF=2.01514e-03;variant_type=indel;n_alt_alleles=1;ReadPosRankSum=2.97000e-01;MQRankSum=2.65000e-01;RAW_MQ=3.59036e+07;DP=9561;MQ_DP=11066;VarDP=9561;MQ=5.69605e+01;QD=1.68097e+01;FS=9.16326e-01;SB=22926,20387,2389,2450;InbreedingCoeff=2.83648e-02;AS_VQSLOD=8.79900e-01;NEGATIVE_TRAIN_SITE;culprit=AS_FS;SOR=6.11000e-01;AC_asj_female=0;AN_asj_female=1630;AF_asj_female=0.00000e+00;nhomalt_asj_female=0;AC_eas_female=0;AN_eas_female=1304;AF_eas_female=0.00000e+00;nhomalt_eas_female=0;AC_afr_male=123;AN_afr_male=15038;AF_afr_male=8.17928e-03;nhomalt_afr_male=1;AC_female=120;AN_female=64330;AF_female=1.86538e-03;nhomalt_female=1;AC_fin_male=0;AN_fin_male=5398;AF_fin_male=0.00000e+00;nhomalt_fin_male=0;AC_oth_female=0;AN_oth_female=972;AF_oth_female=0.00000e+00;nhomalt_oth_female=0;AC_ami=0;AN_ami=846;AF_ami=0.00000e+00;nhomalt_ami=0;AC_oth=0;AN_oth=1824;AF_oth=0.00000e+00;nhomalt_oth=0;AC_male=126;AN_male=57746;AF_male=2.18197e-03;nhomalt_male=1;AC_ami_female=0;AN_ami_female=442;AF_ami_female=0.00000e+00;nhomalt_ami_female=0;AC_afr=240;AN_afr=32686;AF_afr=7.34259e-03;nhomalt_afr=2;AC_eas_male=0;AN_eas_male=1464;AF_eas_male=0.00000e+00;nhomalt_eas_male=0;AC_sas=0;AN_sas=2656;AF_sas=0.00000e+00;nhomalt_sas=0;AC_nfe_female=0;AN_nfe_female=35286;AF_nfe_female=0.00000e+00;nhomalt_nfe_female=0;AC_asj_male=0;AN_asj_male=1468;AF_asj_male=0.00000e+00;nhomalt_asj_male=0;AC_raw=264;AN_raw=132340;AF_raw=1.99486e-03;nhomalt_raw=4;AC_oth_male=0;AN_oth_male=852;AF_oth_male=0.00000e+00;nhomalt_oth_male=0;AC_nfe_male=0;AN_nfe_male=25084;AF_nfe_male=0.00000e+00;nhomalt_nfe_male=0;AC_asj=0;AN_asj=3098;AF_asj=0.00000e+00;nhomalt_asj=0;AC_amr_male=3;AN_amr_male=5862;AF_amr_male=5.11771e-04;nhomalt_amr_male=0;nhomalt=2;AC_amr_female=3;AN_amr_female=4822;AF_amr_female=6.22148e-04;nhomalt_amr_female=0;AC_sas_female=0;AN_sas_female=480;AF_sas_female=0.00000e+00;nhomalt_sas_female=0;AC_fin=0;AN_fin=7144;AF_fin=0.00000e+00;nhomalt_fin=0;AC_afr_female=117;AN_afr_female=17648;AF_afr_female=6.62965e-03;nhomalt_afr_female=1;AC_sas_male=0;AN_sas_male=2176;AF_sas_male=0.00000e+00;nhomalt_sas_male=0;AC_amr=6;AN_amr=10684;AF_amr=5.61587e-04;nhomalt_amr=0;AC_nfe=0;AN_nfe=60370;AF_nfe=0.00000e+00;nhomalt_nfe=0;AC_eas=0;AN_eas=2768;AF_eas=0.00000e+00;nhomalt_eas=0;AC_ami_male=0;AN_ami_male=404;AF_ami_male=0.00000e+00;nhomalt_ami_male=0;AC_fin_female=0;AN_fin_female=1746;AF_fin_female=0.00000e+00;nhomalt_fin_female=0;faf95_afr=6.58052e-03;faf99_afr=6.58026e-03;faf95_sas=0.00000e+00;faf99_sas=0.00000e+00;faf95_adj=1.80839e-03;faf99_adj=1.80769e-03;faf95_amr=2.43960e-04;faf99_amr=2.43860e-04;faf95_nfe=0.00000e+00;faf99_nfe=0.00000e+00;faf95_eas=0.00000e+00;faf99_eas=0.00000e+00;vep=-|downstream_gene_variant|MODIFIER|AL645608.3|ENSG00000231702|Transcript|ENST00000423619|processed_pseudogene|||,-|downstream_gene_variant|MODIFIER|AL645608.1|ENSG00000224969|Transcript|ENST00000458555|antisense|||,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding||1/2|,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding||1/2|

jwimberl avatar Jul 11 '22 16:07 jwimberl

It's been pointed out to me immediately that this is expected behavior as the length of the reference means that it overlaps the right flanking region in the third BED file, as opposed to my assumption that the VCF indexing would be strictly on the indicated starting position of the variant.

jwimberl avatar Jul 11 '22 16:07 jwimberl

Correct, VCF indices, like BAM, have overlap calculations rather than point sources, so you can find all variants that overlap a region.

In bcftools, if you use a combination of a broad region (-R, which uses indices) and a target list (-T, which is a streaming filter) then the duplicates will be removed. Bcftools also has a --regions-overlap options to control pos vs overlap decisions, but I've never used it so I don't know how appropriate it is.

Edit: just remembered this is tabix, which likely doesn't have the same options, but bcftools will - obviously - also view VCF files. :-)

jkbonfield avatar Jul 12 '22 08:07 jkbonfield

I'm not really clear on whether multi-region iterators work in general or just for reads in BAM/CRAM files — e.g., there's an hts_itr_regions() and a sam_itr_regions() but no bcf_itr_regions() or tbx_itr_regions() and bcftools doesn't appear to use anything of the sort.

Anyway, if they do work in general, then this issue would be solved by tabix having an option to use multi-region iterators. And if they don't, then perhaps one day they should. :smile:

jmarshall avatar Jul 12 '22 09:07 jmarshall

Thank you both for your helpful suggestions -- using bcftools in place of tabix is an option for me, and I've confirmed that specifying the BED format regions file together with --regions-overlap record is equivalent to the running the tabix query and removing duplicates.

jwimberl avatar Jul 12 '22 15:07 jwimberl