htslib
htslib copied to clipboard
tabix returns row from VCF file multiple times
Using tabix version 1.9 on CentOS 7
I am querying a public indexed dataset (gnomad annotations from the Broad institute) using tabix together with a regions file in BED format, and found that that in some cases a matching row may be returned multiple time. Two cases in which no duplicates of a particular annotation at position 1007743 are returned are when the BED file contains a single region (region.bed
)
chr1 1007743 1007748
or a disjoint region on the left (region_flL.bed
):
chr1 1007736 1007739
chr1 1007743 1007748
However, adding a disjoint region on the right leads to the issue (region_flR.bed
):
chr1 1007743 1007748
chr1 1007760 1007761
The following script reproduces the issue (assuming AWS credentials have been loaded into the appropriate environment variables):
#!/bin/bash
tabix -hR region.bed s3://gnomad-public-us-east-1/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz > annotation.vcf
tabix -hR region_flL.bed s3://gnomad-public-us-east-1/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz > annotation_flL.vcf
tabix -hR region_flR.bed s3://gnomad-public-us-east-1/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz > annotation_flR.vcf
echo "Number of matches for single-line bed:"
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation.vcf | wc -l
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation.vcf
echo "Number of matches including left-flanking disjoint region: "
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation_flL.vcf | wc -l
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation_flL.vcf
echo "Number of matches including right-flanking disjoint region -- one extra: "
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation_flR.vcf | wc -l
grep "1007743.*[^A-Z]ATTCTTTTTTTTTTTTTTT[^A-Z]" annotation_flR.vcf
Output:
[jwimberley@ip-172-30-1-233 tabix_repro]$ ./test.sh
Number of matches for single-line bed:
1
chr1 1007743 rs1489042997 ATTCTTTTTTTTTTTTTTT A 160718.00 PASS AC=246;AN=122076;AF=2.01514e-03;variant_type=indel;n_alt_alleles=1;ReadPosRankSum=2.97000e-01;MQRankSum=2.65000e-01;RAW_MQ=3.59036e+07;DP=9561;MQ_DP=11066;VarDP=9561;MQ=5.69605e+01;QD=1.68097e+01;FS=9.16326e-01;SB=22926,20387,2389,2450;InbreedingCoeff=2.83648e-02;AS_VQSLOD=8.79900e-01;NEGATIVE_TRAIN_SITE;culprit=AS_FS;SOR=6.11000e-01;AC_asj_female=0;AN_asj_female=1630;AF_asj_female=0.00000e+00;nhomalt_asj_female=0;AC_eas_female=0;AN_eas_female=1304;AF_eas_female=0.00000e+00;nhomalt_eas_female=0;AC_afr_male=123;AN_afr_male=15038;AF_afr_male=8.17928e-03;nhomalt_afr_male=1;AC_female=120;AN_female=64330;AF_female=1.86538e-03;nhomalt_female=1;AC_fin_male=0;AN_fin_male=5398;AF_fin_male=0.00000e+00;nhomalt_fin_male=0;AC_oth_female=0;AN_oth_female=972;AF_oth_female=0.00000e+00;nhomalt_oth_female=0;AC_ami=0;AN_ami=846;AF_ami=0.00000e+00;nhomalt_ami=0;AC_oth=0;AN_oth=1824;AF_oth=0.00000e+00;nhomalt_oth=0;AC_male=126;AN_male=57746;AF_male=2.18197e-03;nhomalt_male=1;AC_ami_female=0;AN_ami_female=442;AF_ami_female=0.00000e+00;nhomalt_ami_female=0;AC_afr=240;AN_afr=32686;AF_afr=7.34259e-03;nhomalt_afr=2;AC_eas_male=0;AN_eas_male=1464;AF_eas_male=0.00000e+00;nhomalt_eas_male=0;AC_sas=0;AN_sas=2656;AF_sas=0.00000e+00;nhomalt_sas=0;AC_nfe_female=0;AN_nfe_female=35286;AF_nfe_female=0.00000e+00;nhomalt_nfe_female=0;AC_asj_male=0;AN_asj_male=1468;AF_asj_male=0.00000e+00;nhomalt_asj_male=0;AC_raw=264;AN_raw=132340;AF_raw=1.99486e-03;nhomalt_raw=4;AC_oth_male=0;AN_oth_male=852;AF_oth_male=0.00000e+00;nhomalt_oth_male=0;AC_nfe_male=0;AN_nfe_male=25084;AF_nfe_male=0.00000e+00;nhomalt_nfe_male=0;AC_asj=0;AN_asj=3098;AF_asj=0.00000e+00;nhomalt_asj=0;AC_amr_male=3;AN_amr_male=5862;AF_amr_male=5.11771e-04;nhomalt_amr_male=0;nhomalt=2;AC_amr_female=3;AN_amr_female=4822;AF_amr_female=6.22148e-04;nhomalt_amr_female=0;AC_sas_female=0;AN_sas_female=480;AF_sas_female=0.00000e+00;nhomalt_sas_female=0;AC_fin=0;AN_fin=7144;AF_fin=0.00000e+00;nhomalt_fin=0;AC_afr_female=117;AN_afr_female=17648;AF_afr_female=6.62965e-03;nhomalt_afr_female=1;AC_sas_male=0;AN_sas_male=2176;AF_sas_male=0.00000e+00;nhomalt_sas_male=0;AC_amr=6;AN_amr=10684;AF_amr=5.61587e-04;nhomalt_amr=0;AC_nfe=0;AN_nfe=60370;AF_nfe=0.00000e+00;nhomalt_nfe=0;AC_eas=0;AN_eas=2768;AF_eas=0.00000e+00;nhomalt_eas=0;AC_ami_male=0;AN_ami_male=404;AF_ami_male=0.00000e+00;nhomalt_ami_male=0;AC_fin_female=0;AN_fin_female=1746;AF_fin_female=0.00000e+00;nhomalt_fin_female=0;faf95_afr=6.58052e-03;faf99_afr=6.58026e-03;faf95_sas=0.00000e+00;faf99_sas=0.00000e+00;faf95_adj=1.80839e-03;faf99_adj=1.80769e-03;faf95_amr=2.43960e-04;faf99_amr=2.43860e-04;faf95_nfe=0.00000e+00;faf99_nfe=0.00000e+00;faf95_eas=0.00000e+00;faf99_eas=0.00000e+00;vep=-|downstream_gene_variant|MODIFIER|AL645608.3|ENSG00000231702|Transcript|ENST00000423619|processed_pseudogene|||,-|downstream_gene_variant|MODIFIER|AL645608.1|ENSG00000224969|Transcript|ENST00000458555|antisense|||,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding||1/2|,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding||1/2|
Number of matches including left-flanking disjoint region:
1
chr1 1007743 rs1489042997 ATTCTTTTTTTTTTTTTTT A 160718.00 PASS AC=246;AN=122076;AF=2.01514e-03;variant_type=indel;n_alt_alleles=1;ReadPosRankSum=2.97000e-01;MQRankSum=2.65000e-01;RAW_MQ=3.59036e+07;DP=9561;MQ_DP=11066;VarDP=9561;MQ=5.69605e+01;QD=1.68097e+01;FS=9.16326e-01;SB=22926,20387,2389,2450;InbreedingCoeff=2.83648e-02;AS_VQSLOD=8.79900e-01;NEGATIVE_TRAIN_SITE;culprit=AS_FS;SOR=6.11000e-01;AC_asj_female=0;AN_asj_female=1630;AF_asj_female=0.00000e+00;nhomalt_asj_female=0;AC_eas_female=0;AN_eas_female=1304;AF_eas_female=0.00000e+00;nhomalt_eas_female=0;AC_afr_male=123;AN_afr_male=15038;AF_afr_male=8.17928e-03;nhomalt_afr_male=1;AC_female=120;AN_female=64330;AF_female=1.86538e-03;nhomalt_female=1;AC_fin_male=0;AN_fin_male=5398;AF_fin_male=0.00000e+00;nhomalt_fin_male=0;AC_oth_female=0;AN_oth_female=972;AF_oth_female=0.00000e+00;nhomalt_oth_female=0;AC_ami=0;AN_ami=846;AF_ami=0.00000e+00;nhomalt_ami=0;AC_oth=0;AN_oth=1824;AF_oth=0.00000e+00;nhomalt_oth=0;AC_male=126;AN_male=57746;AF_male=2.18197e-03;nhomalt_male=1;AC_ami_female=0;AN_ami_female=442;AF_ami_female=0.00000e+00;nhomalt_ami_female=0;AC_afr=240;AN_afr=32686;AF_afr=7.34259e-03;nhomalt_afr=2;AC_eas_male=0;AN_eas_male=1464;AF_eas_male=0.00000e+00;nhomalt_eas_male=0;AC_sas=0;AN_sas=2656;AF_sas=0.00000e+00;nhomalt_sas=0;AC_nfe_female=0;AN_nfe_female=35286;AF_nfe_female=0.00000e+00;nhomalt_nfe_female=0;AC_asj_male=0;AN_asj_male=1468;AF_asj_male=0.00000e+00;nhomalt_asj_male=0;AC_raw=264;AN_raw=132340;AF_raw=1.99486e-03;nhomalt_raw=4;AC_oth_male=0;AN_oth_male=852;AF_oth_male=0.00000e+00;nhomalt_oth_male=0;AC_nfe_male=0;AN_nfe_male=25084;AF_nfe_male=0.00000e+00;nhomalt_nfe_male=0;AC_asj=0;AN_asj=3098;AF_asj=0.00000e+00;nhomalt_asj=0;AC_amr_male=3;AN_amr_male=5862;AF_amr_male=5.11771e-04;nhomalt_amr_male=0;nhomalt=2;AC_amr_female=3;AN_amr_female=4822;AF_amr_female=6.22148e-04;nhomalt_amr_female=0;AC_sas_female=0;AN_sas_female=480;AF_sas_female=0.00000e+00;nhomalt_sas_female=0;AC_fin=0;AN_fin=7144;AF_fin=0.00000e+00;nhomalt_fin=0;AC_afr_female=117;AN_afr_female=17648;AF_afr_female=6.62965e-03;nhomalt_afr_female=1;AC_sas_male=0;AN_sas_male=2176;AF_sas_male=0.00000e+00;nhomalt_sas_male=0;AC_amr=6;AN_amr=10684;AF_amr=5.61587e-04;nhomalt_amr=0;AC_nfe=0;AN_nfe=60370;AF_nfe=0.00000e+00;nhomalt_nfe=0;AC_eas=0;AN_eas=2768;AF_eas=0.00000e+00;nhomalt_eas=0;AC_ami_male=0;AN_ami_male=404;AF_ami_male=0.00000e+00;nhomalt_ami_male=0;AC_fin_female=0;AN_fin_female=1746;AF_fin_female=0.00000e+00;nhomalt_fin_female=0;faf95_afr=6.58052e-03;faf99_afr=6.58026e-03;faf95_sas=0.00000e+00;faf99_sas=0.00000e+00;faf95_adj=1.80839e-03;faf99_adj=1.80769e-03;faf95_amr=2.43960e-04;faf99_amr=2.43860e-04;faf95_nfe=0.00000e+00;faf99_nfe=0.00000e+00;faf95_eas=0.00000e+00;faf99_eas=0.00000e+00;vep=-|downstream_gene_variant|MODIFIER|AL645608.3|ENSG00000231702|Transcript|ENST00000423619|processed_pseudogene|||,-|downstream_gene_variant|MODIFIER|AL645608.1|ENSG00000224969|Transcript|ENST00000458555|antisense|||,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding||1/2|,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding||1/2|
Number of matches including right-flanking disjoint region -- one extra:
2
chr1 1007743 rs1489042997 ATTCTTTTTTTTTTTTTTT A 160718.00 PASS AC=246;AN=122076;AF=2.01514e-03;variant_type=indel;n_alt_alleles=1;ReadPosRankSum=2.97000e-01;MQRankSum=2.65000e-01;RAW_MQ=3.59036e+07;DP=9561;MQ_DP=11066;VarDP=9561;MQ=5.69605e+01;QD=1.68097e+01;FS=9.16326e-01;SB=22926,20387,2389,2450;InbreedingCoeff=2.83648e-02;AS_VQSLOD=8.79900e-01;NEGATIVE_TRAIN_SITE;culprit=AS_FS;SOR=6.11000e-01;AC_asj_female=0;AN_asj_female=1630;AF_asj_female=0.00000e+00;nhomalt_asj_female=0;AC_eas_female=0;AN_eas_female=1304;AF_eas_female=0.00000e+00;nhomalt_eas_female=0;AC_afr_male=123;AN_afr_male=15038;AF_afr_male=8.17928e-03;nhomalt_afr_male=1;AC_female=120;AN_female=64330;AF_female=1.86538e-03;nhomalt_female=1;AC_fin_male=0;AN_fin_male=5398;AF_fin_male=0.00000e+00;nhomalt_fin_male=0;AC_oth_female=0;AN_oth_female=972;AF_oth_female=0.00000e+00;nhomalt_oth_female=0;AC_ami=0;AN_ami=846;AF_ami=0.00000e+00;nhomalt_ami=0;AC_oth=0;AN_oth=1824;AF_oth=0.00000e+00;nhomalt_oth=0;AC_male=126;AN_male=57746;AF_male=2.18197e-03;nhomalt_male=1;AC_ami_female=0;AN_ami_female=442;AF_ami_female=0.00000e+00;nhomalt_ami_female=0;AC_afr=240;AN_afr=32686;AF_afr=7.34259e-03;nhomalt_afr=2;AC_eas_male=0;AN_eas_male=1464;AF_eas_male=0.00000e+00;nhomalt_eas_male=0;AC_sas=0;AN_sas=2656;AF_sas=0.00000e+00;nhomalt_sas=0;AC_nfe_female=0;AN_nfe_female=35286;AF_nfe_female=0.00000e+00;nhomalt_nfe_female=0;AC_asj_male=0;AN_asj_male=1468;AF_asj_male=0.00000e+00;nhomalt_asj_male=0;AC_raw=264;AN_raw=132340;AF_raw=1.99486e-03;nhomalt_raw=4;AC_oth_male=0;AN_oth_male=852;AF_oth_male=0.00000e+00;nhomalt_oth_male=0;AC_nfe_male=0;AN_nfe_male=25084;AF_nfe_male=0.00000e+00;nhomalt_nfe_male=0;AC_asj=0;AN_asj=3098;AF_asj=0.00000e+00;nhomalt_asj=0;AC_amr_male=3;AN_amr_male=5862;AF_amr_male=5.11771e-04;nhomalt_amr_male=0;nhomalt=2;AC_amr_female=3;AN_amr_female=4822;AF_amr_female=6.22148e-04;nhomalt_amr_female=0;AC_sas_female=0;AN_sas_female=480;AF_sas_female=0.00000e+00;nhomalt_sas_female=0;AC_fin=0;AN_fin=7144;AF_fin=0.00000e+00;nhomalt_fin=0;AC_afr_female=117;AN_afr_female=17648;AF_afr_female=6.62965e-03;nhomalt_afr_female=1;AC_sas_male=0;AN_sas_male=2176;AF_sas_male=0.00000e+00;nhomalt_sas_male=0;AC_amr=6;AN_amr=10684;AF_amr=5.61587e-04;nhomalt_amr=0;AC_nfe=0;AN_nfe=60370;AF_nfe=0.00000e+00;nhomalt_nfe=0;AC_eas=0;AN_eas=2768;AF_eas=0.00000e+00;nhomalt_eas=0;AC_ami_male=0;AN_ami_male=404;AF_ami_male=0.00000e+00;nhomalt_ami_male=0;AC_fin_female=0;AN_fin_female=1746;AF_fin_female=0.00000e+00;nhomalt_fin_female=0;faf95_afr=6.58052e-03;faf99_afr=6.58026e-03;faf95_sas=0.00000e+00;faf99_sas=0.00000e+00;faf95_adj=1.80839e-03;faf99_adj=1.80769e-03;faf95_amr=2.43960e-04;faf99_amr=2.43860e-04;faf95_nfe=0.00000e+00;faf99_nfe=0.00000e+00;faf95_eas=0.00000e+00;faf99_eas=0.00000e+00;vep=-|downstream_gene_variant|MODIFIER|AL645608.3|ENSG00000231702|Transcript|ENST00000423619|processed_pseudogene|||,-|downstream_gene_variant|MODIFIER|AL645608.1|ENSG00000224969|Transcript|ENST00000458555|antisense|||,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding||1/2|,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding||1/2|
chr1 1007743 rs1489042997 ATTCTTTTTTTTTTTTTTT A 160718.00 PASS AC=246;AN=122076;AF=2.01514e-03;variant_type=indel;n_alt_alleles=1;ReadPosRankSum=2.97000e-01;MQRankSum=2.65000e-01;RAW_MQ=3.59036e+07;DP=9561;MQ_DP=11066;VarDP=9561;MQ=5.69605e+01;QD=1.68097e+01;FS=9.16326e-01;SB=22926,20387,2389,2450;InbreedingCoeff=2.83648e-02;AS_VQSLOD=8.79900e-01;NEGATIVE_TRAIN_SITE;culprit=AS_FS;SOR=6.11000e-01;AC_asj_female=0;AN_asj_female=1630;AF_asj_female=0.00000e+00;nhomalt_asj_female=0;AC_eas_female=0;AN_eas_female=1304;AF_eas_female=0.00000e+00;nhomalt_eas_female=0;AC_afr_male=123;AN_afr_male=15038;AF_afr_male=8.17928e-03;nhomalt_afr_male=1;AC_female=120;AN_female=64330;AF_female=1.86538e-03;nhomalt_female=1;AC_fin_male=0;AN_fin_male=5398;AF_fin_male=0.00000e+00;nhomalt_fin_male=0;AC_oth_female=0;AN_oth_female=972;AF_oth_female=0.00000e+00;nhomalt_oth_female=0;AC_ami=0;AN_ami=846;AF_ami=0.00000e+00;nhomalt_ami=0;AC_oth=0;AN_oth=1824;AF_oth=0.00000e+00;nhomalt_oth=0;AC_male=126;AN_male=57746;AF_male=2.18197e-03;nhomalt_male=1;AC_ami_female=0;AN_ami_female=442;AF_ami_female=0.00000e+00;nhomalt_ami_female=0;AC_afr=240;AN_afr=32686;AF_afr=7.34259e-03;nhomalt_afr=2;AC_eas_male=0;AN_eas_male=1464;AF_eas_male=0.00000e+00;nhomalt_eas_male=0;AC_sas=0;AN_sas=2656;AF_sas=0.00000e+00;nhomalt_sas=0;AC_nfe_female=0;AN_nfe_female=35286;AF_nfe_female=0.00000e+00;nhomalt_nfe_female=0;AC_asj_male=0;AN_asj_male=1468;AF_asj_male=0.00000e+00;nhomalt_asj_male=0;AC_raw=264;AN_raw=132340;AF_raw=1.99486e-03;nhomalt_raw=4;AC_oth_male=0;AN_oth_male=852;AF_oth_male=0.00000e+00;nhomalt_oth_male=0;AC_nfe_male=0;AN_nfe_male=25084;AF_nfe_male=0.00000e+00;nhomalt_nfe_male=0;AC_asj=0;AN_asj=3098;AF_asj=0.00000e+00;nhomalt_asj=0;AC_amr_male=3;AN_amr_male=5862;AF_amr_male=5.11771e-04;nhomalt_amr_male=0;nhomalt=2;AC_amr_female=3;AN_amr_female=4822;AF_amr_female=6.22148e-04;nhomalt_amr_female=0;AC_sas_female=0;AN_sas_female=480;AF_sas_female=0.00000e+00;nhomalt_sas_female=0;AC_fin=0;AN_fin=7144;AF_fin=0.00000e+00;nhomalt_fin=0;AC_afr_female=117;AN_afr_female=17648;AF_afr_female=6.62965e-03;nhomalt_afr_female=1;AC_sas_male=0;AN_sas_male=2176;AF_sas_male=0.00000e+00;nhomalt_sas_male=0;AC_amr=6;AN_amr=10684;AF_amr=5.61587e-04;nhomalt_amr=0;AC_nfe=0;AN_nfe=60370;AF_nfe=0.00000e+00;nhomalt_nfe=0;AC_eas=0;AN_eas=2768;AF_eas=0.00000e+00;nhomalt_eas=0;AC_ami_male=0;AN_ami_male=404;AF_ami_male=0.00000e+00;nhomalt_ami_male=0;AC_fin_female=0;AN_fin_female=1746;AF_fin_female=0.00000e+00;nhomalt_fin_female=0;faf95_afr=6.58052e-03;faf99_afr=6.58026e-03;faf95_sas=0.00000e+00;faf99_sas=0.00000e+00;faf95_adj=1.80839e-03;faf99_adj=1.80769e-03;faf95_amr=2.43960e-04;faf99_amr=2.43860e-04;faf95_nfe=0.00000e+00;faf99_nfe=0.00000e+00;faf95_eas=0.00000e+00;faf99_eas=0.00000e+00;vep=-|downstream_gene_variant|MODIFIER|AL645608.3|ENSG00000231702|Transcript|ENST00000423619|processed_pseudogene|||,-|downstream_gene_variant|MODIFIER|AL645608.1|ENSG00000224969|Transcript|ENST00000458555|antisense|||,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding||1/2|,-|intron_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding||1/2|
It's been pointed out to me immediately that this is expected behavior as the length of the reference means that it overlaps the right flanking region in the third BED file, as opposed to my assumption that the VCF indexing would be strictly on the indicated starting position of the variant.
Correct, VCF indices, like BAM, have overlap calculations rather than point sources, so you can find all variants that overlap a region.
In bcftools, if you use a combination of a broad region (-R
, which uses indices) and a target list (-T
, which is a streaming filter) then the duplicates will be removed. Bcftools also has a --regions-overlap
options to control pos vs overlap decisions, but I've never used it so I don't know how appropriate it is.
Edit: just remembered this is tabix, which likely doesn't have the same options, but bcftools will - obviously - also view VCF files. :-)
I'm not really clear on whether multi-region iterators work in general or just for reads in BAM/CRAM files — e.g., there's an hts_itr_regions()
and a sam_itr_regions()
but no bcf_itr_regions()
or tbx_itr_regions()
and bcftools doesn't appear to use anything of the sort.
Anyway, if they do work in general, then this issue would be solved by tabix
having an option to use multi-region iterators. And if they don't, then perhaps one day they should. :smile:
Thank you both for your helpful suggestions -- using bcftools in place of tabix is an option for me, and I've confirmed that specifying the BED format regions file together with --regions-overlap record
is equivalent to the running the tabix query and removing duplicates.