Exomiser
Exomiser copied to clipboard
target, candidate variants are filtered out during exomiser
Hi. This is Sungeun. Thank you for the great software!
While using exomiser, I realized that some variants are filtered out, even though I know that they pass the variant filters. For example, vcf rows with the following variants (rs146768859, rs141509078) are present in the input vcf files, and I know they are coding region, nonsynonymous SNPs, and their allele frequency is below 0.1% . (+ I didn't set any inheritance filters for exomiser analysis) But when I run the exomiser, they are not present in the output vcf.
I tried two approaches such:
- remove all variant filters in analysis yml file (except the AF filter)-> still they are filtered out
- subset input vcf to include regions near SNP (10kb ~ 100kb), and run exomiser on that smaller input vcf -> SNPs nearby are all together filtered out.
Do you have any idea why these variants are filtered out, and how I can solve this?
(HG 19, rs146768859) chr4 | 186456560 | . | G | A | 46780.9 | PASS | AC=2;AF=0.03;AN=2;AS_BaseQRankSum=6.671;AS_FS=1.396;AS_InbreedingCoeff=0.3761;AS_MQ=60;AS_MQRankSum=0;AS_QD=18.95;AS_ReadPosRankSum=0.857;AS_SOR=0.619;BaseQRankSum=3.22;DP=24846;ExcessHet=0.1353;FS=1.396;InbreedingCoeff=0.3761;MLEAC=5;MLEAF=0.03;MQ=60;MQRankSum=0;QD=18.95;ReadPosRankSum=0.095;SOR=0.619 | GT:AD:DP:GQ:PL | 1/1:0,862:.:99:25451,2579,0
(HG19, rs141509078) chr15 | 72191053 | . | T | G | 1002.01 | PASS | AC=2;AF=0.018;AN=2;AS_BaseQRankSum=-1.055;AS_FS=0;AS_InbreedingCoeff=0.6493;AS_MQ=60;AS_MQRankSum=0;AS_QD=24.44;AS_ReadPosRankSum=-1.068;AS_SOR=0.968;BaseQRankSum=-1.019;DP=2402;ExcessHet=0.0402;FS=0;InbreedingCoeff=0.6493;MLEAC=3;MLEAF=0.018;MQ=60;MQRankSum=0;QD=24.44;ReadPosRankSum=-0.957;SOR=0.968 | GT:AD:DP:GQ:PL | 1/1:0,27:.:81:778,81,0
(Example of my yml file)
analysis:
analysisMode: PASS_ONLY
frequencySources:
- THOUSAND_GENOMES
- TOPMED
- UK10K
- ESP_AFRICAN_AMERICAN
- ESP_EUROPEAN_AMERICAN
- ESP_ALL
- EXAC_AFRICAN_INC_AFRICAN_AMERICAN
- EXAC_AMERICAN
- EXAC_SOUTH_ASIAN
- EXAC_EAST_ASIAN
- EXAC_FINNISH
- EXAC_NON_FINNISH_EUROPEAN
- EXAC_OTHER
- GNOMAD_E_AFR
- GNOMAD_E_AMR
- GNOMAD_E_EAS
- GNOMAD_E_FIN
- GNOMAD_E_NFE
- GNOMAD_E_OTH
- GNOMAD_E_SAS
- GNOMAD_G_AFR
- GNOMAD_G_AMR
- GNOMAD_G_EAS
- GNOMAD_G_FIN
- GNOMAD_G_NFE
- GNOMAD_G_OTH
- GNOMAD_G_SAS
genomeAssembly: hg19
hpoIds:
- HP:0002515
- HP:0030234
- HP:0030319
- HP:0030192
- HP:0003693
- HP:0003326
- HP:0003560
inheritanceModes: {}
pathogenicitySources:
- POLYPHEN
- MUTATION_TASTER
- SIFT
ped: null
proband: null
steps:
#- variantEffectFilter:
#remove:
#- FIVE_PRIME_UTR_EXON_VARIANT
#- FIVE_PRIME_UTR_INTRON_VARIANT
#- THREE_PRIME_UTR_EXON_VARIANT
#- THREE_PRIME_UTR_INTRON_VARIANT
#- NON_CODING_TRANSCRIPT_EXON_VARIANT
#- UPSTREAM_GENE_VARIANT
#- INTERGENIC_VARIANT
#- REGULATORY_REGION_VARIANT
#- CODING_TRANSCRIPT_INTRON_VARIANT
#- NON_CODING_TRANSCRIPT_INTRON_VARIANT
#- DOWNSTREAM_GENE_VARIANT
- frequencyFilter:
maxFrequency: 1.0
- pathogenicityFilter:
keepNonPathogenic: true
- inheritanceFilter: {}
- omimPrioritiser: {}
- hiPhivePrioritiser: {}
vcf: /Volumes/exomiser/sample.NoAnno.indelMax70.cov5.chr4.10kb.vcf
outputOptions:
numGenes: 0
outputContributingVariantsOnly: false
outputFormats:
- HTML
- TSV_GENE
- TSV_VARIANT
- VCF
outputPrefix: results/v4/analysis_sample_10kb
When I set a less stringent maxFrequency filter (AF<10%), I realized that they were not filtered out, and their MAX_FREQUENCY values were annotated as rs146768859 - 1.2056% rs141509078 - 1.1757% which is much higher than that in gnomAD or 1KG.
Also, upon closer inspection, both variants had the same DBSNP_FREQ value of 0.0013980001 (in .variants.tsv).
I think there is some bug regarding to allele frequency annotation or filtering. How can we fix this?
Hi Sungeun,
For the first one it looks like the issue is either in the way EXAC calculated the EAS frequency or we have a parsing error when we incorporated it. @julesjacobsen should be able to confirm it is 1.2% in the EXAC legacy data we have used. The simple solution here will be to just remove the EXAC_* sources from your yml file given this is all in GnomAD nowadays.
For the second example can you see which source is generating the 1.1757% frequency. Would be interested to know but again removing it as a source may be a route forward.
Hi @damiansm Thank you for your prompt reply. According to the html file, the frequency source was
gnomAD_G_EAS: 1.1757%,
However when I checked it out on gnomAD browser (v3.1, Genomes), its allele frequency was 0.8478%

(link: https://gnomad.broadinstitute.org/variant/rs141509078?dataset=gnomad_r3)
Although the difference between 1.175% and 0.8478% is not so big, I am a little curious where this difference comes from. And for this case, my sample is from East Asians, and I'm not sure if its okay to remove gnomAD_G_EAS from frequency sources.
Again, thank you for the great software, and if available I'll be waiting for your reply!
Hi Sungeun,
Can you confirm which version of the Exomiser databases you are using so we are sure which version of GnomAD went into our build. I suspect the frequency has flipped between the 1% threshold between GnomAD releases.
Best wishes Damian
On Fri, Nov 13, 2020 at 8:28 AM Sungeun Hong [email protected] wrote:
Hi @damiansm https://github.com/damiansm Thank you for your prompt reply. According to the html file, the frequency source was
gnomAD_G_EAS: 1.1757%,
However when I checked it out on gnomAD browser (v3.1, Genomes), its allele frequency was 0.8478% [image: image] https://user-images.githubusercontent.com/52628609/99045183-4c664f80-25d4-11eb-805a-4dc4d2637478.png (link: https://gnomad.broadinstitute.org/variant/rs141509078?dataset=gnomad_r3)
Although the difference between 1.175% and 0.8478% is not so big, I am a little curious where this difference comes from. And for this case, my sample is from East Asians, and I'm not sure if its okay to remove gnomAD_G_EAS from frequency sources.
Again, thank you for the great software, and if available I'll be waiting for your reply!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/exomiser/Exomiser/issues/386#issuecomment-726601056, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABHO4PGLBCIZOYJUIUJTJ63SPTU3XANCNFSM4TQHFHWQ .
Hi Sungeun, Can you confirm which version of the Exomiser databases you are using so we are sure which version of GnomAD went into our build. I suspect the frequency has flipped between the 1% threshold between GnomAD releases. Best wishes Damian …
Hi,@damiansm the database and exomiser version I'm currently using is as following:
exomiser version: 12.1.0 database version: 2007_hg19
If this problem is due to gnomAD releases, can you recommend a way I can deal with this?
Hi Sungeon,
We will get back to you soon with a more detailed response while we check exactly where this difference came from. For now though the only way you can deal with it in your queries if you want to keep that variant is to up your MAF filtering from 1% to 1.25%. You will obviously get some more false positives though
Best wishes Damian
On Fri, Nov 13, 2020 at 9:38 AM Sungeun Hong [email protected] wrote:
Hi Sungeun, Can you confirm which version of the Exomiser databases you are using so we are sure which version of GnomAD went into our build. I suspect the frequency has flipped between the 1% threshold between GnomAD releases. Best wishes Damian … <#m_5056933858742594577_>
Hi, the database and exomiser version I'm currently using is as following:
exomiser version: 12.1.0 database version: 2007_hg19
If this problem is due to gnomAD releases, can you recommend a way I can deal with this?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/exomiser/Exomiser/issues/386#issuecomment-726658928, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABHO4PDI3K2QBMG6RD6LNGDSPT5BDANCNFSM4TQHFHWQ .
Okay, thanks for your effort to check, and again for the great software!
Best regards, Sungeun
Exomiser data hg19_2007
| source | rs141509078 | rs146768859 |
|---|---|---|
| THOUSAND_GENOMES | 0.13980001 | 0.13980001 |
| TOPMED | 0.077250004 | 0.05415 |
| ESP_EUROPEAN_AMERICAN | - | 0.0116 |
| ESP_ALL | - | 0.0077 |
| EXAC_AMERICAN | 0.00865651 | - |
| EXAC_EAST_ASIAN | 0.67052025 | 1.2056044 |
| EXAC_OTHER | 0.11013216 | - |
| GNOMAD_E_AMR | 0.014899577 | - |
| GNOMAD_E_EAS | 0.7654836 | 0.70774865 |
| GNOMAD_E_OTH | 0.03652301 | - |
| GNOMAD_E_SAS | 0.003248863 | - |
| GNOMAD_G_AFR | 0.011457379 | - |
| GNOMAD_G_EAS | 1.1757426 | 0.80147964 |
| GNOMAD_G_OTH | - | 0.10183299 |
You need to compare with gnomAD v2 which is what's being used in the Exomiser database. gnomAD v3 is genomes only and only called against hg38. Notice if you uncheck the genomes/exomes box in the link below the frequencies change.
https://gnomad.broadinstitute.org/variant/rs141509078?dataset=gnomad_r2_1 Genomic coordinates: 15-72191053-T-G (GRCh37)
| Datasource | AC | AN | AF (%) |
|---|---|---|---|
| v2.1.1 exomes EAS | 145 | 18390 | 0.7885 |
| v2.1.1 genomes EAS | 18 | 1556 | 1.157 |
| v3 genomes EAS | 44 | 5190 | 0.8478 |
Here you can see that, on its own the genomes EAS has a frequency of 1.157% gnomAD add and remove samples and change their filtering with each release so the sample size and hence the frequency will change between releases.
The Exomiser hg19_2007 source data was downloaded from:
http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/ExAC.0.3.GRCh37.vcf.gz http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.exomes.r2.0.1.sites.noVEP.vcf.gz http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.genomes.r2.0.1.sites.noVEP.vcf.gz
$ tabix http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/ExAC.0.3.GRCh37.vcf.gz 15:72191053-72191053
15 72191053 rs141509078 T G 182588.64 PASS AC=60;AC_AFR=0;AC_AMR=1;AC_Adj=60;AC_EAS=58;AC_FIN=0;AC_Het=60;AC_Hom=0;AC_NFE=0;AC_OTH=1;AC_SAS=0;AF=4.942e-04;AN=121412;AN_AFR=10402;AN_AMR=11552;AN_Adj=121358;AN_EAS=8650;AN_FIN=6612;AN_NFE=66726;AN_OTH=908;AN_SAS=16508;...
$ tabix http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.exomes.r2.0.1.sites.noVEP.vcf.gz 15:72191053-72191053
15 72191053 rs141509078 T G 373313.05 PASS AC=140;AF=5.69411e-04;AN=245868;BaseQRankSum=-3.24100e+00;ClippingRankSum=1.71000e-01;DB;DP=9317132;FS=0.00000e+00;InbreedingCoeff=6.00000e-03;MQ=5.93500e+01;MQRankSum=-1.90000e-02;QD=1.34900e+01;ReadPosRankSum=6.93000e-01;SOR=7.02000e-01;VQSLOD=3.67000e+00;VQSR_culprit=MQ;VQSR_POSITIVE_TRAIN_SITE;GQ_HIST_ALT=0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|140;DP_HIST_ALT=0|0|0|0|0|1|0|1|1|1|4|2|2|4|4|2|2|3|2|0;AB_HIST_ALT=0|0|0|0|0|0|1|2|15|53|44|16|8|0|1|0|0|0|0|0;GQ_HIST_ALL=41|26|24|49|66|38|70|79|44|87|109|83|310|107|306|142|440|129|603|120383;DP_HIST_ALL=33|126|174|201|413|608|7577|45743|32994|13527|5923|3386|2016|1535|1219|1045|794|753|693|673;AB_HIST_ALL=0|0|0|0|0|0|1|2|15|53|44|16|8|0|1|0|0|0|0|0;AC_AFR=0;AC_AMR=5;AC_ASJ=0;AC_EAS=132;AC_FIN=0;AC_NFE=0;AC_OTH=2;AC_SAS=1;AC_Male=68;AC_Female=72;AN_AFR=15300;AN_AMR=33558;AN_ASJ=9844;AN_EAS=17244;AN_FIN=22290;AN_NFE=111376;AN_OTH=5476;AN_SAS=30780;
$ tabix http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.genomes.r
2.0.1.sites.noVEP.vcf.gz 15:72191053-72191053
15 72191053 rs141509078 T G 5998.03 PASS AC=20;AF=6.45953e-04;AN=30962;AC_AFR=1;AC_AMR=0;AC_ASJ=0;AC_EAS=19;AC_FIN=0;AC_NFE=0;AC_OTH=0;AC_Male=14;AC_Female=6;AN_AFR=8728;AN_AMR=838;AN_ASJ=302;AN_EAS=1616;AN_FIN=3494;AN_NFE=15004;AN_OTH=980;AN_Male=17112;AN_Female=13850;AF_AFR=1.14574e-04;AF_AMR=0.00000e+00;AF_ASJ=0.00000e+00;AF_EAS=1.17574e-02;AF_FIN=0.00000e+00;AF_NFE=0.00000e+00;AF_OTH=0.00000e+00;AF_Male=8.18139e-04;AF_Female=4.33213e-04;GC_AFR=4363,1,0;GC_AMR=419,0,0;GC_ASJ=151,0,0;GC_EAS=789,19,0;GC_FIN=1747,0,0;GC_NFE=7502,0,0;GC_OTH=490,0,0;GC_Male=8542,14,0;GC_Female=6919,6,0;AC_raw=20;AN_raw=30992;AF_raw=6.45328e-04;GC_raw=15476,20,0;GC=15461,20,0;AC_POPMAX=19;AN_POPMAX=1616;AF_POPMAX=1.17574e-02
Extracting the EAS data gives us these values:
| Datasource | AC | AN | AF (%) (AF = (AC/AN) *100) |
|---|---|---|---|
| ExAC EAS | 58 | 8650 | 0.67 |
| v2.0.1 exomes EAS | 132 | 17244 | 0.765 |
| v2.1.1 exomes EAS | 145 | 18390 | 0.7885 |
| v2.0.1 genomes EAS | 19 | 1616 | 1.176 |
| v2.1.1 genomes EAS | 18 | 1556 | 1.157 |
| v3 genomes EAS | 44 | 5190 | 0.8478 |
Which matches what you get from the Exomiser hg19_2007 dataset, so I don't see any errors here.
A possible solution would be that if you're analysing an exome it might be better to only use exome sources as the number of samples used in the gnomAD genomes GRCh37 dataset was relatively small which will lead to inaccurate/ higher frequencies.