hap.py icon indicating copy to clipboard operation
hap.py copied to clipboard

Warning in log file ("Too many AD fields") for gatk-vcf only

Open ghost opened this issue 6 years ago • 3 comments

Hi Peter, I am using hap.py to compare vcf files against the GIAB reference in two separate runs for freebayes (freebayes.vcf) and GATK4 HaplotypeCaller (gatk.vcf), respectively. Both vcf files were produced using the same reference and bed-files.

For both hap.py runs I get proper output files without any error messages.

example command for the GATK4 run:

python2 bin/hap.py \
/path/to/giab.vcf \ 
/path/to/gatk.vcf \
-o /path/to/out_dir/gatk \
-f /path/to/giab.bed \
-r /path/to/GRCh37.fa \
-T /path/to/my_targets.bed \
--logfile /path/to/out_dir/gatk.log

However, the logfile for the hap.py run using gatk.vcf gives me a total of 166 lines of warnings but it does not appear when I use freebayes.vcf. Here, the first and the last three rows are shown:

2019-06-06 12:44:33,316 WARNING [W] too many AD fields at chr10:51486091 max_ad = 2 retrieved: 3 2019-06-06 12:44:36,366 WARNING [W] too many AD fields at chr11:130780269 max_ad = 2 retrieved: 3 2019-06-06 12:44:39,376 WARNING [W] too many AD fields at chr13:114453604 max_ad = 2 retrieved: 3 [...] 2019-06-06 13:08:35,944 WARNING [W] too many AD fields at chrX:1317598 max_ad = 2 retrieved: 3 2019-06-06 13:08:40,998 WARNING [W] too many AD fields at chr8:125759667 max_ad = 2 retrieved: 3 2019-06-06 13:08:42,328 WARNING [W] too many AD fields at chr7:100424586 max_ad = 2 retrieved: 3

Although this warning and the corresponding error messages have been reported in other issues (e.g. https://github.com/Illumina/hap.py/issues/41), in my case I only get a warning without any error message. By using the --no-decompose and --no-leftshift parameters in another GATK4 run I could get rid of the the warnings. However, I actually would like to keep the left shift.

I checked the gatk.vcf for some of the locations shown in the warnings assuming that the gatk.vcf might contain problematic entries. I realized that many of the locations shown in the warnings aren't actually entries in the gatk.vcf. Maybe I don't understand the warning message correctly, but to me it seems that it should refer to entries that are present in the actually used gatk.vcf file.

Beside the missing entries, I could detect one entry that is present in both vcf-files: chr11:130780269. The two corresponding vcf entries are shown here:

freebayes.vcf: chr11 130780269 . CCAAAA C 2801 . MQM=60 GT:DP:AO 0/1:115:53

gatk.vcf: chr11 130780269 . CCAAAA C 2029.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.303;DP=112;ExcessHet=3.0103;FS=1.617;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=20.30;ReadPosRankSum=0.271;SOR=0.787 GT:AD:DP:GQ:PL 0/1:48,52:100:99:2037,0,1800

Now I have some questions regarding the mentioned problems:

  1. Why aren't many of the entries mentioned in the warnings actually present in the gatk.vcf? What does the warning actually refer to?
  2. Regarding chr11:130780269, is there anything weird in gatk.vcf that could explain the error? To me both vcf files/entries seem to be valid.
  3. How do the warning messages actually influence my final results? Since many of the entries seem not to be in the actual gatk.vcf, I don't know whether I should bother about this warning at all (as long as there is not error message terminating the program).

It would be nice if you could help me :)

ghost avatar Jun 06 '19 12:06 ghost

I suspect that decomposition somehow doesn’t get the AD values quite right. I think the easiest way to fix this would be to drop the AD fields in advance using bcftools. They are not used by hap.py.

I think the following happens:

  • hap.py preprocess splits the alleles into individual rows.
  • these rows then generate the errors
  • preprocess left-shifts so that the records cannot be matched directly with the original input records anymore.

Hope this helps!

pkrusche avatar Jun 11 '19 22:06 pkrusche

I have had this issue and for me it is with the vcf lines with many alleles listed/observed but only few (1 or 2) of them called. Once I droped the extra alleles the issue gets fixed.

gkaur avatar Mar 04 '20 20:03 gkaur

Hi there @pkrusche,

After solving my own #174 issue, I'm having the exact same problem.

How can I drop the AD field with bcftools. Thanks in advance!

Overcraft90 avatar Apr 18 '23 18:04 Overcraft90