duphold icon indicating copy to clipboard operation
duphold copied to clipboard

Error running Duphold, "expected AD field in snps VCF"

Open johnemajor opened this issue 5 years ago • 12 comments

I'm running duphold v 0.2.1

The command I am running is: ~/install_crap/duphold/duphold -v ./results.vcf.gz -b /efs/WGS/data/WGS/ILMN_exptA/b37/KC_downsampledBAM_and_VCF/NA24385/40x/S7508/NA24385.40x.S7508.aligned.deduped.sort.bam -f /efs/WGS/data/reference/human/human_g1k_v37_modified.fasta/human_g1k_v37_modified.fasta -s ./x.vcf.gz -t 96 -o ./duphold.vcf

It spins for a while, then returns the error: "expected AD field in snps VCF"

the thing is, my VCF files have the AD field present..... I created a small vcf to test this more directly, and here are the full contents of the x.vcf.gz

`##fileformat=VCFv4.3 ##reference=human_g1k_v37_modified ##octopus=<version=0.6.3-beta_HEAD_5961a546,command="octopus --reference /home/kcibul/wgs_resources/data/reference/human/human_g1k_v37_modified.fasta
/human_g1k_v37_modified.fasta --reads NA24385.40x.S7508.aligned.deduped.sort.bam -t regions.bed --forest-file /home/kcibul/wgs_resources/data/referen
ce/forests/DC/germline.v0.7.0.forest -o NA24385.40x.S7508.octopus.tmp.vcf.gz --threads 192 -X 10000MB -B 38400 MB --sequence-error-model /home/kcibul
/wgs_resources/data/reference/octopus_err_models/novaseq.4a38e55.model --annotations AD ADP AF ARF BQ CRF DP FRF GC GQ MC MF MQ MQ0 QD QUAL SB STR_LE
NGTH STR_PERIOD --max-indel-errors 32 --duplicate-read-detection-policy AGGRESSIVE --max-haplotypes=400 --min-forest-quality=8",options="--allow-mark
ed-duplicates no --allow-octopus-duplicates no --allow-pileup-candidates-from-likely-misaligned-reads no --allow-qc-fails no --allow-reads-with-good-
decoy-supplementary-alignments no --allow-reads-with-good-unplaced-or-unlocalized-supplementary-alignments no --allow-secondary-alignments no --allow
-supplementary-alignments no --annotations[0] AD --annotations[1] ADP --annotations[2] AF --annotations[3] ARF --annotations[4] BQ --annotations[5] C
RF --annotations[6] DP --annotations[7] FRF --annotations[8] GC --annotations[9] GQ --annotations[10] MC --annotations[11] MF --annotations[12] MQ --
annotations[13] MQ0 --annotations[14] QD --annotations[15] QUAL --annotations[16] SB --annotations[17] STR_LENGTH --annotations[18] STR_PERIOD --asse
mble-all no --assembler-mask-base-quality 10 --backtrack-level NONE --bad-region-tolerance NORMAL --bamout-type MINI --caller population --consider-u
nmapped-reads no --contig-output-order REFERENCE_INDEX --contig-ploidies[0] Y=1 --contig-ploidies[1] chrY=1 --contig-ploidies[2] MT=1 --contig-ploidi
es[3] chrM=1 --denovo-filter-expression QUAL < 50 | PP < 40 | GQ < 20 | MQ < 30 | AF < 0.1 | SB > 0.95 | BQ < 20 | DP < 10 | DC > 1 | MF > 0.2 | FRF \

0.5 | MP < 30 | MQ0 > 2 --denovo-indel-mutation-rate 1e-09 --denovo-snv-mutation-rate 1.3e-08 --denovos-only no --disable-adapter-masking no --disa
ble-assembly-candidate-generator no --disable-call-filtering no --disable-denovo-variant-discovery no --disable-downsampling no --disable-inactive-fl
ank-scoring no --disable-overlap-masking no --disable-pileup-candidate-generator no --disable-read-preprocessing no --disable-repeat-candidate-genera
tor no --dont-model-mapping-quality no --dont-protect-reference-haplotype no --downsample-above 1000 --downsample-target 500 --dropout-concentration
2 --duplicate-read-detection-policy AGGRESSIVE --extension-level NORMAL --fallback-kmer-gap 10 --fast no --filter-expression QUAL < 10 | MQ < 10 | MP
< 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1 --forest-file germline.v0.7.0.forest --good-base-quality 20 --haplotype-holdout-threshold 2500 --hap
lotype-overflow 200000 --ignore-unmapped-contigs no --indel-heterozygosity 0.0001 --keep-unfiltered-calls no --kmer-sizes[0] 10 --kmer-sizes[1] 15 --
kmer-sizes[2] 20 --lagging-level NORMAL --mask-3prime-shifted-soft-clipped-heads no --mask-inverted-soft-clipping no --mask-soft-clipped-bases no --m
ask-soft-clipped-boundary-bases 2 --max-assemble-region-overlap 200 --max-bubbles 30 --max-clones 3 --max-genotypes 5000 --max-haplotypes 400 --max-h
oldout-depth 20 --max-indel-errors 32 --max-joint-genotypes 1000000 --max-open-read-files 250 --max-phylogeny-size 3 --max-read-length 10000 --max-re
ference-cache-footprint 10GB --max-region-to-assemble 600 --max-somatic-haplotypes 2 --max-variant-size 2000 --max-vb-seeds 12 --min-bubble-score 2 -
-min-candidate-credible-vaf-probability 0.75 --min-clone-frequency 0.01 --min-credible-somatic-frequency 0.01 --min-denovo-posterior 3 --min-expected
-somatic-frequency 0.03 --min-forest-quality 8 --min-good-bases 20 --min-kmer-prune 2 --min-mapping-quality 20 --min-phase-score 10 --min-pileup-base
-quality 20 --min-protected-haplotype-posterior 100 --min-refcall-posterior 2 --min-somatic-posterior 0.5 --min-variant-posterior 1 --model-posterior
UnknownType(N7octopus7options20ModelPosteriorPolicyE) --no-adapter-contaminated-reads no --no-reads-with-decoy-supplementary-alignments no --no-read
s-with-distant-segments no --no-reads-with-unmapped-segments no --no-reads-with-unplaced-or-unlocalized-supplementary-alignments no --normal-contamin
ation-risk LOW --num-fallback-kmers 10 --one-based-indexing no --organism-ploidy 2 --output NA24385.40x.S7508.octopus.tmp.vcf.gz --read-linkage PAIRE
D --reads[0] NA24385.40x.S7508.aligned.deduped.sort.bam --refcall-block-merge-quality 10 --refcall-filter-expression QUAL < 2 | GQ < 20 | MQ < 10 | D
P < 10 | MF > 0.2 --reference human_g1k_v37_modified.fasta --regions-file regions.bed --resolve-symlinks no --sequence-error-model /home/kcibul/wgs_r
esources/data/reference/octopus_err_models/novaseq.4a38e55.model --sites-only no --snp-heterozygosity 0.001 --snp-heterozygosity-stdev 0.01 --somatic
-credible-mass 0.9 --somatic-filter-expression QUAL < 2 | GQ < 20 | MQ < 30 | SMQ < 40 | SB > 0.9 | SD > 0.9 | BQ < 20 | DP < 3 | MF > 0.2 | NC > 1 |
FRF > 0.5 --somatic-indel-mutation-rate 1e-06 --somatic-snv-mutation-rate 0.0001 --somatics-only no --split-long-reads no --target-read-buffer-footp
rint 38.4kB --temp-directory-prefix octopus-temp --threads 192 --tumour-germline-concentration 1.5 --use-filtered-source-candidates no --use-independ
ent-genotype-priors no --use-preprocessed-reads-for-filtering no --use-same-read-profile-for-all-samples no --use-uniform-genotype-priors no --varian
t-discovery-mode ILLUMINA --very-fast no"> ##INFO=<ID=RFQUAL_ALL,Number=1,Type=Float,Description="Combined quality score for call using product of sample RFQUAL"> ##INFO=<ID=STR_PERIOD,Number=1,Type=String,Description="Period of overlapping STR"> ##INFO=<ID=STR_LENGTH,Number=1,Type=String,Description="Length of overlapping STR"> ##INFO=<ID=QD,Number=1,Type=String,Description="QUAL divided by DP"> ##INFO=<ID=MQ0,Number=1,Type=String,Description="Number of reads overlapping the call with mapping quality zero"> ##INFO=<ID=MQ,Number=1,Type=String,Description="Mean mapping quality of reads overlapping the call"> ##INFO=<ID=GC,Number=1,Type=String,Description="GC bias of the reference in a window centred on the call"> ##INFO=<ID=CRF,Number=1,Type=String,Description="Fraction of clipped reads covering the call"> ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Combined depth across samples"> ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data"> ##INFO=<ID=END,Number=1,Type=Integer,Description="End position on CHROM"> ##INFO=<ID=MP,Number=1,Type=Float,Description="Model posterior"> ##INFO=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality (phred-scaled)"> ##FORMAT=<ID=DP,Number=1,Type=String,Description="Number of read overlapping the call"> ##FORMAT=<ID=MQ,Number=1,Type=Integer,Description="RMS mapping quality"> ##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set"> ##FORMAT=<ID=PQ,Number=1,Type=Integer,Description="Phasing quality"> ##FORMAT=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth"> ##FORMAT=<ID=ADP,Number=1,Type=String,Description="Number of reads overlapping the position that could be assigned to an allele"> ##FORMAT=<ID=AF,Number=1,Type=String,Description="Empirical minor allele frequency of ALT alleles (AD / ADP)"> ##FORMAT=<ID=ARF,Number=1,Type=String,Description="Fraction of reads overlapping the call that cannot be assigned to a unique haplotype"> ##FORMAT=<ID=BQ,Number=1,Type=String,Description="Median base quality of reads supporting each allele"> ##FORMAT=<ID=FRF,Number=1,Type=String,Description="Fraction of reads filtered for calling"> ##FORMAT=<ID=MC,Number=1,Type=String,Description="Number of mismatches at variant position in reads supporting variant haplotype"> ##FORMAT=<ID=MF,Number=1,Type=String,Description="Fraction of reads with mismatches at variant position"> ##FORMAT=<ID=SB,Number=1,Type=String,Description="Strand bias of reads based on haplotype support"> ##FORMAT=<ID=RFQUAL,Number=1,Type=Float,Description="Empirical quality score from random forest classifier"> ##FORMAT=<ID=FT,Number=1,Type=String,Description="Sample genotype filter indicating if this genotype was 'called'"> ##FILTER=<ID=PASS,Description="All filters passed"> ##FILTER=<ID=RF8.6,Description="All filters passed"> ##FILTER=<ID=RF,Description="Random Forest filtered"> ##contig=<ID=1,length=249250621> ##contig=<ID=2,length=243199373> ##contig=<ID=3,length=198022430> ##contig=<ID=4,length=191154276> ##contig=<ID=5,length=180915260> ##contig=<ID=4,length=191154276> ##contig=<ID=5,length=180915260> ##contig=<ID=6,length=171115067> ##contig=<ID=7,length=159138663> ##contig=<ID=8,length=146364022> ##contig=<ID=9,length=141213431> ##contig=<ID=10,length=135534747> ##contig=<ID=11,length=135006516> ##contig=<ID=12,length=133851895> ##contig=<ID=13,length=115169878> ##contig=<ID=14,length=107349540> ##contig=<ID=15,length=102531392> ##contig=<ID=16,length=90354753> ##contig=<ID=17,length=81195210> ##contig=<ID=18,length=78077248> ##contig=<ID=19,length=59128983> ##contig=<ID=20,length=63025520> ##contig=<ID=21,length=48129895> ##contig=<ID=22,length=51304566> ##contig=<ID=X,length=155270560> ##contig=<ID=Y,length=59373566> ##contig=<ID=MT,length=16569> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA24385.40x.S7508 1 10583 . G A 201.31 PASS RFQUAL_ALL=2.76;STR_PERIOD=0;STR_LENGTH=0;QD=3.097;MQ0=25;MQ=46.104;GC=0.710;CRF=0.111;AC=1;A
N=2;DP=65;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|0:201:65:51:10583:100:16:76:0.211:0.074:37:0.198:0:0.000
:0.244:2.76:RF 1 10616 . CCGCCGTTGCAAAGGCGCGCCG C 356.94 RF;RF8.6 RFQUAL_ALL=4.51;STR_PERIOD=5;STR_LENGTH=12;QD=15.519;MQ0=25;MQ=30.110
;GC=0.710;CRF=0.222;AC=2;AN=2;DP=23;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|1:110:23:42:10583:100:81:81:1.00
0:0.580:.:0.716:47:0.580:.:4.51:RF 1 12783 . G A 1271 RF;RF8.6 RFQUAL_ALL=7.23;STR_PERIOD=7;STR_LENGTH=15;QD=18.420;MQ0=353;MQ=15.787;GC=0.588;CRF=0
.000;AC=2;AN=2;DP=69;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|1:110:69:24:12783:100:150:150:1.000:0.000:37:0.6
08:0:0.000:.:7.23:RF `

I ran pyvcf on this VCF file, and the FORMAT AD field appears in it. I ran duphold on VCF files generated by Dragen and Sentieon, they both return the same error.

Any advice for how to fix this? I am eager to apply duphold to a clinical product I am developing, but this roadblock has me blocked.

Thanks! John Major

johnemajor avatar Jan 11 '20 05:01 johnemajor

hi, you can run duphold without a snp vcf and just find changes in depth in your SVs. your SNP vcf has:

##INFO=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">

which does not follow spec. It should be:

##INFO=<ID=AD,Number=R,Type=Integer,Description="allele depths">

brentp avatar Jan 11 '20 14:01 brentp

My SNP VCFs actually have the AD field in the FORMAT section, not the INFO section. AD should be in the FORMAT section as it could be used for a multi-sample VCF.....

But, I tried hacking my snp VCF to have the INFO header you name above and moving the AD field to INFO, I get the same error.

johnemajor avatar Jan 22 '20 19:01 johnemajor

shoot. I mean FORMAT, not INFO. what does your VCF have for AD in FORMAT?

brentp avatar Jan 22 '20 19:01 brentp

oh, I see your AD is:

##FORMAT=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">

should be Number=R, type=Integer and describe the depths for each allele.

brentp avatar Jan 22 '20 20:01 brentp

Actually, AD is specified as both an INFO field AND a FORMAT field. The INFO version is presumably the sum of all the FORMAT ones, although I don’t see why that’s particularly useful.

johnemajor avatar Jan 22 '20 20:01 johnemajor

But, yes, mine is in FORMAT. And it should appear as : ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="allele depths"> . ?

johnemajor avatar Jan 22 '20 20:01 johnemajor

the FORMAT is the one that duphold uses and yes, that is correct. However, you can't simply change the header, you'll have to adjust the values for every record as well (or use a VCF from another caller)

brentp avatar Jan 22 '20 20:01 brentp

The variant FORMAT record AD fields contain integers.... how would they need to be adjusted?

1 10583 . G A 201.31 RF;RF8.6 RFQUAL_ALL=2.76;STR_PERIOD=0;STR_LENGTH=0;QD=3.097;MQ0=25;MQ=46.104;GC=0.710;CRF=0.111;AC=
1;AN=2;DP=65;NS=1 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|0:201:65:51:10583: 100:16:76:0.211:0.074:37:0.198:0:0.000:0.244:2.7
6:RF

The AD entry is a valid integer.....

johnemajor avatar Jan 23 '20 01:01 johnemajor

AD should have multiple values for each samples. for a bi-allelic variant, it should have 2 values, the first indicates the number of reads supporting the reference allele and the 2nd indicates the number of reads supporting the alternate.

I still suggest to run duphold without the snp vcf as it's not needed and doesn't add much for most cases.

brentp avatar Jan 23 '20 01:01 brentp

Well, it appears that just changing the header did the trick. Duphold is running now that I modified the snp.vcf to have ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="allele depths">

johnemajor avatar Jan 23 '20 01:01 johnemajor

it won't give you correct results.

brentp avatar Jan 23 '20 01:01 brentp

ok- will give it a shot Thank You

johnemajor avatar Jan 23 '20 01:01 johnemajor