pandora
Differences in gene presence/absence
I have been looking at differences between two almost identical Klebsiella isolates (KN0056A-F and KN0056A-L) in the Pandora VCF output. For several regions in the reference, Pandora reports zero coverage for one isolate (so the gene appears absent) while the other isolate has coverage (so the gene appears present). However, when I check these genes in the de novo assemblies, I find them in both isolates (and with zero differences between them).
This is one example:
##contig=<ID=Cluster_560>
Cluster_560 1 . CGTA CGTG . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 1 . CGTAAAGCACCTCGACGCCATTCAGAATTTCGGCGCGATGGACATCCTCTGCACCGATAAAACCGGCACCTTGACCCAGGATAAGATTGTGCTGGAGAACCATACCGACGTCTCCGGCAAGGTCAGCGAGCGCGTACTGCATGCCGCTTGGCTGAACAGCCACTACCAGACCGGCCTGAAAAATCTGCTCGACACCGCGGTGCTGGACGGGGTTGAGCTGGATGCCGCCCGCGGGCTGGCGGCGCGCTGGCAGAAAGTGGATGAGATCCCCTTCGATTTCGAACGCCGCCGCATGTCGGTGGTGGTGAAAGAGGAGGACGCCGCGCATCAGCTGATCTGCAAAGGGGCGCTGCAGGAGATCCTCAACGTCTCGACCCAGGTGCGCTACAACGGCGATATCGTACCGCTGGACGACACCATGCTGCGCCGCATTCGCCGGGTGACCGATACCCTCAACCGACAGGGGCTACGGGTGGTGGCGGTGGCGACCAAATACCTGCCGGCCCGCGAAGGCGACTACCAGCGCGCCGATGAGTCGGACCTGATCCTTGAAGG CGTGAAGCACCTCGACGCCATTCAGAATTTCGGCGCGATGGACATCCTCTGCACTGATAAAACCGGCACCCTGACCCAGGATAAGATTGTGCTGGAGAACCATACCGACGTCTCCGGCAAGGTCTGCGAGCGGGTACTGCATGCCGCCTGGCTCAACAGCCACTACCAGACCGGCCTGAAAAACCTGCTCGACACCGCGGTGCTGGACGGGGTTGAGCTGGATGCCGCCCGCGGGCTGGCGGAACGCTGGCAGAAGGTGGATGAGATACCCTTCGACTTCGAACGCCGCCGCATGTCGGTGGTGGTGAAGGAGGATGACGCCGCGCATCAGCTGATCTGCAAAGGGGCGCTGCAGGAGATCCTCAACGTCTCGACCCAGGTGCGCTACAACGGCGATATCGTACCTCTGGATGACACCATGCTGCGCCGCATTCGCCGGGTGACCGATACCCTCAACCGGCAGGGGCTGCGGGTGGTGGCGGTGGCGACTAAATACCTGCCGGCCCGCGAAGGCGACTACCAGCGCGCCGATGAGTCGGACCTGATCCTTGAAGGTTACATCGCCTTCCTCGATCCGCCGAAAGAGACCACCGCCCCGGCGCTGAAGGCGCTGAAGGCCAGCGGCATCACGGTGAAGATCCTCACCGGCGACAGCGAGCTGGTGGCGGCGAAGGTGTGCCATGAAGTGGGACTGGATGCTGGCGAAGTGGTGATTGGCAGCCAGATCGAAGCCATGAGCGACGACGAACTGGCGGCGCTGGCCAAACGCACCACGCTGTTCGCCCGCCTGGCGCCGCTGCATAAAGAGCGTATCGTGACGCTGCTCAAGCGTGAAGGTCACGTGGTGGGCTTTATGGGCGACGGCATCAACGACGCCCCGGCGCTGCGCGCGGCGGATATCG,GTCGGACCTGATCCTTGAAGGTTACATCGCCTTCCTCGATCCGCCGAAAGAGACCACCGCCCCGGCGCTGAAGGCGCTGAAGGCCAGCGGCATCACG . . SVTYPE=COMPLEX;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:2,0,0:1,0,0:0,0,0:0,0,0:140,66,3:98,31,0:0.693548,0.879121,1:-42.3946,-70.1891,-73.8155:27.7945 .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560 28 . TTTC CTTT . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 43 . C T . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 55 . C T . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 71 . TTGACC CTGACC,CTGACT . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-60,-60,-60:0 .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560 125 . A T . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 133 . C G . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 148 . TTGGCTG CTGGCTC,CTGGCTT . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-60,-60,-60:0 .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560 175 . CC GT . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 184 . T C . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 208 . CGGG GGGC . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 223 . T G . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 243 . CG AA . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 256 . AGTG GGTA,GGTG . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-60,-60,-60:0 .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560 268 . C A . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 277 . T C . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 310 . AGAGGAG GGAGGAT . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,2:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 340 . C T . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:5,0:3,0:5,0:3,0:10,0:6,0:0,1:-13.395,-96.8414:83.4463 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 373 . G A . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:6,0:5,0:9,0:7,0:19,0:15,0:0.333333,1:-20.0891,-110.657:90.5677 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 403 . A G . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:12,0:10,0:12,0:10,0:24,0:20,0:0,1:-3.64484,-161.314:157.669 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 430 . CATTCGC CATCCGT,TATTCGC . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:10,0,0:9,0,0:11,0,0:10,0,0:32,0,0:28,0,0:0,1,1:-4.71713,-147.498,-147.498:142.781 .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560 430 . CATTCGCCGGGTGACCGATACCCTCAACCGACAGGGGCTA CATTCGCCGGGTCACTGACACCCTCAACCGGCAAGGGCTG,CATTCGCCGGGTGACCGATACCCTCAACCGTCAGGGACTG,CATTCGCCGGGTGACCGATACCCTGAACCGTCAGGGACTG . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:8,1,5,4:6,1,4,4:9,0,4,0:8,0,4,0:66,12,32,32:54,10,28,28:0.125,0.875,0.5,0.571429:-98.8227,-192.901,-137.715,-145.667:38.8924 .:0,0,0,0:0,0,0,0:0,0,0,0:0,0,0,0:0,0,0,0:0,0,0,0:1,1,1,1:-88,-88,-88,-88:0
Cluster_560 430 . CATTCGCCGGGTGACCGATACCCTCAACCGACAGGGGCTACGGGTGGTGGCGGTGGCGACCAAATACCTGCCGGCCCGCGAAGGCGACTACCAGCGCGCCGATGAGTCGGACCTGATCCTTGAAGG CATCCGTCGGGTGACCGATACCCTCAACCGACAGGGG,CATCCGTCGGGTGACCGATACCCTCAACCGGCAGGGG,CATCCGTCGGGTGACCGATAGCCTCAACCGACAGGGG,CATCCGTCGGGTGACCGATAGCCTCAACCGGCAGGGG,CATTCGCCGGGTGACCGATACCCTCAACCGACAGGGG,CATTCGCCGGGTGACCGATACCCTCAACCGGCAGGGG,CATTCGCCGGGTGACCGATAGCCTCAACCGACAGGGG,CATTCGCCGGGTGACCGATAGCCTCAACCGGCAGGGG,TATTCGCCGGGTGACCGATACCCTCAACCGACAGGGG,TATTCGCCGGGTGACCGATACCCTCAACCGGCAGGGG,TATTCGCCGGGTGACCGATAGCCTCAACCGACAGGGG,TATTCGCCGGGTGACCGATAGCCTCAACCGGCAGGGG . . SVTYPE=COMPLEX;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 5:5,0,0,0,0,11,7,7,3,0,0,0,0:3,0,0,0,0,10,6,6,3,0,0,0,0:5,0,0,0,0,11,11,11,0,0,0,0,0:3,0,0,0,0,10,10,10,0,0,0,0,0:87,0,0,0,0,23,23,23,23,0,0,0,0:57,0,0,0,0,20,20,20,20,0,0,0,0:0.133333,1,1,1,1,0,0.333333,0.333333,0.666667,1,1,1,1:-261.469,-340.915,-340.915,-340.915,-340.915,-188.162,-239.385,-239.385,-289.456,-340.915,-340.915,-340.915,-340.915:51.223 .:0,0,0,0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0,0,0,0:1,1,1,1,1,1,1,1,1,1,1,1,1:-88,-88,-88,-88,-88,-88,-88,-88,-88,-88,-88,-88,-88:0
Cluster_560 460 . A G . . SVTYPE=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:6,0:5,0:8,0:6,0:27,0:21,0:0.25,1:-17.5891,-110.657:93.0677 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 460 . ACAGGGGCTA GCAGGGACTG . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:6,0:5,0:8,0:5,0:34,0:26,0:0.2,1:-16.0891,-110.657:94.5677 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560 490 . CAAA CAAG,TAAA . . SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:4,1,1:2,1,0:4,0,0:1,0,0:27,8,4:13,5,0:0.166667,0.8,0.75:-34.9876,-80.1269,-85.9402:45.1394 .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560 555 . G GGCG . . SVTYPE=INDEL;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Thoughts?
De novo assemblies here: /hps/nobackup2/iqbal/projects/pandora/klebs/neonate/data/KpST17_Norway_20190617/contigs/patient-pairs
Pandora output here: /hps/nobackup/iqbal/leandro/klebs_neonate_leah/pandora_compare_results
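To list such records systematically rather than by eye, something like the following could be run over the VCF. It is only a minimal sketch: the function names (`parse_record`, `total_coverage`, `is_discordant`) are made up for illustration and are not part of Pandora. It assumes the FORMAT keys shown in the example above, integer coverage values, and whitespace-separated columns. A record is flagged when at least one sample has a called genotype and at least one other sample is uncalled with zero summed coverage.

```python
from typing import Dict, List, NamedTuple

# FORMAT string copied from the example records above.
FMT = ("GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:"
       "SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF")


class Record(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    samples: List[Dict[str, str]]


def parse_record(line: str) -> Record:
    """Parse one VCF data line into a Record (hypothetical helper)."""
    # Split on any whitespace so this works on tab-separated files
    # and on lines pasted with spaces.
    fields = line.split()
    chrom, pos, _id, ref, alt, _qual, _filter, _info, fmt = fields[:9]
    keys = fmt.split(":")
    samples = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return Record(chrom, int(pos), ref, alt.split(","), samples)


def total_coverage(sample: Dict[str, str]) -> int:
    """Sum forward and reverse coverage over all alleles of one sample."""
    total = 0
    for key in ("SUM_FWD_COVG", "SUM_REV_COVG"):
        total += sum(int(v) for v in sample[key].split(","))
    return total


def is_discordant(record: Record) -> bool:
    """True if one sample is called while another is uncalled with no coverage."""
    called = [s["GT"] != "." for s in record.samples]
    empty = [s["GT"] == "." and total_coverage(s) == 0 for s in record.samples]
    return any(called) and any(empty)


# Two records taken from the example above (positions 340 and 43).
LINE_340 = (
    "Cluster_560 340 . C T . . SVTYPE=SNP;GRAPHTYPE=NESTED " + FMT + " "
    "0:5,0:3,0:5,0:3,0:10,0:6,0:0,1:-13.395,-96.8414:83.4463 "
    ".:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0"
)
LINE_43 = (
    "Cluster_560 43 . C T . . SVTYPE=SNP;GRAPHTYPE=NESTED " + FMT + " "
    ".:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 "
    ".:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0"
)

if __name__ == "__main__":
    for line in (LINE_340, LINE_43):
        rec = parse_record(line)
        print(rec.chrom, rec.pos, is_discordant(rec))
```

With these two records, position 340 is flagged (the first isolate is called, the second has no coverage), while position 43 is not, because neither isolate has coverage there. Counting flagged records per cluster would give a rough list of genes to check against the assemblies.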
Thanks for the description! I think the best approach here is to go into debug mode and understand why we see this drop in coverage.
Cheers.
For this issue and https://github.com/rmcolq/pandora/issues/209, before diving into debugging, I was wondering if we could get the expected results by changing some parameters. With the --illumina parameter, the error rate defaults to 0.001, which could be too low. I increased it to 0.01, but there was no effect on these two genes.
Any other parameterization ideas before diving into debugging? Note that this is strictly a mapping issue. It is also worth noting that Leah has noticed this in many genes; it is the main issue she has right now. I am very interested in it because it seems we are undermapping reads (not sure if this is also true for ONT reads), and thus making fewer calls than we could. It seems to me that fixing this could push our recall in the 4-way analysis way up.
What do you think?
This doesn't look like a simple bug to me and, as you say, is likely some combination of parameter and algorithm effects. Worth noting that I have some built-in overrides, which mean that your command-line error rate is not allowed to be higher than 0.1 with the --illumina flag. Changing --min_cluster_size is likely to make more of a difference for increasing detection of genes, but will also increase FPs.
I also think this could take weeks to debug and lead to even more code changes, so I'm keen to get your existing stuff merged in first and to get the results we need. I think improving our overall recall in the 4-way is an optimization for later.