hap.py
How is the false positive rate calculated in som.py stats?
Dear hap.py developer team, I have a question regarding the output of som.py.
- Question: I ran som.py (v0.3.15) on a short-variant callset and a ground truth set. The tool ran successfully and the results seem reasonable. However, near the end of each line in the `<prefix>.sompy.stats.csv` file I noticed a field `fp.rate`, which made me wonder: how exactly is this value computed?
- Background: The false positive rate (FPR) is commonly defined as FP/(FP+TN). Hence, I presume TN is computed at some point. There is a README page dedicated to som.py, but the number of True Negatives (TN) is not defined there. The bioRxiv preprint of hap.py+som.py even has a paragraph on this, stating that TN are not included due to the lack of a clear definition (with which I strongly agree!):
Note that we have chosen not to include true negatives (or consequently specificity) in our standardized definitions. This is due to the challenge in defining the number of true negatives, particularly around complex variants. In addition, precision is often a more useful metric than specificity due to the very large proportion of true negative positions in the genome.
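
To make the distinction concrete, here is a minimal sketch of the two textbook definitions. The function names and the TN value are hypothetical and only for illustration; nothing here is taken from the som.py source. It shows that precision needs only TP and FP (both present in the stats file), whereas FPR cannot be evaluated without some TN count.

```python
def false_positive_rate(fp: int, tn: int) -> float:
    """Textbook FPR = FP / (FP + TN). Requires a true-negative count."""
    if fp + tn == 0:
        raise ValueError("FPR is undefined when FP + TN == 0")
    return fp / (fp + tn)


def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP). Needs no true-negative count."""
    if tp + fp == 0:
        raise ValueError("precision is undefined when TP + FP == 0")
    return tp / (tp + fp)


# TP and FP taken from the indels row of my output.
tp, fp = 151, 2
print("precision:", precision(tp, fp))

# TN is NOT reported anywhere in the stats file; this value is made up
# purely to show that the textbook FPR depends on it.
hypothetical_tn = 998
print("textbook FPR with a made-up TN:", false_positive_rate(fp, hypothetical_tn))
```

Since the stats file contains no TN column, the reported `fp.rate` presumably follows some other definition.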
- Example: Here is an example of my output. There is a non-zero `fp.rate` near the end of the line.

| idx | type | total.truth | total.query | tp | fp | fn | unk | ambi | recall | recall_lower | recall_upper | recall2 | precision | precision_lower | precision_upper | na | ambiguous | fp.region.size | fp.rate | sompyversion | sompycmd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | indels | 180 | 153 | 151 | 2 | 29 | 0 | 0 | 0.8388888888888889 | 0.7799816161378756 | 0.8870047190333543 | 0.8388888888888889 | 0.9869281045751634 | 0.9587317223755603 | 0.997273934669216 | 0.0 | 0.0 | 29903 | 66.88292144600877 | som.py- | /<path>/bin/som.py --no-fixchr-truth --no-fixchr-query --normalize-all -r <path>/<reference>.fasta -o <prefix>.sompy <truthset>.vcf <callset>.vcf.gz |
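
As a sanity check on the row above, the following sketch recomputes the derived columns from the raw counts. The variable names are my own. The last check is only a guess based on the numbers: the reported `fp.rate` looks like FP divided by `fp.region.size`, scaled to one million bases, rather than FP/(FP+TN). I have not confirmed this against the som.py source.

```python
import math

# Values copied from the indels row of <prefix>.sompy.stats.csv.
total_truth = 180
total_query = 153
tp, fp, fn = 151, 2, 29
fp_region_size = 29903
reported_recall = 0.8388888888888889
reported_precision = 0.9869281045751634
reported_fp_rate = 66.88292144600877

# Standard definitions, which reproduce the reported columns.
recall = tp / total_truth
precision = tp / (tp + fp)

# ASSUMPTION (a guess from the numbers, not from the som.py source):
# fp.rate is the number of false positives per megabase of fp.region.size.
fp_per_megabase = fp / fp_region_size * 1e6

print("recall matches:   ", math.isclose(recall, reported_recall, rel_tol=1e-9))
print("precision matches:", math.isclose(precision, reported_precision, rel_tol=1e-9))
print("fp.rate matches FP per Mb:",
      math.isclose(fp_per_megabase, reported_fp_rate, rel_tol=1e-6))
```

If this reading is right, `fp.rate` would be a per-region density and no TN count would be needed, but I would appreciate confirmation from the developers.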