How to calculate filter thresholds from probabilities.tsv
Hi @ArtRand ,
Thanks for your continuous help! I have a naive question about how to calculate filter threshold from probabilities.tsv (generated from sample-probs). In documents, it says Filtering in modkit is performed by first determining the value of q for the lowest n-th percentile of calls (10th percentile by default)..
When I use modkit pileup, I noticed the estimated cut off is 0.994. I would like to ask if this proba could be calculated from probabilities.tsv and how?
code primary_base range_start range_end count frac percentile_rank
- A 0.5 0.50390625 28 0.00000010300491 0.000005150245
- A 0.50390625 0.5078125 430 0.000001581861 0.00008939354
- A 0.5078125 0.51171875 17 0.000000062538696 0.00017161353
- A 0.51171875 0.515625 94176 0.00034644964 0.017497223
- A 0.515625 0.51953125 29 0.00000010668365 0.03482504
- A 0.51953125 0.5234375 495 0.0000018209796 0.034921423
- A 0.5234375 0.52734375 34 0.00000012507739 0.035018723
- A 0.52734375 0.53125 149473 0.0005498733 0.06251864
- A 0.53125 0.53515625 28 0.00000010300491 0.09001746
- A 0.53515625 0.5390625 544 0.0000020012383 0.09012267
- A 0.5390625 0.54296875 31 0.00000011404114 0.09022843
- A 0.54296875 0.546875 108932 0.00040073323 0.11027079
- A 0.546875 0.55078125 32 0.00000011771989 0.13031334
- A 0.55078125 0.5546875 519 0.0000019092695 0.1304147
- A 0.5546875 0.55859375 37 0.00000013611363 0.13051696
- A 0.55859375 0.5625 164917 0.00060668786 0.16085817
- A 0.5625 0.56640625 24 0.00000008828992 0.19119698
- A 0.56640625 0.5703125 564 0.0000020748132 0.19130513
- A 0.5703125 0.57421875 28 0.00000010300491 0.19141401
- A 0.57421875 0.578125 124975 0.00045975138 0.21440673
- A 0.578125 0.58203125 41 0.00000015082861 0.23740186
- A 0.58203125 0.5859375 587 0.0000021594242 0.23751736
- A 0.5859375 0.58984375 45 0.0000001655436 0.23763362
- A 0.58984375 0.59375 182401 0.0006710071 0.27119225
- A 0.59375 0.59765625 44 0.00000016186485 0.30475068
- A 0.59765625 0.6015625 625 0.0000022992167 0.30487373
- A 0.6015625 0.60546875 41 0.00000015082861 0.30499625
- A 0.60546875 0.609375 142829 0.0005254317 0.33127537
- A 0.609375 0.61328125 55 0.00000020233107 0.35755706
- A 0.61328125 0.6171875 653 0.0000024022215 0.3576873
- A 0.6171875 0.62109375 50 0.00000018393733 0.3578166
- A 0.62109375 0.625 199665 0.00073451694 0.39455163
- A 0.625 0.62890625 54 0.00000019865232 0.43128744
- A 0.62890625 0.6328125 659 0.000002424294 0.43141857
- A 0.6328125 0.63671875 40 0.00000014714986 0.43154714
- A 0.63671875 0.640625 162500 0.00059779634 0.46144432
- A 0.640625 0.64453125 697 0.0000025640863 0.49146232
- A 0.64453125 0.6484375 53 0.00000019497357 0.4916003
- A 0.6484375 0.65234375 219464 0.0008073525 0.53197765
- A 0.65234375 0.65625 57 0.00000020968857 0.5723558
- A 0.65625 0.66015625 675 0.000002483154 0.5724904
- A 0.66015625 0.6640625 58 0.0000002133673 0.5726252
- A 0.6640625 0.66796875 183598 0.00067541056 0.60640645
- A 0.66796875 0.671875 77 0.0000002832635 0.64019114
- A 0.671875 0.67578125 822 0.0000030239298 0.6403565
- A 0.67578125 0.6796875 240258 0.0008838483 0.68470013
- A 0.6796875 0.68359375 58 0.0000002133673 0.7289032
- A 0.68359375 0.6875 857 0.0000031526858 0.7290715
- A 0.6875 0.69140625 75 0.000000275906 0.7292429
- A 0.69140625 0.6953125 205152 0.00075470225 0.76699185
- A 0.6953125 0.69921875 930 0.0000034212344 0.80489796
- A 0.69921875 0.703125 80 0.00000029429972 0.80508375
- A 0.703125 0.70703125 262574 0.0009659432 0.85339564
- A 0.70703125 0.7109375 79 0.000000290621 0.90170735
- A 0.7109375 0.71484375 999 0.000003675068 0.90190566
- A 0.71484375 0.71875 228268 0.00083974015 0.94407636
- A 0.71875 0.72265625 78 0.00000028694222 0.9860777
- A 0.72265625 0.7265625 1072 0.0000039436163 0.98628926
- A 0.7265625 0.73046875 86 0.00000031637222 0.98650223
- A 0.73046875 0.734375 284620 0.0010470449 1.0388703
- A 0.734375 0.73828125 1079 0.0000039693678 1.091421
- A 0.73828125 0.7421875 90 0.0000003310872 1.0916361
- A 0.7421875 0.74609375 253623 0.0009330148 1.1383033
- A 0.74609375 0.75 1176 0.000004326206 1.1851704
- A 0.75 0.75390625 87 0.00000032005096 1.1854028
- A 0.75390625 0.7578125 309196 0.0011374537 1.2422913
- A 0.7578125 0.76171875 1315 0.000004837552 1.2994059
- A 0.76171875 0.765625 81 0.0000002979785 1.2996628
- A 0.765625 0.76953125 280681 0.0010325543 1.3513054
- A 0.76953125 0.7734375 1410 0.000005187033 1.4031924
- A 0.7734375 0.77734375 333861 0.00122819 1.4648613
- A 0.77734375 0.78125 89 0.00000032740846 1.5262872
- A 0.78125 0.78515625 1518 0.0000055843375 1.5265827
- A 0.78515625 0.7890625 308465 0.0011347646 1.5836003
- A 0.7890625 0.79296875 141 0.0000005187033 1.6403643
- A 0.79296875 0.796875 1739 0.0000063973403 1.6407102
- A 0.796875 0.80078125 359781 0.0013235431 1.7072072
- A 0.80078125 0.8046875 1699 0.0000062501904 1.7736968
- A 0.8046875 0.80859375 336307 0.0012371882 1.8358687
- A 0.80859375 0.8125 158 0.000000581242 1.8977573
- A 0.8125 0.81640625 2040 0.0000075046432 1.8981615
- A 0.81640625 0.8203125 390495 0.0014365321 1.9703634
- A 0.8203125 0.82421875 2067 0.000007603969 2.04257
- A 0.82421875 0.828125 368439 0.0013553938 2.1107202
- A 0.828125 0.83203125 2193 0.000008067492 2.178893
- A 0.83203125 0.8359375 418594 0.0015399012 2.2562916
- A 0.8359375 0.83984375 2419 0.000008898888 2.3337317
- A 0.83984375 0.84375 169 0.0000006217082 2.3342078
- A 0.84375 0.84765625 399317 0.001468986 2.4076881
- A 0.84765625 0.8515625 2446 0.000008998214 2.4815872
- A 0.8515625 0.85546875 452082 0.0016630952 2.565192
- A 0.85546875 0.859375 2666 0.000009807539 2.6488369
- A 0.859375 0.86328125 434097 0.0015969329 2.7291741
- A 0.86328125 0.8671875 487538 0.0017935288 2.8986971
- A 0.8671875 0.87109375 2927 0.0000107676915 2.9889119
- A 0.87109375 0.875 471184 0.0017333665 3.0761187
- A 0.875 0.87890625 3097 0.000011393078 3.1633563
- A 0.87890625 0.8828125 521776 0.0019194817 3.2599
- A 0.8828125 0.88671875 3263 0.00001200375 3.3564746
- A 0.88671875 0.890625 515147 0.0018950953 3.4518297
- A 0.890625 0.89453125 562330 0.0020686695 3.6500177
- A 0.89453125 0.8984375 3615 0.000013298669 3.7541163
- A 0.8984375 0.90234375 560013 0.0020601458 3.8577886
- A 0.90234375 0.90625 605824 0.002228673 4.0722294
- A 0.90625 0.91015625 607514 0.0022348901 4.295408
- A 0.91015625 0.9140625 4079 0.000015005608 4.4079022
- A 0.9140625 0.91796875 658117 0.0024210457 4.529705
- A 0.91796875 0.921875 657344 0.002418202 4.7716675
- A 0.921875 0.92578125 712816 0.0026222696 5.023691
- - A 0.92578125 0.9296875 718997 0.0026450078 5.287055
- A 0.9296875 0.93359375 771031 0.0028364277 5.5611267
- A 0.93359375 0.9375 782581 0.0028789171 5.8468943
- A 0.9375 0.94140625 834767 0.0030708963 6.144384
- A 0.94140625 0.9453125 856070 0.0031492647 6.4553924
- A 0.9453125 0.94921875 906544 0.0033349458 6.779603
- A 0.94921875 0.953125 1903403 0.0070021376 7.2964563
- A 0.99609375 1 251045924 0.92353433 53.82328
a A 0.5 0.50390625 135533 0.074519046 3.7259524
a A 0.50390625 0.5078125 20 0.000010996443 7.4524546
a A 0.5078125 0.51171875 484 0.0002661139 7.4663095
a A 0.51171875 0.515625 14 0.00000769751 7.48
a A 0.515625 0.51953125 80762 0.044404734 9.700622
a A 0.51953125 0.5234375 19 0.000010446621 11.921381
a A 0.5234375 0.52734375 451 0.0002479698 11.934302
a A 0.52734375 0.53125 20 0.000010996443 11.94725
a A 0.53125 0.53515625 121830 0.06698483 15.297042
a A 0.53515625 0.5390625 17 0.000009346976 18.64675
a A 0.5390625 0.54296875 423 0.00023257476 18.658846
a A 0.54296875 0.546875 17 0.000009346976 18.670942
a A 0.546875 0.55078125 69130 0.038009204 20.571869
a A 0.55078125 0.5546875 21 0.000011546264 22.472908
a A 0.5546875 0.55859375 415 0.00022817618 22.484894
a A 0.55859375 0.5625 110478 0.06074325 25.533464
a A 0.5625 0.56640625 14 0.00000769751 28.571012
a A 0.56640625 0.5703125 403 0.00022157832 28.582474
a A 0.5703125 0.57421875 16 0.000008797154 28.593996
a A 0.57421875 0.578125 57939 0.031856146 30.187243
a A 0.578125 0.58203125 11 0.0000060480434 31.78035
a A 0.58203125 0.5859375 397 0.00021827938 31.791569
a A 0.5859375 0.58984375 8 0.000004398577 31.802702
a A 0.58984375 0.59375 100221 0.055103723 34.55811
a A 0.59375 0.59765625 8 0.000004398577 37.313515
a A 0.59765625 0.6015625 363 0.00019958544 37.323715
a A 0.6015625 0.60546875 15 0.000008247332 37.334106
a A 0.60546875 0.609375 48833 0.026849464 38.67699
a A 0.609375 0.61328125 17 0.000009346976 40.01993
a A 0.61328125 0.6171875 328 0.00018034165 40.029415
a A 0.6171875 0.62109375 4 0.0000021992885 40.038544
a A 0.62109375 0.625 91300 0.05019876 42.54859
a A 0.625 0.62890625 8 0.000004398577 45.058746
a A 0.62890625 0.6328125 309 0.00016989504 45.067463
a A 0.6328125 0.63671875 6 0.0000032989328 45.076122
a A 0.63671875 0.640625 41197 0.022651022 46.20884
a A 0.640625 0.64453125 303 0.0001665961 47.34972
a A 0.64453125 0.6484375 8 0.000004398577 47.35827
a A 0.6484375 0.65234375 81970 0.04506892 49.611935
a A 0.65234375 0.65625 4 0.0000021992885 51.865494
a A 0.65625 0.66015625 291 0.00015999824 51.8736
a A 0.66015625 0.6640625 9 0.0000049483992 51.88185
a A 0.6640625 0.66796875 33434 0.018382752 52.801235
a A 0.66796875 0.671875 6 0.0000032989328 53.720535
a A 0.671875 0.67578125 244 0.0001341566 53.727406
a A 0.67578125 0.6796875 74149 0.04076876 55.772556
a A 0.6796875 0.68359375 6 0.0000032989328 57.811157
a A 0.68359375 0.6875 261 0.00014350358 57.818497
a A 0.6875 0.69140625 2 0.0000010996442 57.825726
a A 0.69140625 0.6953125 27288 0.015003546 58.57596
a A 0.69921875 0.703125 226 0.00012425981 59.33235
a A 0.703125 0.70703125 66609 0.0366231 61.16972
a A 0.70703125 0.7109375 9 0.0000049483992 63.00112
a A 0.7109375 0.71484375 207 0.000113813185 63.00706
a A 0.71484375 0.71875 2 0.0000010996442 63.012802
a A 0.71875 0.72265625 22132 0.012168664 63.621296
a A 0.72265625 0.7265625 183 0.00010061745 64.23476
a A 0.7265625 0.73046875 3 0.0000016494664 64.23987
a A 0.73046875 0.734375 60349 0.033181217 65.89902
a A 0.734375 0.73828125 192 0.00010556585 67.563354
a A 0.73828125 0.7421875 2 0.0000010996442 67.56868
a A 0.7421875 0.74609375 17848 0.009813226 68.0594
a A 0.74609375 0.75 188 0.00010336656 68.55523
a A 0.75 0.75390625 2 0.0000010996442 68.560455
a A 0.75390625 0.7578125 54182 0.029790463 70.05003
a A 0.7578125 0.76171875 151 0.000083023144 71.54371
a A 0.76171875 0.765625 2 0.0000010996442 71.54791
a A 0.765625 0.76953125 14025 0.0077112555 71.93353
a A 0.76953125 0.7734375 134 0.000073676165 72.32278
a A 0.7734375 0.77734375 2 0.0000010996442 72.326515
a A 0.77734375 0.78125 48652 0.026749946 73.66407
a A 0.78125 0.78515625 128 0.00007037723 75.00508
a A 0.78515625 0.7890625 11192 0.006153609 75.316284
a A 0.79296875 0.796875 121 0.000066528475 75.62729
a A 0.796875 0.80078125 44213 0.024309287 76.846085
a A 0.80078125 0.8046875 106 0.000058281144 78.06446
a A 0.8046875 0.80859375 8640 0.004750463 78.3049
a A 0.8125 0.81640625 109 0.00005993061 78.54542
a A 0.81640625 0.8203125 39701 0.02182849 79.63984
a A 0.8203125 0.82421875 84 0.000046185058 80.733574
a A 0.82421875 0.828125 6686 0.0036761109 80.919685
a A 0.828125 0.83203125 70 0.00003848755 81.105415
a A 0.83203125 0.8359375 35676 0.019615455 82.08811
a A 0.83984375 0.84375 63 0.000034638793 83.07062
a A 0.84375 0.84765625 5092 0.0027996942 83.21233
a A 0.84765625 0.8515625 58 0.000031889685 83.35391
a A 0.8515625 0.85546875 31631 0.017391425 84.22508
a A 0.85546875 0.859375 63 0.000034638793 85.09638
a A 0.859375 0.86328125 3943 0.0021679488 85.20651
a A 0.86328125 0.8671875 44 0.000024192173 85.316124
a A 0.8671875 0.87109375 28243 0.015528627 86.09376
a A 0.87109375 0.875 3023 0.0016621123 86.9533
a A 0.875 0.87890625 47 0.00002584164 87.0377
a A 0.87890625 0.8828125 25426 0.013979778 87.737976
a A 0.8828125 0.88671875 30 0.000016494663 88.43779
a A 0.88671875 0.890625 2193 0.0012057599 88.4989
a A 0.890625 0.89453125 22414 0.012323713 89.17538
a A 0.89453125 0.8984375 21 0.000011546264 89.79214
a A 0.8984375 0.90234375 1674 0.00092040224 89.83874
a A 0.90234375 0.90625 20265 0.011142146 90.441864
a A 0.90625 0.91015625 1281 0.0007043221 91.03419
a A 0.91015625 0.9140625 8 0.000004398577 91.069626
a A 0.9140625 0.91796875 17799 0.0097862845 91.55916
a A 0.91796875 0.921875 932 0.0005124342 92.0741
a A 0.921875 0.92578125 15917 0.008751519 92.53729
a A 0.92578125 0.9296875 744 0.00040906767 92.99532
a A 0.9296875 0.93359375 13912 0.0076491255 93.39823
a A 0.93359375 0.9375 531 0.00029195554 93.79529
a A 0.9375 0.94140625 12339 0.006784255 94.14909
a A 0.94140625 0.9453125 394 0.00021662992 94.49914
a A 0.9453125 0.94921875 11051 0.0060760844 94.813774
a A 0.94921875 0.953125 10258 0.0056400755 95.39958
a A 0.953125 0.95703125 245 0.00013470642 95.688324
a A 0.95703125 0.9609375 8791 0.0048334864 95.93674
a A 0.9609375 0.96484375 7658 0.0042105378 96.38894
a A 0.96484375 0.96875 6805 0.0037415395 96.78654
a A 0.96875 0.97265625 6316 0.0034726765 97.14725
a A 0.97265625 0.9765625 5580 0.0030680075 97.47428
a A 0.9765625 0.98046875 9133 0.0050215255 97.87876
a A 0.98046875 0.984375 7268 0.0039961073 98.329636
a A 0.984375 0.98828125 5844 0.0032131604 98.6901
a A 0.98828125 0.9921875 6466 0.00355515 99.02852
a A 0.9921875 0.99609375 6768 0.0037211962 99.392334
a A 0.99609375 1 7668 0.0042160363 99.7892
Besides, when I use modkit sample-probs ../../filt_bam/200nM_pA-Hia5.srt.filt.l1kb.noMP.bam -f 0.1 -p 0.1,0.2,0.3,0.4,0.5 -t 40, the cutoff around to 1. I wondered if the is the dtype issue?
> in the next version of modkit this command will be `modkit modbam sample-probs`
> sampling 10% of reads
base percentile threshold
C 10 0.9296875
C 20 0.95703125
C 30.000002 0.96875
C 40 0.97265625
C 50 0.98046875
A 10 1
A 20 1
A 30.000002 1
A 40 1
A 50 1
Thanks again!
Best, Kun
Hello @KunFang93,
Could you tell me which version of Modkit and Dorado you've basecalled/modcalled your data with? If you're using Dorado v1.0.0 you must use Modkit v0.5.0.
To your questions.
When I use modkit pileup, I noticed the estimated cut off is 0.994. I would like to ask if this proba could be calculated from probabilities.tsv and how?
You cannot easily calculate the estimated pass threshold from probabilities.tsv since these probabilities are grouped by modification state and the automatic threshold detection is performed on all modification calls combined together. If you want to find the pass threshold value that would filter out the lowest 10% of calls per modification state find the row where percentile_rank is close to 10. Based on this table the threshold value for canonical Adenine calls is somewhere between 0.94921875 and 0.99609375. This happens because the default behavior in Dorado is to clamp canonical probabilities >=95% to 100%. Using 0.94921875 is probably fine.
For 6mA calls the value would be somewhere between 0.515625 and 0.51953125 both of which are quite low. This makes me think there is very little 6mA in this sample and what you have here is the false-positive distribution.
Besides, when I use modkit sample-probs ../../filt_bam/200nM_pA-Hia5.srt.filt.l1kb.noMP.bam -f 0.1 -p 0.1,0.2,0.3,0.4,0.5 -t 40, the cutoff around to 1. I wondered if the is the dtype issue?
What's likely happening here is that all of the reads the sampling algo is getting contain entirely implicitly canonical calls where the probability is set to 100% (or 1.0). The sampling algo is slightly different when you use -f 0.1 than the default.
I think what you have here is a very rare 6mA (or m6A) modification that is likely going to be called correctly when you see it, but it will be buried in the other calls when estimating the threshold values.
One thing I'm working on right now is a new approach where pileup will drop the lowest 10% of calls per position. I think this will end up working better for your use case (unless I'm wildly off in guessing what that is).
I don't want to say "wait for a new feature before continuing your research" so I would generally recommend using an option like --filter-threshold A:0.8 or --mod-threshold a:0.85 and see if you can find positions with high percentages of a-modification.
Happy to help more.
Hi @ArtRand
Thanks so much for your reply! Yes, I am using dorado v1.0.0 and modkit v0.5.0.
For 6mA calls the value would be somewhere between 0.515625 and 0.51953125 both of which are quite low. This makes me think there is very little 6mA in this sample and what you have here is the false-positive distribution.
You’re absolutely right—I mistakenly pasted the probabilities.tsv file from the Genomic DNA sample, so a cutoff around 0.51–0.52 is expected there. However, when I check the 200nM free Hia5 sample, the 10th percentile value is only around 0.55–0.56, which still seems quite low. I would have expected a much higher modification signal given the nature of this sample.
One thing I'm working on right now is a new approach where pileup will drop the lowest 10% of calls per position. I think this will end up working better for your use case (unless I'm wildly off in guessing what that is).
That sounds like a fantastic improvement! I imagine it would work especially well with good coverage to ensure statistical power. For lower coverage data, maybe aggregating over a small window (e.g., 10 bp) could also be effective? Just a thought—I’m really excited to try this approach once it’s available.
I don't want to say "wait for a new feature before continuing your research" so I would generally recommend using an option like --filter-threshold A:0.8 or --mod-threshold a:0.85 and see if you can find positions with high percentages of a-modification.
Thanks for this suggestion! I’m currently trying to balance retaining as many 6mA calls as possible in the 200nM sample while minimizing false positives in the genomic DNA control. If I use a relatively loose threshold, I worry about increasing background noise. On the other hand, stricter thresholds reduce sensitivity in the Hia5 sample. I’m still experimenting to find the optimal cutoff. Please let me know if you have any suggestion.
Thanks again!
Best, Kun
Hello @KunFang93,
However, when I check the 200nM free Hia5 sample, the 10th percentile value is only around 0.55–0.56, which still seems quite low. I would have expected a much higher modification signal given the nature of this sample.
I don't know what kind of experiment you're doing, but what I would suggest (and what I did for the chromatin accessibility assay that we've developed) is find some regions where you know the 6mA should be. In the chromatin accessibility case, we use housekeeping gene promoters. Then run modkit localize on those regions. What do you think?
Sorry for the delayed reply—I somehow missed your message earlier. Thank you for sharing the link; it’s very informative and helpful! I’ll proceed with the sanity check as you suggested.
I have one additional (possibly naive) question regarding the threshold settings: if the value for --mod-thresholds is lower than --filter-threshold, does that mean the mod-thresholds essentially won’t affect which modification calls are made?
For example, if I set --filter-threshold A:0.8 and --mod-threshold a:0.6, and I encounter the following:
{p_A: 0.16, p_a: 0.84}, as 0.84 >= threshold_A: 0.8, so we make the call: 6mA (m)
Since p_a = 0.84 ≥ threshold_A: 0.8, this would qualify as a 6mA call. And because the filter-threshold is more stringent than the mod-threshold, the mod call would always be made as long as it passes the filter-threshold, correct?
If this is the case, in the probabilities.tsv, if the 10% percentage of a is consistently lower than that of A (say, a < 10% and A > 10%), then we don’t need to be overly concerned about these entries—is that a fair interpretation?
Thanks again for your support and guidance!
Best, Kun
Hello @KunFang93,
I've attached a build that allows "per-position" filtering. You can also build it from this branch.
To use it, add the --position-thresholds flag to your modkit pileup command. It will use the --filter-percentile option (0.1/10% by default) to calculate a local threshold value with the probabilities at each genomic position and remove calls with probabilities below this threshold. There is an additional nuance to (help) handle positions with low coverage: --num-pseudo-probs (default 5). The way it works is it calculates 5 probabilities at equally spaced quantiles from 0.1 to 0.9, then adds these probabilities to the threshold calculation at each genomic position. I don't know if this is the best way to do it - but it seems to do what I want which is make a "weighted average" of the probabilities at each position with the global threshold. You can disable this by setting --num-pseudo-probs 0, but then positions with 1 read will not be filtered. There are also additional columns with statistics about the probabilities at each position (min, max, mean, std, median) as well as the actual threshold value used.
Hi @ArtRand,
Thank you so much for your support and this sounds like a fantastic method! I’ll give it a try and let you know how it works with my dataset.
Best, Kun