Unexcepted output with modkit adjust-mods --convert h m
This is what I have for one read in the original BAM:
MM:Z:C+h?,3,0,0,2,1,0,9,4,4,0,2,2,1,2,3,5,0,0;
C+m?,3,0,0,2,1,0,9,4,4,0,2,2,1,2,3,5,0,0;
ML:B:C, 9,15,14,10,11,15, 6, 6, 7,11,16,10,10,11,11,18,16,18,
26,35,43,56,65,35,27,51,46,42,51,47,55,35,36,68,70,48
This is what I get after
MM:Z:C+m?,3,0,0,2,1,0,9,4,4,0,2,2,1,2,3,5,0,0;
ML:B:C,36,51,58,67,77,51,34,58,54,54,68,58,66,47,48,87,87,67
But I was expecting 35 50 57 66 76 50 33 57 53 53 67 57 65 46 47 86 86 66 instead.
I am using modkit v0.5.1-rc1.
Hello @privefl2,
Sorry about the delayed response to your question.
The values in the ML tag are stored in 8-bit bins. A value of 0 captures probabilities in the half-open interval $[\frac{0}{256}, \frac{1}{256})$ then a value of 1 captures the probabilities in the half-open interval $[\frac{1}{256}, \frac{2}{256})$, all the way up until the final interval $[\frac{255}{256}, 1]$ represented by 255. The routine in Modkit when using adjust-mods --convert h m is to first transform the ML values into probabilities by centering the probability within the 8-bit bin $p = \frac{q + 0.5}{256}$ (the same transformation that is always performed) then sum the probabilities, and finally convert them back into 8-bit binned values. This also partially explains why in modkit extract the modification probability values are never exactly zero or one. Based on this, the converted probabilities for the ML tags you've provided are:
5mC_probs = [
0.103515625,
0.13867188,
0.16992188,
0.22070313,
0.25585938,
0.13867188,
0.107421875,
0.20117188,
0.18164063,
0.16601563,
0.20117188,
0.18554688,
0.21679688,
0.13867188,
0.14257813,
0.26757813,
0.27539063,
0.18945313,
]
5hmC_probs = [
0.037109375,
0.060546875,
0.056640625,
0.041015625,
0.044921875,
0.060546875,
0.025390625,
0.025390625,
0.029296875,
0.044921875,
0.064453125,
0.041015625,
0.041015625,
0.044921875,
0.044921875,
0.072265625,
0.064453125,
0.072265625,
]
summed = [
0.140625,
0.19921875,
0.2265625,
0.26171875,
0.30078125,
0.19921875,
0.1328125,
0.2265625,
0.2109375,
0.2109375,
0.265625,
0.2265625,
0.2578125,
0.18359375,
0.1875,
0.33984375,
0.33984375,
0.26171875,
]
Finally when converted back, you get the values you've observed. Let me know if this isn't clear.
Thank you for the detailed explanation. I was wrongly assuming the intervals to be centered aroung q/255 instead of (q+0.5) / 256.
So, when we sum p1 + p2 = (q1+0.5 + q2+0.5) / 256, then it becomes the lower limit of interval q1+q2+1, I understand. Then, when it is decoded again, we obtain (q1+q2+1+0.5)/256, which is upward-biased by 0.5/256, right?
Hello @privefl2,
Yes you are correct and I think the way you have the summation is better. I'll update this soon. Thanks for noticing this!
But, how would you be able to update this? Either you use q1+q2+1 and you are upward-biased, or use q1+q2 and are downward-biased, right?
@privefl2 I need to do some more in-depth analysis of the implications, but I think summing the probabilities in qual-space then when converting to a probability return the probability in the center of the bin.
If you use q1+q2, then the probability is downward-biased by 0.5/256. Maybe the best solution is to use q1+q2 when <= 127 and q1+q2+1 when >127, but maybe it is too much of a hassle for not much.