Understanding filtering, Nmod and Ncanonical
Hello,
I have a question about filtering the 10% lowest confidence modification calls, and how Nmod and Ncanonical values are obtained.
If modkit (by default) first filters out the 10% lowest confidence calls:
Filter out modified base calls where the probability of the predicted variant is below this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence modification calls
and
Nmod - Number of calls passing filters that were classified as a residue with a specified base modification.
Ncanonical - Number of calls passing filters that were classified as the canonical base rather than modified. The exact base must be inferred by the modification code. For example, if the modification code is m (5mC) then the canonical base is cytosine. If the modification code is a, the canonical base is adenosine.
- Is the filtering done based on the same modification probability value which is used to differentiate between modified and unmodified bases?
- If 10% lowest confidence modification calls are filtered out, then the remaining 90% represent the Nvalid_cov? What's confusing here is that while the bottom 10% is filtered out, there will still be unmodified sites (so with very low modification probability or = 0) which will pass the filter and will end up in Ncanonical?
Thanks in advance.
Hello @imilenkovic,
Sorry for the major delay in getting back to you.
> Is the filtering done based on the same modification probability value which is used to differentiate between modified and unmodified bases?
First, the probability of each state is calculated. If there are 2 modification states (say 5mC and 5hmC), the canonical probability is $1 - P_{\text{5hmC}} - P_{\text{5mC}}$, so you have 3 probabilities. The call for that base on the read is the state with the highest probability, and the call passes the filter only if that probability is $\ge$ the threshold. So yes, the value used for filtering is the probability of the called state itself. There are some worked examples in the documentation.
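As a rough illustration of the rule described above, here is a minimal Python sketch. It is not modkit's actual code: the function name `call_base`, its arguments, and the threshold value of 0.7 are all made up for the example.

```python
def call_base(p_5mc, p_5hmc, threshold):
    """Pick the most likely state and check it against the threshold.

    Hypothetical helper for illustration only, not part of modkit.
    Returns (state, confidence); state is None when the call is filtered out.
    """
    probs = {
        "5mC": p_5mc,
        "5hmC": p_5hmc,
        # The canonical probability is whatever is left over.
        "canonical": 1.0 - p_5mc - p_5hmc,
    }
    # The call is the state with the highest probability.
    state = max(probs, key=probs.get)
    confidence = probs[state]
    # The same probability is what gets compared to the threshold.
    if confidence >= threshold:
        return state, confidence
    return None, confidence


# Threshold of 0.7 is an arbitrary value chosen for the example.
print(call_base(0.90, 0.05, 0.7))  # confident 5mC call, passes
print(call_base(0.00, 0.00, 0.7))  # confident canonical call, passes
print(call_base(0.50, 0.30, 0.7))  # best state is 5mC at 0.5, filtered
```

The point of the sketch is that a canonical call is scored the same way as a modified call: it passes or fails on its own probability.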
> If 10% lowest confidence modification calls are filtered out, then the remaining 90% represent the Nvalid_cov? What's confusing here is that while the bottom 10% is filtered out, there will still be unmodified sites (so with very low modification probability or = 0) which will pass the filter and will end up in Ncanonical?
A canonical (or unmodified) base is treated the same as a base modification. So if you have 2 modification states plus unmodified, you have 3 possibilities. The lowest possible confidence for a call is then about 0.33, because the highest of three probabilities that sum to 1 cannot be below 1/3. If both modified states have zero probability, then the unmodified probability is 100%, the maximum confidence, so that call passes the filter and is counted in Ncanonical.
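To make the percentile filtering concrete, here is a small Python sketch with invented data. It is a simplification under stated assumptions, not modkit's implementation: the helper names `percentile_threshold` and `count_calls` are hypothetical, the threshold is taken with a plain nearest-rank rule over ten made-up calls, and the way modkit samples reads and estimates the threshold may differ in detail.

```python
def percentile_threshold(confidences, filter_percentile=0.1):
    """Confidence value at the given percentile (simple nearest-rank rule).

    Hypothetical helper for illustration only.
    """
    ordered = sorted(confidences)
    idx = min(int(filter_percentile * len(ordered)), len(ordered) - 1)
    return ordered[idx]


def count_calls(calls, threshold):
    """Count passing calls per state; calls is a list of (state, confidence)."""
    counts = {"5mC": 0, "5hmC": 0, "canonical": 0, "filtered": 0}
    for state, confidence in calls:
        if confidence >= threshold:
            counts[state] += 1
        else:
            counts["filtered"] += 1
    return counts


# Invented example: each entry is (called state, probability of that state).
calls = [
    ("canonical", 1.00), ("canonical", 0.98), ("canonical", 0.95),
    ("5mC", 0.97), ("5mC", 0.90), ("5mC", 0.85), ("5hmC", 0.80),
    ("canonical", 0.75), ("5mC", 0.60), ("5hmC", 0.40),
]

threshold = percentile_threshold([c for _, c in calls], 0.1)
counts = count_calls(calls, threshold)
print(threshold)  # 0.6
print(counts)     # {'5mC': 4, '5hmC': 1, 'canonical': 4, 'filtered': 1}

# For a 5mC row in this toy example:
n_mod = counts["5mC"]              # 4
n_other_mod = counts["5hmC"]       # 1
n_canonical = counts["canonical"]  # 4
n_valid_cov = n_mod + n_other_mod + n_canonical  # 9
```

In this toy data the one call that is dropped is the 5hmC call at 0.40, while the canonical call with modification probability 0 (confidence 1.00) is kept. What is removed is the least certain calls of any kind, not the calls with the lowest modification probability.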
Does this make sense? Happy to clarify further if you have an example you'd like worked out.