modkit Questions about modkit segmentation model and parameter suggestions

Sorry to bother you again, and thank you for your previous help. Below is my current understanding of the segmentation model — I’m not sure if it is fully correct, and I would really appreciate your feedback and suggestions:

The segmentation model in modkit is a two-state Hidden Markov Model (HMM), where each sufficiently-covered modified position is assigned to one of two hidden states: "Same" (not differentially methylated) or "Different" (differentially methylated). The observation at each position is the likelihood ratio score (e-score), which reflects the strength of evidence for differential modification at that site. This score is transformed into emission probabilities for the HMM: p(Same) = e⁻ˢᶜᵒʳᵉ p(Different) = 1 - p(Same)

The model also uses transition probabilities to define how likely it is to switch between states:

--dmr-prior controls the probability of entering the "Different" state from "Same" (default: 0.1) --diff-stay sets the maximum probability of staying in the "Different" state (default: 0.9) --decay-distance determines how quickly the "stay-in-Different" probability decays with genomic distance between neighboring modified positions (default: 500 bp)

These probabilities are used by the Viterbi algorithm to compute the most likely state path across the sequence of observed positions. At each step, the algorithm selects the state (Same or Different) that maximizes the cumulative path probability up to that point.

Once all positions have been assigned a state, the model outputs segments by merging adjacent positions that share the same state label. This results in contiguous regions labeled either as "Same" or "Different" segments, depending on the inferred hidden state of their constituent positions.

That said, I’ve encountered a few observations and would like to clarify them with your help:

The manual suggests that the segmentation model runs over all sufficiently-covered modified positions. However, in the output, all segment regions (Same or Different) only include positions that were already identified as differential modified site (DMS) by dmr pair. There seem to be no high-coverage but non-significant positions in any segment. Is the segmentation model effectively restricted to DMS positions?
We also observed that the number of "Different" segments is very small, and they include very few DMS. Most of the DMS identified by dmr pair end up in "Same" segments. Is this due to the model being conservative, especially with isolated DMS lacking nearby support? About the --dmr-prior parameter: I understand it sets the prior probability of a position being "Different". While the default value (0.1) seems to reflect the rarity of functional DMRs, this may not reflect the actual frequency of differentially modified sites genome-wide. Could this low prior value be making the model overly conservative? Would a higher value (e.g. 0.2 or above) be more appropriate when sensitivity is prioritized? In our case, we plan to further prioritize DMRs by integrating them with gene expression and regulatory regions. Therefore, having slightly higher sensitivity at this stage is acceptable, even if it means a higher false positive rate. Based on your experience, do you think the following relaxed parameters would be reasonable to try?

--decay-distance 1000 --dmr-prior 0.3 --diff-stay 0.95

3.Does the segmentation model consider the direction of methylation change (e.g. hyper- vs. hypomethylation), or is it only modeling the presence of differential modification regardless of direction?

I would greatly appreciate your thoughts on these settings or any other recommendations you may have. Looking forward to your advice, and thank you again for your time and support! Best regards！

May 29 '25 04:05 hannan666666

Hi, @ArtRand sorry to bother you — I’d really appreciate your help or suggestions.

Jun 04 '25 10:06 hannan666666

@hannan666666

Sorry for the delay.

Jun 10 '25 16:06 ArtRand

@hannan666666

Sorry for the delay.

No worries at all—whenever you have time! Looking forward to your reply !

Jun 12 '25 06:06 hannan666666

Hello @hannan666666

The manual suggests that the segmentation model runs over all sufficiently-covered modified positions. However, in the output, all segment regions (Same or Different) only include positions that were already identified as differential modified site (DMS) by dmr pair. There seem to be no high-coverage but non-significant positions in any segment. Is the segmentation model effectively restricted to DMS positions?

That sounds like a bug. Are you saying that in the DMR pair output there are stretches of positions in the single-site output that are missing in the segmentation output? Do the logs say anything. If you have an example, could you show me?

We also observed that the number of "Different" segments is very small, and they include very few DMS. Most of the DMS identified by dmr pair end up in "Same" segments. Is this due to the model being conservative, especially with isolated DMS lacking nearby support?

Yes. The default parameters are probably conservative. One other thing to keep in mind is that I tested the defaults with a pair of samples that are roughly 30X coverage each. If you have less than that you'll probably want to increase the --significance-factor as mentioned in this issue.

About the --dmr-prior parameter: I understand it sets the prior probability of a position being "Different". While the default value (0.1) seems to reflect the rarity of functional DMRs, this may not reflect the actual frequency of differentially modified sites genome-wide. Could this low prior value be making the model overly conservative? Would a higher value (e.g. 0.2 or above) be more appropriate when sensitivity is prioritized?

I would increase the --significance-factor to 0.05 and experiment with a short region to see what works best.

In our case, we plan to further prioritize DMRs by integrating them with gene expression and regulatory regions. Therefore, having slightly higher sensitivity at this stage is acceptable, even if it means a higher false positive rate. Based on your experience, do you think the following relaxed parameters would be reasonable to try?

The first things I would try are

--significance-factor 0.05
--log-transition-decay

Look at a small region (I think you're already doing this) and see how the FPR/FNR look qualitatively.

Jun 14 '25 00:06 ArtRand