modkit icon indicating copy to clipboard operation
modkit copied to clipboard

modkit: filtering options are not clear

Open SuhasSrinivasan opened this issue 4 months ago • 15 comments

Dear ONT Team,

Thank you for developing modkit.

The documentation here: https://nanoporetech.github.io/modkit/advanced_usage.html and here: https://github.com/nanoporetech/modkit/blob/master/filtering.md, are not clear about the purpose for two thresholds that appear in many modkit commands

  • --filter-threshold
  • --mod-thresholds

Questions:

  1. Does automated filtering --filter-threshold by sampling reads happen first, to determine q for lowest 10th percentile of canonical and modified base calls?
  2. Is --mod-thresholds the second filter to be applied to only label modified/unmodified bases?
  3. What are suggested values for --mod-thresholds for DNA and RNA modifications respectively?
  4. How do these two filters relate to dorado's --modified-bases-threshold?

Thank you for your consideration.

SuhasSrinivasan avatar Aug 04 '25 17:08 SuhasSrinivasan

Hello @SuhasSrinivasan,

The --filter-threshold option is mostly a convienence so that if you have multiple runs or run modkit sample-probs as a preliminary step that you don't need to re-run the threshold estimation procedure. In general we recommend using the automatic threshold calculation.

To your questions:

  1. Does automated filtering --filter-threshold by sampling reads happen first, to determine q for lowest 10th percentile of canonical and modified base calls?

Correct, it uses all canonical and modified calls to calculate the 10th percentile.

  1. Is --mod-thresholds the second filter to be applied to only label modified/unmodified bases?

The --mod-thresholds only apply to the modification code specified, so --mod-thresholds m:0.9 will only apply to m (5mC) probabilities. There are some worked examples in the docs.

  1. What are suggested values for --mod-thresholds for DNA and RNA modifications respectively?

We do not have suggested values since it is difficult to make such values that apply to every experiment. We generally recommend using the default settings (automatic calculation). If you have a use case where you want to use specific thresholding, we generally recommend using modkit sample-probs --hist --out-dir $path_to_results and inspecting the probability distributions to set specific theshold values.

  1. How do these two filters relate to dorado's --modified-bases-threshold?

Dorado's --modified-bases-threshold sets the value for which canonical calls >= this value will be clamped to 100% canonical probability. This is a space saving measure so that long runs of confidently canonical calls don't take up space in the MM/ML tags. If you don't have a space consideration, and you're worried about loosing some resolution, set --modified-bases-threshold 0.0.

ArtRand avatar Aug 06 '25 17:08 ArtRand

Hi @ArtRand ,

Thank you for reviewing and the extensive details! A follow up question:

We do not have suggested values since it is difficult to make such values that apply to every experiment. We generally recommend using the default settings (automatic calculation).

Are you saying that in addition to --filter-threshold automated calculation, there is also an automated calculation for --mod-thersholds ?

SuhasSrinivasan avatar Aug 06 '25 18:08 SuhasSrinivasan

Hello @SuhasSrinivasan,

Not quite. The routine for automatic threshold calculation collects all of the call probabilities*, and determines the value of the 10th percentile. It will not attempt to calculate the 10th percentile value for each modification (required for an automatic calculation of --mod-thresholds). There are a couple reasons for this:

  1. The sampling algorithm, while not perfect, is designed to be somewhat quick. So it samples the reads (10K of them by default) as evenly as possible over the aligned reference. If the sequenced DNA/RNA, in reality, doesn't have any of a particular base modification (e.g. PCR data) or the modification is extremely rare, then it's unlikely (or impossible in the PCR case) to know that the algorithm has collected enough true positive data points to calculate the 10th percentile of that distribution. What ends up happening is the distribution you get is what we call the "false positive distribution" which is made of mostly low-confidence (likely incorrect) calls and the calculated threshold value ends up being low or close to $ \frac{1}{N_{\text{probs}} $ and you end up with a bunch of incorrect false positive calls in the downstream analysis.

There is nothing wrong with using --mod-threshold if you know that a certain modification is present in your sample and you use sample-probs or extract calls to determine that value, but Modkit can't make assumptions about what's in your data up front. If your modification is very rare, you might have to sample lots of reads to get enough true positive examples. More on that at the end.

  1. The second reason is less exciting, but there's nothing to say that your modBAM isn't composed of a few different runs' of data, so without exhaustively searching all of the reads, Modkit can't know it has seen all of the modification codes it will encounter in pileup.

* Call probabilities are collected from a sample of ~10K reads. For each position within each read with a base modification prediction, the call probability is the maximum of these predictions. In other words, the most likely base modification state.

If you do have an extremely rare modification, and you want to try and calculate a sensible threshold value for it. You can either (1) run modkit sample-probs --no-filtering or (2) try the "per-position-thresholding" described in this issue. One warning is option (1) may consume a lot of memory until I fix it. (2) is a totally different way of calculating the threshold on a "per-position" basis. I haven't decided if it's a better method or not, so it probably won't make it into the next release as it stands now (but maybe you have a good use case!).

ArtRand avatar Aug 06 '25 23:08 ArtRand

Hi @ArtRand ,

Wow, thank you for all these details! Will be very helpful. Just to confirm modkit sample-probs --no-filtering is the go to method to scan for modification probabilities?

SuhasSrinivasan avatar Aug 07 '25 15:08 SuhasSrinivasan

Hello @SuhasSrinivasan

Just to confirm modkit sample-probs --no-filtering is the go to method to scan for modification probabilities?

Correct. Although it can consume a lot of memory right now if you use a large BAM. It's designed to be used quickly and sample a subset of the reads. I would only use --no-filtering if 1) you have a very small modBAM and want to use the whole thing or 2) you have a very rare modification and you want to maximize the chances you'll collect some true positive observations.

ArtRand avatar Aug 07 '25 17:08 ArtRand

Hi @ArtRand ,

Thank you again! When you say a small modBAM, how small should it be?

SuhasSrinivasan avatar Aug 08 '25 00:08 SuhasSrinivasan

@SuhasSrinivasan,

If you have a BAM that is extracted from a region, or has <10K reads you might see that the sampling algorithm doesn't sample all of them, I would use --no-filtering in a case like this.

ArtRand avatar Aug 08 '25 13:08 ArtRand

@ArtRand ,

Thank you, I see! We have millions of reads thanks to your amazing sequencers! 😄

SuhasSrinivasan avatar Aug 08 '25 18:08 SuhasSrinivasan

Hi @ArtRand ,

After using the suggested sample-probs with 1% sampling on indexed modBAM, and then using the probabilities.tsv to find the bottom ~10% modification probability (using the 10% strategy as --fliter-threshold) for each modification type, the following Warnings are emitted. Would appreciate additional input!

> parsed user-input threshold 0.5 for mod-code m
> parsed user-input threshold 0.55 for mod-code 17802
> parsed user-input threshold 0.7 for mod-code a
> calculated chunk size: 24, interval size 100000, processing 2400000 positions concurrently
> attempting to sample 10042 reads
> Using filter threshold 0.9042969 for C.
> Using filter threshold 0.9511719 for T.
> Using filter threshold 0.9863281 for A.
> Threshold of 0.5 for mod code m is very low. Consider increasing the filter-percentile or specifying a higher threshold.
> Threshold of 0.55 for mod code 17802 is very low. Consider increasing the filter-percentile or specifying a higher threshold.
> Threshold of 0.7 for mod code a is low. Consider increasing the filter-percentile or specifying a higher threshold.

If setting a bottom 20% threshold, then this is reported:

> Threshold of 0.64 for mod code 17802 is low. Consider increasing the filter-percentile or specifying a higher threshold.
> Threshold of 0.55 for mod code m is very low. Consider increasing the filter-percentile or specifying a higher threshold.

P.S.

(1) run modkit sample-probs --no-filtering

The documentation does not contain the --no-filtering option for sample-probs.

SuhasSrinivasan avatar Aug 10 '25 06:08 SuhasSrinivasan

Hello @SuhasSrinivasan,

I apologize, I had the flag incorrect, the correct flag for sample-probs is --no-sampling. I've made some adjustments based on your other issues so that the command line options and flags are more consistent. I think that sample-probs --no-sampling still a little more "conversational", but I've made an alias so --no-filtering can be passed.

ArtRand avatar Aug 12 '25 00:08 ArtRand

Hi @ArtRand ,

Thank you for reviewing!

But after using sample-probs for 1% sampling (at least 2x greater than 10042) and then using the probabilities.tsv to find the bottom ~10% modification probability (using the 10% strategy as --fliter-threshold) for each modification type. But how and why is the following message reported, that is, is this more confident than sample-probs?

> parsed user-input threshold 0.5 for mod-code m
> parsed user-input threshold 0.55 for mod-code 17802
> parsed user-input threshold 0.7 for mod-code a
> calculated chunk size: 24, interval size 100000, processing 2400000 positions concurrently
> attempting to sample 10042 reads
> Using filter threshold 0.9042969 for C.
> Using filter threshold 0.9511719 for T.
> Using filter threshold 0.9863281 for A.
> Threshold of 0.5 for mod code m is very low. Consider increasing the filter-percentile or specifying a higher threshold.
> Threshold of 0.55 for mod code 17802 is very low. Consider increasing the filter-percentile or specifying a higher threshold.
> Threshold of 0.7 for mod code a is low. Consider increasing the filter-percentile or specifying a higher threshold.

If setting a bottom 20% threshold, then this is reported:

> Threshold of 0.64 for mod code 17802 is low. Consider increasing the filter-percentile or specifying a higher threshold.
> Threshold of 0.55 for mod code m is very low. Consider increasing the filter-percentile or specifying a higher threshold.

SuhasSrinivasan avatar Aug 12 '25 00:08 SuhasSrinivasan

Hello @SuhasSrinivasan,

These are hard-coded warnings meant to signal something may be off. I can see that it's redundant when you've set the values manually, but I think it's still worth printing the warning (maybe you can convince me otherwise). Since you already know what the setting you want is, you can ignore these warnings.

ArtRand avatar Aug 13 '25 01:08 ArtRand

Hi @ArtRand ,

Thank you for the clarification! Due to the Warning, I was suspicious if the sampling (or a subsequent internal task) during pileup is better than sample-probs.

SuhasSrinivasan avatar Aug 13 '25 01:08 SuhasSrinivasan

Hello @SuhasSrinivasan,

No problem. The sampling still happens and you'll see these log lines:

> Using filter threshold 0.9042969 for C.
> Using filter threshold 0.9511719 for T.
> Using filter threshold 0.9863281 for A.

since Modkit doesn't know if you've specified a mod-threshold for every modification it will encounter, so it calculates a fall-back. It also uses these values for unmodified calls (if you haven't specified one of those).

ArtRand avatar Aug 13 '25 01:08 ArtRand

Hi @ArtRand ,

Yes, plotted the probabilities for Canonical bases and saw the HTML plots too that contain everything. Thank you for the clarification again! But the messages suggest that probabilities are sampled only for Canonical bases and no sampling and/or filtering is done for modified bases. Why aren't there similiar messages for Modified bases when doing a default/no option run?

SuhasSrinivasan avatar Aug 13 '25 04:08 SuhasSrinivasan