modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Setting up a suitable value for --filter-threshold or --percentile for modkit pileup command

Open gsukrit opened this issue 4 months ago • 5 comments

Hi Team,

I basecalled my data using Dorado in Super accuracy mode for 5mC and 5hmC in CG and 6mA all context and aligned with the reference genome during basecalling. I am using modkit to convert modification info after basecalling into modkit pileup output. But some of my samples are sequenced on R9 and the remaining on R10 chemistry. Due to this, when i do --sample-probs on each of my samples, the 10th percentile value on R9 samples I noticed is ~0.62 while that for R10 sample is ~0.85. Due to this default threshold, i am getting huge variation in proportion of methylated base variants across all samples. Samples in Row A4:A13 behave similarly - sequenced on R9 and those from row A14:A19 behave similarly are sequenced on R10 chemistry. I tried different thresholds - 0.75, 0.82, 0.85 but the variation remains fairly same. What is a single best way to proceed in this case that can address the difference due to chemistries (is that possible to normalize this effect?) without includingl false positive modification calls in data. I am attaching below the proportion of reads for each base variant i calculated from the pileup output using different thresholds. Looking forward to your recommendation and advise. Thanks!!

Image

gsukrit avatar Aug 25 '25 11:08 gsukrit

So When I ran the pileup command with --filter-percentile 0.1, letting modkit decide it;s own default <10% confidence calls. I have summarized the default thresholds it has chosen for each sample:

Image

As we can see, the default for A is 1, which means almost all As will be called as canonical and I will eliminate the legit 6mA calls due to modkit;s threshold probability calculation. What is advised in this case ? Eagerly looking forward to your response and advise, @ArtRand
Thank you, Sukriti

gsukrit avatar Aug 28 '25 15:08 gsukrit

Hello @gsukrit,

Sorry for the delay in getting back to you.

Could you tell me which version of Dorado and the base modification models you're using (for R9 and R10)? I suspect that since you're getting a threshold value of 1.0 for 6mA that you have Dorado >=1.0. Dorado >=1.0 requires that you use Modkit 0.5.0 (in fact using the build here is the best until I get a patch release out). You can get a threshold value of 1.0 with the v5.2.0 models. This can happen because the latest models are more confident in their canonical calls. The default behavior of dorado is to elide base modification probabilities <= 5%, these are interpreted by Modkit as 100% confident canonical calls. If there are considerably more unmodified adenines in the sample all of the true 6mA probabilities end up below the 10th percentile.

As far as the cytosine calls, could you tell me what happens when you run modkit modbam check-tags --head ${r9_modbam}? Thanks.

ArtRand avatar Aug 29 '25 22:08 ArtRand

Hi,

Thank you so much for responding. So I performed basecalling of both FAST5 (R9) and POD5 (R10 samples) using MinKnow software v24.11.10 (It bundles with dorado and calls it internally), keeping all the basecalling parameters same. Basecalling model used for R9 was SUP mode for 5mC and 5hmC in CG and R10 it was SUP mode for 5mC and 5hmC in CG and 6mA in all contexts. In the software settings panel, the following versions get printed: Dorado - 7.6.8 | Bream- 8.2.5 | Script Configuration- 6.2.12

Image

The modkit version used is v0.4.2 As for the cytosine command you have recommended, I ran it on the upgraded version of modkit using one of the R9 bam files and have got this output:

Image

Hope this leads to some answers. Awaiting your response.

gsukrit avatar Aug 30 '25 13:08 gsukrit

@ArtRand Hi, I was hoping to get some follow-up on what's the best way to proceed here.. Looking forward to your reply,

Thank you!

gsukrit avatar Nov 15 '25 09:11 gsukrit

Hello @gsukrit,

Ok. I was initially concerned that potentially the R9 modification calls had an older version of the modification tags that can be misinterpreted - but what you have here looks fine.

As far as comparing the CpG calls between R9 and R10, unfortunately I don't have a routine procedure for you to follow. What you're doing, using the 10th percentile threshold - estimated for each sample, is the most robust thing to do. I would maybe consider the biological question you're asking. If, for example, you're looking for DMRs or haplotype-specific methylation both chemistries should perform well.

ArtRand avatar Nov 20 '25 04:11 ArtRand

Dear @ArtRand Thank you for taking a look. But isn't it weird that due to modkit's estimated threshold of 1, the final modification in A base is absolutely negligible in all samples. Plus, the C base modification of R9 and R10 vary drastically (as per the table observation shared on Aug 25, first comment of this thread). Could you please guide us as to what's best to do here for analyzing both the datasets. The biological question is to profile genome-wide methylation of both C and A bases and corelate it with transcript abundance; profile DMRs across the samples or identify methylation enriched regions across sample types.

gsukrit avatar Nov 29 '25 09:11 gsukrit

Hello @gsukrit,

Sorry I forgot to mention this above. There is a setting in Dorado where if the probability of a base being unmodified is $\ge$ this number the probability is omitted. Modkit interprets these as 100% confident unmodified positions. It is possible that when constructing the estimated distribution of probabilities, the 10-th percentile falls on this value (1.0). Modkit >v.0.5.1 has a fix so that when this happens, the probability threshold is set to the highest observed probability value, which usually ends up being the Dorado threshold.

tl;dr is try using Modkit v0.5.1 or better yet v0.6.0 and see if the A-modification probability is just slightly lower than 1.0. You could still have very low levels of 6mA in your sample, but the high confidence calls will be kept.

ArtRand avatar Dec 02 '25 00:12 ArtRand

@ArtRand Thank you for your response. Dorado was run with default parameters. If the dorado modification threshold you are referring to is this: --modified-bases-threshold The minimum predicted methylation probability for a modified base to be emitted in an all-context model, [0, 1]. [nargs=0..1] [default: 0.05]

The default value here is 0.05, which would anyway be filtered out since modkit applies its own threshold, which in my case is 0.6 for R9 (C base), 0.85 for R10 samples (C base) and 1 for R10 samples (A base).

Kindly let me know if I this setting is acceptable or I need to re-run dorado with --modified-bases-threshold value explicitly set to 0.

gsukrit avatar Dec 05 '25 08:12 gsukrit