[src/command_utils.rs::123][2025-06-25 22:29:14][DEBUG] estimated pass threshold 1 for primary sequence base C
I'm seeing something strange where for some of my samples the threshold is something crazy high like 1, instead of something more sensible.
Is there anything I can do to debug what's going on here? modkit is v0.4.4 and the command is something like /mnt/ix1/Resources/tools/modkit/v0.4.4/modkit pileup --threads 10 --only-tabs --ref reference.fa --cpg --log-filepath modkit.log sorted.bam modkit.bed
I've noticed that the ones that were called with dorado 0.9.0+9dc15a8 do not do this, but ones called with 1.0.0+4a76f8aa have the weird thresholding.
Sorry for the spam -- I'm now using modkit v0.5 and it looks to be a bit better, but the threshold is very different than the samples called with dorado v0.9. It was ~0.75 for the threshold, and now it's 0.994. Is this an expected result due to new scaling from a new methylation model from dorado v1 to v0.9, or is this still a bug?
You guys are gonna kill me for the comment spam... I looked into the corresponding "extract calls" command for v0.4.4 and it looks like the thresholds are completely different.
Using modkit v0.4.4 pileup: [src/command_utils.rs::123][2025-06-25 21:44:44][DEBUG] estimated pass threshold 1 for primary sequence base C [src/pileup/subcommand.rs::654][2025-06-25 21:44:44][INFO] Using filter threshold 1 for C.
Using modkit v0.4.4 extract calls: [src/command_utils.rs::123][2025-07-17 15:50:01][DEBUG] estimated pass threshold 0.6894531 for primary sequence base C
Using modkit v0.5 pileup: [modkit-core/src/thresholds.rs::176][2025-08-27 12:46:36][DEBUG] estimated threshold too high 1, using 0.9941406 [modkit-core/src/command_utils.rs::123][2025-08-27 12:46:36][DEBUG] estimated pass threshold 0.9941406 for primary sequence base C [modkit-core/src/pileup/subcommand.rs::655][2025-08-27 12:46:36][INFO] Using filter threshold 0.9941406 for C.
Hello @billytcl,
If you're using Dorado >=v1.0.0 you need to use Modkit v0.5.0. But, sadly, it gets a little worse. It looks like you're using the all-context cytosine modification model (correct?). There was a bug in the modified base models included with the v1.0.0 dorado release. Specifically this model: [email protected]_5mC_5hmC@v1 didn't perform as well as we like. I highly recommend moving to dorado v1.1.1 and using [email protected]_5mC_5hmC@v2 (same for HAC!).
Secondly, there is a bug in Modkit v0.5.0 that will use a threshold of 0.9941406 when it should be 0.94921875. There is a bug-fix build here. Alternatively, if you use the option --modified-bases-threshold 0.0 with dorado basecaller then Modkit v0.5 (or 0.4.4) will work fine.
Why does this happen? The latest version of models are more confident in their canonical calls. The default behavior of dorado is to elide base modification probabilities <= 5%, these are interpreted by modkit as 100% confident canonical calls. Since there are considerably more unmodified cytosines in human samples, what happens is all of the true 5mC probabilities end up below the 10th percentile.
I apologize for the patchwork of issues. I'm working on getting a new release of Modkit out right now, should be done soon.
Interesting... do you know why the extract calls seem to give sensible thresholds?
I'm going to be restricting to cpg at the moment anyway. Would that affect my results?
Hello @billytcl,
modkit extract calls --cpg will only use CG sites when calculating the threshold value, pileup aggregates probabilities from the reads without filtering to aligned sequence context or motif. I noticed this difference whilst debugging the v1 model, but I believe it's been this way since v0.4.0, it's just now become noticeable because the distribution of probabilities for unmodified calls is quite different than that of modified ones in the most recent models. I may add a toggle in the next version that restricts the probabilities to ones mapped to the motif in pileup, but I need to do more investigation.
If you only care about reference CG dinucleotides, but want to retain calls where the basecall sequence may be CH aligned to CG, I would:
modkit motif bed $ref CG 0 > ref_cg0.bedmodkit pileup $modbam $output --include-bed ref_cg0.bed [--combine-strands --ref $ref]
FWIW, if you don't want to keep calls where the basecall sequence is CH<aligned_to>CG, you can use the CG modified base models from v1.0.0 and previous Modkit versions. I tested these against each other and orthogonal methods and they work fine.
Great! That would help quite a bit in the meantime. Re-basecalling it all to dorado v1.1.1 would be a huge pain in the near term. It's too bad there's no way to just re-call the methylated bases without going back to the pod5s...
It's too bad there's no way to just re-call the methylated bases without going back to the pod5s...
You can do this with remora (as you probably know), but it's slower than re-basecalling (as you probably also know). TBH, what I do is use HAC for nearly everything, it's quite a bit faster than SUP at this point, and for mods it's very similar. That being said, we have some of the fancier hardware where you can really see the speedups, so ymmv.
Hello @billytcl,
modkit extract calls --cpgwill only use CG sites when calculating the threshold value,pileupaggregates probabilities from the reads without filtering to aligned sequence context or motif. I noticed this difference whilst debugging the v1 model, but I believe it's been this way since v0.4.0, it's just now become noticeable because the distribution of probabilities for unmodified calls is quite different than that of modified ones in the most recent models. I may add a toggle in the next version that restricts the probabilities to ones mapped to the motif inpileup, but I need to do more investigation.If you only care about reference CG dinucleotides, but want to retain calls where the basecall sequence may be CH aligned to CG, I would:
- `modkit motif bed $ref CG 0 > ref_cg0.bed
- `modkit pileup $modbam $output --include-bed ref_cg0.bed [--combine-strands --ref $ref]
FWIW, if you don't want to keep calls where the basecall sequence is CH<aligned_to>CG, you can use the CG modified base models from v1.0.0 and previous Modkit versions. I tested these against each other and orthogonal methods and they work fine.
Good call on using include-bed! I just tried this out and got a somewhat different estimation:
> Threshold of 0.69921875 for base C is low. Consider increasing the filter-percentile or specifying a higher threshold.
I'm guessing it shouldn't be as bad as the threshold being ~1, but for consistency I should probably call even the v0.9 dorado runs with include-bed as well?
Hello @billytcl,
I'm guessing it shouldn't be as bad as the threshold being ~1, but for consistency I should probably call even the v0.9 dorado runs with include-bed as well?
Yes threshold values of 1 will not work since the base modification model will never output this probability the TPR_modification will be 0.0. I've done the comparison using CpG in the --include-bed with without and the %-5mC on chr20 ends up being very nearly identical.
In my run the --cpg threshold was estimated to be 0.9501953 and the --include-bed run estimated 0.8798828. I used the "control" "Native Sample" Pod5s from here.