
Different thresholds from dorado models

Macdot3 opened this issue 1 year ago • 7 comments

Hi everyone, I ran some samples using an updated Dorado model and I'm getting different thresholds from Modkit for the same sample. In my case, I have a lower calling threshold compared to, for example, the same sample processed with an older model and version. What could be causing this? What impact might it have on my downstream analysis, and what do you recommend I do in this case? Below, I’ve included the outputs.

with dorado v0.5.3 - model [email protected]
modkit pileup --ref rCRS_16426.fasta --cpg --combine-strands ../PCR_MT10288M_final_filtered.bam ../PCR_MT10288M.bed
> calculated chunk size: 6, interval size 100000, processing 600000 positions concurrently
> filtering to only CpG motifs
> attempting to sample 10042 reads
> Using filter threshold 0.90234375 for C.
> Done, processed 858 rows. Processed ~432 reads and skipped zero reads
with dorado v0.7.1 - model [email protected]
modkit pileup --ref rCRS_16426.fasta --cpg --combine-strands ../PCR_MT10288M_final_filtered_new.bam ../PCR_MT10288M_new.bed
> calculated chunk size: 6, interval size 100000, processing 600000 positions concurrently
> filtering to only CpG motifs
> attempting to sample 10042 reads
> Threshold of 0.50390625 for base C is very low. Consider increasing the filter-percentile or specifying a higher threshold.
> Done, processed 856 rows. Processed ~434 reads and skipped zero reads
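For context on where these thresholds come from: by default, modkit estimates the pass threshold as a percentile (the 10th, per its --filter-percentile default) of the confidences of a sample of calls, so a model that emits lower-confidence canonical probabilities pulls the estimated threshold down. A rough sketch of the idea, with made-up confidence values (illustrative only, not modkit's actual implementation):

```python
# Percentile-based threshold estimation, sketched; modkit's real
# implementation samples reads from the BAM and works per canonical base.
def estimate_threshold(call_confidences, filter_percentile=0.10):
    """Return the confidence at the given percentile of the sampled
    calls; calls with confidence below it get filtered."""
    ranked = sorted(call_confidences)
    idx = int(filter_percentile * len(ranked))
    return ranked[min(idx, len(ranked) - 1)]

# Two made-up samples: a confident model vs one whose canonical
# probabilities have more mass at lower values.
confident = [0.99, 0.98, 0.97, 0.95, 0.93, 0.92, 0.91, 0.90, 0.89, 0.88]
diffuse   = [0.99, 0.95, 0.90, 0.85, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50]

print(estimate_threshold(confident))  # -> 0.89
print(estimate_threshold(diffuse))    # -> 0.55
```

The same percentile rule applied to a lower-confidence distribution yields a much lower threshold, which matches the drop from ~0.90 to ~0.50 between the two runs above.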

Thank you very much for your help.

Macdot3 avatar Jun 19 '24 14:06 Macdot3

Hello @Macdot3,

Could you do two things for me to help diagnose this issue?

  1. Tell me the exact dorado basecalling command you used so I know the basecalling model and the modified bases model?
  2. Run modkit sample-probs ../PCR_MT10288M_final_filtered.bam --hist ./probability_histograms and send me the contents of the directory?
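The histograms requested here are, in essence, counts of per-call probabilities bucketed into bins, one histogram per call code. A toy sketch of equal-width binning (illustrative only, not modkit's implementation):

```python
from collections import Counter

def histogram(probs, n_bins=10):
    """Bin probabilities in [0, 1] into n_bins equal-width buckets."""
    counts = Counter(min(int(p * n_bins), n_bins - 1) for p in probs)
    return {f"{b / n_bins:.1f}-{(b + 1) / n_bins:.1f}": counts.get(b, 0)
            for b in range(n_bins)}

# Made-up call confidences for one call code.
probs = [0.99, 0.97, 0.95, 0.91, 0.88, 0.62, 0.55]
hist = histogram(probs)
print(hist["0.9-1.0"])  # -> 4
```

Inspecting where the probability mass sits per call code is exactly what makes a shifted or diffuse distribution (and hence a low auto-threshold) easy to spot.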

Thanks

ArtRand avatar Jun 19 '24 15:06 ArtRand

Hi @ArtRand, here is the folder for you: PCR_MT10288_filter.zip. Compared to what I wrote above, the threshold is around 0.86 because I had previously forgotten to apply a filter with samtools.
Regarding the Dorado commands, for the file PCR_MT10288M_final_filtered, I have these:

cd ../dorado-0.5.3-linux-x64/bin
./dorado basecaller /Model/[email protected]/ /home/Nanopore/Dorado/POD5/POD5_barcode12_PCR/ --modified-bases 5mCG_5hmCG --device cpu > /home/CALLS/PCR_MT10288.bam

This was followed by alignment with dorado aligner. I subsequently ran the same sample with these new versions:

cd ../dorado-0.7.1-linux-x64/bin
./dorado basecaller /Model/[email protected]/ /home/Nanopore/Dorado/POD5/POD5_barcode12_PCR/ --modified-bases 5mCG_5hmCG --device cpu > /home/CALLS/PCR_MT10288_new_model.bam

Macdot3 avatar Jun 20 '24 07:06 Macdot3

We have been able to reproduce a similar result and are looking into this. The v5 5mC+5hmC model does produce lower-confidence canonical calls than the v4.3 5mC+5hmC model, but its overall accuracy is improved. We will dig into this result a bit further and aim to produce a more robust modified base model in future releases. You can use the results with confidence: the v5 model's accuracy is higher even though the canonical probabilities have dropped a bit.

One point that may help a bit is setting the threshold manually. We will be releasing a ground truth analysis blog post in the coming months with this information, but for the v5 model we find that a threshold of about 0.76 works best on a set of balanced C/5mC/5hmC calls. This will filter canonical calls a bit more heavily than the previous v4.3 model, but should produce more accurate results overall. Note that this threshold will change for different basecalling and modified base models. The blog post will outline how this threshold is determined and allow users to estimate a new threshold for new models/conditions.
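In practice the threshold can be pinned with modkit pileup's --filter-threshold option (per the modkit documentation) instead of relying on auto-estimation. To see why a fixed 0.76 cutoff filters canonical calls more heavily than the ~0.50 auto-detected value in the log above, here is a toy filter over made-up confidences:

```python
# Toy illustration of a fixed pass threshold vs the low auto-detected one;
# the confidence values below are made up.
def filter_calls(probs, threshold):
    """Split call confidences into (kept, filtered) at `threshold`."""
    kept = [p for p in probs if p >= threshold]
    filtered = [p for p in probs if p < threshold]
    return kept, filtered

calls = [0.98, 0.92, 0.85, 0.78, 0.74, 0.66, 0.58, 0.52]

kept_auto, _ = filter_calls(calls, 0.50390625)  # auto threshold from the log above
kept_fixed, _ = filter_calls(calls, 0.76)       # fixed threshold suggested here

print(len(kept_auto), len(kept_fixed))  # -> 8 4
```

The stricter cutoff drops half of these toy calls, which is the trade-off described above: heavier filtering of low-confidence canonical calls in exchange for more accurate retained calls.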

marcus1487 avatar Jun 20 '24 16:06 marcus1487

Hopefully you can still answer this: are we discussing setting --modified-bases-threshold to 0.76, or some other threshold?

DHmeduni avatar Aug 05 '24 11:08 DHmeduni

Bumping this thread as I am seeing the same problem with 4.3 vs 5.0 model in downstream modkit. This defeats the purpose of automatic threshold detection. Do you have a timeline for the new models? I can't trust summary methylation with v5.0 since threshold settings alter the results so drastically.

faulk-lab avatar Aug 25 '24 17:08 faulk-lab

Hi @marcus1487 and @ArtRand, are there any updates on the publication of the post about optimized threshold values for the v5 models? More detailed information on this would be very helpful for keeping the data accurate. Thank you very much.

Macdot3 avatar Sep 23 '24 14:09 Macdot3

Hello @faulk-lab, @Macdot3, and @DHmeduni,

Sorry for the delay following up.

We have released updates to the 5hmC/5mC models and they should have generally more confident canonical calls. We have also released data and a blog post describing how the models are tested.

@faulk-lab

Bumping this thread as I am seeing the same problem with 4.3 vs 5.0 model in downstream modkit. This defeats the purpose of automatic threshold detection.

Could you explain what you mean? If the models change in their output distributions, the automatic threshold detection will continue to achieve its task, broadly speaking. I suppose that if the emitted probability distribution for one state has more mass on lower probabilities, more of those calls will be filtered. As discussed above, you can handle such a case by setting individual thresholds, but really the model should be corrected to emit similar probability distributions for all states. Happy to continue the conversation over email or on the Modkit GH page.
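Setting individual thresholds, as mentioned, amounts to applying a different cutoff per call state rather than one global value. A toy sketch (the threshold values here are hypothetical, not recommendations):

```python
# Per-state thresholding, sketched: each call code gets its own cutoff.
def passes(call_code, prob, thresholds, default=0.75):
    """Keep a call if its confidence meets the threshold for its code;
    codes without an entry fall back to `default`."""
    return prob >= thresholds.get(call_code, default)

thresholds = {"C": 0.76, "m": 0.70, "h": 0.70}  # hypothetical values
calls = [("C", 0.80), ("C", 0.60), ("m", 0.72), ("h", 0.65)]

kept = [c for c in calls if passes(*c, thresholds)]
print(kept)  # -> [('C', 0.8), ('m', 0.72)]
```

This compensates for one state's distribution sitting lower than the others, which is exactly the situation described above for canonical calls under the v5 model.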

ArtRand avatar Mar 21 '25 16:03 ArtRand

Closing, based on Art's assertion that the new models should be improved. If this issue persists, please raise an issue over on the modkit repo.

malton-ont avatar Jul 09 '25 15:07 malton-ont