dorado icon indicating copy to clipboard operation
dorado copied to clipboard

Tandem Modifications in Dorado Predictions

Open lbwfff opened this issue 7 months ago • 15 comments

Hi all,

In my Dorado results, I observed a large number of tandem m⁵C and m⁶A modifications—with only a one-base distance between them at the read level.

Image

This finding is a bit concerning. As far as I understand, Dorado’s modification prediction model was trained on IVT data, in which, as I understand, should not contain transcripts with multiple types of modifications.

Should I trust these co-occurring modification calls, or would it be more appropriate to filter them out?

Thanks, LeeLee

lbwfff avatar May 29 '25 05:05 lbwfff

Can you tell us the dorado version, model versions please?

HalfPhoton avatar May 29 '25 08:05 HalfPhoton

Hi HalfPhoton,

This result is from dorado version 0.9.6 with the model “sup,inosine_m6A,m5C,pseU”, but I get similar results in the latest released “sup,m5C_2OmeC,inosine_m6A_2OmeA,pseU_2OmeU,2OmeG” model.

lbwfff avatar May 29 '25 08:05 lbwfff

I would expect that the RNA models in the latest release would handle this situation a bit better if those are false positive calls. For example, the m5C model would have seen m6A next to canonical C during training. Still there might be some contexts were that's not the case. What kind of data are you working on here? Would you be able to run modkit motif search (docs) on this data?

budnikm avatar May 29 '25 12:05 budnikm

I would expect that the RNA models in the latest release would handle this situation a bit better if those are false positive calls. For example, the m5C model would have seen m6A next to canonical C during training. Still there might be some contexts were that's not the case. What kind of data are you working on here? Would you be able to run modkit motif search (docs) on this data?

These data come from mouse brain tissue, and we sequenced only one test sample for evaluation of the analysis pipeline. I'm trying modkit motif search, this function seems to be very slow and this will take some time. I didn't quite understand your comment “would have seen m6A next to canonical C during training”. I thought the training data comes from epi2me shared dataset (https://epi2me.nanoporetech.com/rna-mod-validation-data/) ,the difference between old and new version is only the addition of 2Ome dataset, is this correct?

lbwfff avatar May 30 '25 05:05 lbwfff

The epi2me dataset is only used for validation. We use different datasets for model training.

There are several changes in the new version of the models. They have a new architecture and are trained on a larger set of samples. We also tried to lower the FPR, both on canonical data and in proximity to other mods.

budnikm avatar May 30 '25 09:05 budnikm

The epi2me dataset is only used for validation. We use different datasets for model training.

There are several changes in the new version of the models. They have a new architecture and are trained on a larger set of samples. We also tried to lower the FPR, both on canonical data and in proximity to other mods.

For these training sets can I find any detailed description anywhere?

I am getting similar results in the new model, and my concern is whether we should be cautious about modification co-occurrence in the predicted results if modification co-occurrence is not present in the training set.

lbwfff avatar May 30 '25 11:05 lbwfff

I'm not sure if we have a description publicly available. In any case, the training datasets don't contain these kinds of co-occurrences. So I agree with your concern and would advise caution.

I would be really useful for us to see the results of the motif search to understand how systematic this is. And also how confident are those co-occurring calls for both m6A and m5C at a given site (using modkit pileup for example).

Another possibility is that there's a different unmodeled modification that triggers both models.

budnikm avatar May 30 '25 12:05 budnikm

Also could you show us how those tandem calls look in IGV? With both the old and the new models? That would be really helpful.

budnikm avatar May 30 '25 15:05 budnikm

Another possibility is that there's a different unmodeled modification that triggers both models.

Yes, that's exactly what I was worried about.

I have not been able to complete the modkit motif search, but in our analysis such co-occurring sites are mostly m6A before m5C, and the motif would be as follows:

Image

This looks like a very common m6A motif.

For those co-occurrence modification sites where m5C is in front of m6A we did not find an obvious motif.

I visualized a representative locus using IGV, and here are the modifications predicted by the [email protected] model series:

Image

We also tested the [email protected] model series, and modification co-occurrence is also observed at this site in both the new model series, although it appears that the frequency of m5C is much lower than in the old model series.

Image

lbwfff avatar Jun 02 '25 05:06 lbwfff

Thanks for that! So just to be clear: the above IGV example is m6A (salmon color) before m5C (in blue)?

budnikm avatar Jun 02 '25 08:06 budnikm

Yes, salmon color is m6A and blue is m5C, and here the modification probability is set at igv to be greater than 0.9 as a threshold.

lbwfff avatar Jun 02 '25 08:06 lbwfff

Right, so it seems to me that the new model might be fixing a lot of these issues. Are there still cases where the m5C is called almost as much as m6A? Are both IGV plots using the same 0.9 threshold? Also could you generate your log-scale distance figure for the new models?

budnikm avatar Jun 02 '25 09:06 budnikm

Yes both IGV maps use the same modification thresholds. With the new model we still observe a certain percentage of co-occurrence of m5C and m6A, but this density is indeed much lower compared to the old model.

Image

Interestingly a certain percentage of co-occurrence of pseU and m6A is seen in the new model.

lbwfff avatar Jun 04 '25 04:06 lbwfff

Thanks a lot! Nice to see that there are some significant improvements with the new models. So it does seem like it's most likely a m5C FP call next to a real m6A. We will continue to look into this and hopefully the next release will see further improvements.

budnikm avatar Jun 09 '25 08:06 budnikm

I've been noticing something similar, using dorado v1.0.2 and [email protected]. In this model, we find high co-occurrence between m5C and m6A. But we don't see the same co-occurrence with Inosine/PseudoU/Nm. (used modkit pileup with default filtering parameters, and used valid coverage ≥ 10 and percent modified ≥ 25 because this optimized the percentage of m6As called in DRACH motifs)

Image

One way that this confounds analysis is that when you plot the distribution of modifications across the gene regions, you get this pattern:

Image

This is classic m6A distribution. But most reports, like the one here https://www.nature.com/articles/cr201755: Image show m5C enrichment mainly at the start codon, with little to no enrichment at the stop codon. My first idea was that if we filter out m5C sites adjacent to m6A, this enrichment would go away (as these sites would likely be FP call next to real m6A). The distribution of the m5C sites adjacent to m6A supports this:

Image

However, when you filter out these m6A adjacent m5C calls, we still see a very prominent enrichment at the stop codon

Image

Thus I'm worried that there is a high FP rate among the ~ 500 modifications at the stop codon. Maybe an m6A or other stop-codon enriched modification is triggering the m5C call?

jpfry327 avatar Jun 26 '25 21:06 jpfry327