Tandem Modifications in Dorado Predictions
Hi all,
In my Dorado results, I observed a large number of tandem m⁵C and m⁶A modifications—with only a one-base distance between them at the read level.
This finding is a bit concerning. As far as I understand, Dorado’s modification prediction model was trained on IVT data, in which, as I understand, should not contain transcripts with multiple types of modifications.
Should I trust these co-occurring modification calls, or would it be more appropriate to filter them out?
Thanks, LeeLee
Can you tell us the dorado version, model versions please?
Hi HalfPhoton,
This result is from dorado version 0.9.6 with the model “sup,inosine_m6A,m5C,pseU”, but I get similar results in the latest released “sup,m5C_2OmeC,inosine_m6A_2OmeA,pseU_2OmeU,2OmeG” model.
I would expect that the RNA models in the latest release would handle this situation a bit better if those are false positive calls. For example, the m5C model would have seen m6A next to canonical C during training. Still there might be some contexts were that's not the case.
What kind of data are you working on here?
Would you be able to run modkit motif search (docs) on this data?
I would expect that the RNA models in the latest release would handle this situation a bit better if those are false positive calls. For example, the m5C model would have seen m6A next to canonical C during training. Still there might be some contexts were that's not the case. What kind of data are you working on here? Would you be able to run
modkit motif search(docs) on this data?
These data come from mouse brain tissue, and we sequenced only one test sample for evaluation of the analysis pipeline. I'm trying modkit motif search, this function seems to be very slow and this will take some time. I didn't quite understand your comment “would have seen m6A next to canonical C during training”. I thought the training data comes from epi2me shared dataset (https://epi2me.nanoporetech.com/rna-mod-validation-data/) ,the difference between old and new version is only the addition of 2Ome dataset, is this correct?
The epi2me dataset is only used for validation. We use different datasets for model training.
There are several changes in the new version of the models. They have a new architecture and are trained on a larger set of samples. We also tried to lower the FPR, both on canonical data and in proximity to other mods.
The epi2me dataset is only used for validation. We use different datasets for model training.
There are several changes in the new version of the models. They have a new architecture and are trained on a larger set of samples. We also tried to lower the FPR, both on canonical data and in proximity to other mods.
For these training sets can I find any detailed description anywhere?
I am getting similar results in the new model, and my concern is whether we should be cautious about modification co-occurrence in the predicted results if modification co-occurrence is not present in the training set.
I'm not sure if we have a description publicly available. In any case, the training datasets don't contain these kinds of co-occurrences. So I agree with your concern and would advise caution.
I would be really useful for us to see the results of the motif search to understand how systematic this is. And also how confident are those co-occurring calls for both m6A and m5C at a given site (using modkit pileup for example).
Another possibility is that there's a different unmodeled modification that triggers both models.
Also could you show us how those tandem calls look in IGV? With both the old and the new models? That would be really helpful.
Another possibility is that there's a different unmodeled modification that triggers both models.
Yes, that's exactly what I was worried about.
I have not been able to complete the modkit motif search, but in our analysis such co-occurring sites are mostly m6A before m5C, and the motif would be as follows:
This looks like a very common m6A motif.
For those co-occurrence modification sites where m5C is in front of m6A we did not find an obvious motif.
I visualized a representative locus using IGV, and here are the modifications predicted by the [email protected] model series:
We also tested the [email protected] model series, and modification co-occurrence is also observed at this site in both the new model series, although it appears that the frequency of m5C is much lower than in the old model series.
Thanks for that! So just to be clear: the above IGV example is m6A (salmon color) before m5C (in blue)?
Yes, salmon color is m6A and blue is m5C, and here the modification probability is set at igv to be greater than 0.9 as a threshold.
Right, so it seems to me that the new model might be fixing a lot of these issues. Are there still cases where the m5C is called almost as much as m6A? Are both IGV plots using the same 0.9 threshold? Also could you generate your log-scale distance figure for the new models?
Yes both IGV maps use the same modification thresholds. With the new model we still observe a certain percentage of co-occurrence of m5C and m6A, but this density is indeed much lower compared to the old model.
Interestingly a certain percentage of co-occurrence of pseU and m6A is seen in the new model.
Thanks a lot! Nice to see that there are some significant improvements with the new models. So it does seem like it's most likely a m5C FP call next to a real m6A. We will continue to look into this and hopefully the next release will see further improvements.
I've been noticing something similar, using dorado v1.0.2 and [email protected]. In this model, we find high co-occurrence between m5C and m6A. But we don't see the same co-occurrence with Inosine/PseudoU/Nm. (used modkit pileup with default filtering parameters, and used valid coverage ≥ 10 and percent modified ≥ 25 because this optimized the percentage of m6As called in DRACH motifs)
One way that this confounds analysis is that when you plot the distribution of modifications across the gene regions, you get this pattern:
This is classic m6A distribution. But most reports, like the one here https://www.nature.com/articles/cr201755: show m5C enrichment mainly at the start codon, with little to no enrichment at the stop codon. My first idea was that if we filter out m5C sites adjacent to m6A, this enrichment would go away (as these sites would likely be FP call next to real m6A). The distribution of the m5C sites adjacent to m6A supports this:
However, when you filter out these m6A adjacent m5C calls, we still see a very prominent enrichment at the stop codon
Thus I'm worried that there is a high FP rate among the ~ 500 modifications at the stop codon. Maybe an m6A or other stop-codon enriched modification is triggering the m5C call?