modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Validity of correlation between different RNA modification calls

Open mbosm opened this issue 1 year ago • 3 comments

Hello. I was reviewing some of my recent RNA004 direct RNA seq data to see what modifications I could call from them. I have six replicates here, and I called m6A, m5C, and pseudouridine for all of them. Then I ran modkit to generate a bedgraph for each, and compared them in IGV. I also remove samples from my bedgraph that have less than 10 total reads, to remove noise.

So for m6A, it would be:

dorado basecaller sup,m5C,inosine_m6A,pseU ./pod5/ -r --min-qscore 8 --device cuda:0 --reference reference.fasta | samtools sort > sample.bam modkit pileup sample.bam --reference reference.fasta -t 18 --region REGION --max-depth 1000000 --bedgraph ./ --filter-theshold 0.9 --motif DRACH 2 awk '$5>9' sample_a_DRACH2_positive.bedgraph > sample_m6A.bedgraph

and for pseudouridine, it would be:

modkit pileup sample.bam --reference reference.fasta -t 18 --region REGION --max-depth 1000000 --bedgraph ./ --filter-theshold 0.9 --motif T 0 awk '$5>9' sample_17802_T0_positive.bedgraph > sample_pseudouridine.bedgraph

This works well, but I noticed that on a couple of my DRACH motif sites, I see a pattern like this:

Screenshot from 2024-10-14 17-31-47

I've marked in black three DRACH sites at the top. This is followed by m6A calls in all 6 replicates, then m5C calls in all 6 replicates, and then pseudouridine calls in all 6 replicates.

In two of these three shown DRACH sites, the A, C, and U of the motif are being called with fairly high confidence as m6A, m5C, and pseudouridine, all in a row. This doesn't occur at every high-probability m6A DRACH site, but it occurs a few of them.

I guess what I'm wondering is if this is truly three modifications, right next to each other, or if the basecaller is detecting a current resistance difference in this area, and each of the individual modification-finding algorithms is believing that this modification is the one it is checking for, if that makes sense?

If it's the latter, is there something I can change in dorado or modkit to screen for this? I'd like to get as accurate a modification profile as possible.

mbosm avatar Oct 14 '24 21:10 mbosm

Hello @mbosm,

As you have guessed, these psiU and m5C calls are likely errors due to their proximity to the (real) m6A base in the DRACH motif upstream. We (the modified bases team) are always trying to improve the models, and one such area to improve is the susceptibility of the model to make false modification calls distal to other modifications (we call this "neighbors").

The next version of modkit will have the option to include or exclude modification calls at a per-read level based on read-sequence motifs. That way you could remove all of the pseU and m5C calls overlapping with a DRACH. I'll ping this thread when that functionality is ready.

ArtRand avatar Oct 15 '24 22:10 ArtRand

Hello, just following up on this. We wanted to be sure which modification was actually present at these triple-sites, and so did some further testing using total RNA (all modifications), drug-based modification inhibition (m6A reduced by more than half), synthesized fragments (100% m6A at nucleotides called as m6A) and sequencing of in-vitro transcribed RNA (same sequence, no modifications).

What we saw was:

  1. The m6A in these DRACH motifs are accurate. 0% and 100% m6A modifications were called as expected within 1% margin of error.
  2. The m5C in these DRACH motifs is, as speculated, an incorrect call based on the adjacency to m6A. When m6A changed, m5C changed by a similar ratio.
  3. The pseudouridine in these DRACH motifs is not based on the adjacency to m6A, but seems to still be an incorrect call. Rather than an adjacency issue, it seems to be a sequence motif issue, with incorrect pseudoU calls most often when three or more uridines are next to each other in the RNA. When m6A levels changed, the pseudouridine calls stayed the same. In the in-vitro RNA (no modifications), the pseudouridine calls were still there.

We also saw some modification calling errors within a few bases of the edge of the 5' of RNA reads, but this was only an issue with our in-vitro transcribed samples in which all the edges were the same. Simple setting change to remove edge calling fixed that.

Overall, we were impressed by the accuracy of m6A, but for our research purposes, for m5C and pseudoU, we are utilizing a no-mod sequence as a background subtraction, plus utilizing our m6A-knockdown to detect adjacency m5C issues on top of that.

If there's any pod5 data that would be useful the modified bases team for improving the algorithms, we would be happy to contribute.

mbosm avatar Feb 06 '25 19:02 mbosm

Hi, @mbosm

I also saw this issue in my data, but I did not use (in-vitro transcribed) IVT data for filtering.

For m5C modification, I only filtered the C sites located in the DRACH motifs. I want to know if the m5C mod basecalling is accurate after excluding DRACH motif? Or just like pseudouridine, it still exists in in-vitro RNA(IVT) data.

We (the modified bases team) are always trying to improve the models, and one such area to improve is the susceptibility of the model to make false modification calls distal to other modifications (we call this "neighbors").

Here, the nanoporetech has realized that the occurrence of m6A modification in eventaligns can affect the determination of adjacent ("neighbors") base modification eventaligns.

Or perhaps the reason why m6A can be trained well is because it has a DRACH motif. But other modifications do not have a fixed motifs.

it seems to be a sequence motif issue, with incorrect pseudoU calls most often when three or more uridines are next to each other in the RNA.

Just like "CCCCC" motif for m5C or "TTTTT" motif for pseU , DRACH motif didn't have "AAAAA".

So I don't think the predictive effect of inosine is very good either.

Best wishes, Kirito

kir1to455 avatar Mar 05 '25 15:03 kir1to455