dorado icon indicating copy to clipboard operation
dorado copied to clipboard

Seeing modified bases in an in-vitro transcribed RNA

Open kamaloxfordpathology opened this issue 8 months ago • 6 comments

Hi,

Hope you are well. I have an invitro transcribed RNA without any RNA modification writers in the reaction. To my surprise, I see a lot of RNA modification calls with probability of modification calls above 90 percent. For example, I have attached an IGV snapshot of m5C calls with threshold visibility set at 0.6 in the IGV. Kindly help me understand this. Thanks

Image

kamaloxfordpathology avatar Apr 20 '25 12:04 kamaloxfordpathology

Hello @kamaloxfordpathology,

Could you run modkit summary $bam with the automatic threshold detection and the --only-mapped flag and let me know what you get? You may also want to try --mod-threshold m:0.9. Feel free to move this discussion over to the Modkit repo by the way.

We've recently published a blog post that describes the performance of the RNA base modification models on synthetic RNA strands with base modifications at known positions as well as the data and methods. In figure 4a of that article, there is confusion matrix showing that the estimated false positive rate of the m5C model is 0.92%.

It's hard to "eyeball" what the FPR would be in your screen shot, but the output from Modkit should make it easier to diagnose what's going on.

ArtRand avatar Apr 22 '25 22:04 ArtRand

Hi @ArtRand ,

Thanks for looking into this issue. I ran the modkit summary as suggested. This was the output.

 bases             A,C,T
 total_reads_used  137807
 count_reads_T     137807
 count_reads_C     137807
 count_reads_A     137807
 pass_threshold_T  0.6894531
 pass_threshold_A  0.6171875
 pass_threshold_C  0.6074219
 base  code   pass_count  pass_frac    all_count  all_frac
 T     -      1524140     0.97000086   1636993    0.93775237
 T     17802  47137       0.029999167  108663     0.06224766
 A     -      1044031     0.9339467    1119095    0.901614
 A     a      57378       0.051327974  92544      0.07455932
 A     17596  16461       0.014725326  29574      0.023826692
 C     -      1489902     0.90810806   1589027    0.8747711
 C     m      150764      0.09189195   227479     0.12522887

kamaloxfordpathology avatar Apr 23 '25 07:04 kamaloxfordpathology

Hello @kamaloxfordpathology, thanks let me circulate this with my team. Could you run the same command with --mod-threshold m:0.9 at your convenience? Also could you tell me if you're using HAC or SUP basecalling/modification-detection?

ArtRand avatar Apr 29 '25 01:04 ArtRand

Hi @ArtRand ,

Thanks for the reply. I have used HAC models. I ran the modkit summary command with threshold for all modifications set at 0.7, 0.8 ,0.9 and restricted it to all reads present in the region shown in the IGV snapshot. Here are the outputs. Kindly ignore the previous modkit summary output since I did not restrict it to the in-vitro transcribed region. Thanks!

0.7 filter threshold

 bases             C,T,A
 total_reads_used  13586
 count_reads_A     13586
 count_reads_T     13586
 count_reads_C     13586
 base  code   pass_count  pass_frac    all_count  all_frac
 A     -      424458      0.94891024   476234     0.9002329
 A     a      17787       0.03976428   41277      0.078026585
 A     17596  5066        0.011325453  11501      0.021740528
 C     -      408812      0.9693759    445183     0.9339771
 C     m      12915       0.030624077  31470      0.06602287
 T     -      434138      0.97291493   458925     0.9473896
 T     17802  12086       0.027085051  25485      0.052610394

0.8 filter threshold

 bases             T,C,A
 total_reads_used  13586
 count_reads_T     13586
 count_reads_A     13586
 count_reads_C     13586
 base  code   pass_count  pass_frac    all_count  all_frac
 C     -      375306      0.98175424   445183     0.9339771
 C     m      6975        0.01824574   31470      0.06602287
 T     -      411891      0.98221064   458925     0.9473896
 T     17802  7460        0.017789394  25485      0.052610394
 A     -      388420      0.9662838    476234     0.9002329
 A     a      10293       0.025606198  41277      0.078026585
 A     17596  3260        0.008109997  11501      0.021740528

0.9 filter threshold

 bases             A,C,T
 total_reads_used  13586
 count_reads_T     13586
 count_reads_C     13586
 count_reads_A     13586
 base  code   pass_count  pass_frac     all_count  all_frac
 T     -      371637      0.9901897     458925     0.9473896
 T     17802  3682        0.009810321   25485      0.052610394
 C     -      319230      0.9910651     445183     0.9339771
 C     m      2878        0.008934892   31470      0.06602287
 A     -      325851      0.9814198     476234     0.9002329
 A     a      4515        0.013598578   41277      0.078026585
 A     17596  1654        0.0049816277  11501      0.021740528

kamaloxfordpathology avatar Apr 29 '25 07:04 kamaloxfordpathology

Hi,

Hope you are well. Is there any update to this problem. I see the presence of modified residues in all our in-vitro transcribed replicates. Hence I do not think it is an issue with the experiment.

kamaloxfordpathology avatar May 14 '25 08:05 kamaloxfordpathology

Hello,

Any news about this thread? I am experiencing the same issue when analysing RCS (RNA004 control strand).

If I am not mistaken, there should be no modifications detected since the strand is produced using an in vitro transcription system, correct?

However, I still observe a noticeable fraction of bases being flagged as modified. Could this be related to model calibration, default thresholds, or something else?

Thanks in advance for your help!

danielcav avatar Sep 02 '25 15:09 danielcav