dorado icon indicating copy to clipboard operation
dorado copied to clipboard

m6A and Inosine Overlap in Basecalling Results

Open Taylorain opened this issue 9 months ago • 5 comments

Dear Dorado Developers,

I am using Dorado for basecalling and detecting RNA modifications in my DRS data. Here are the commands I used:

dorado basecaller \
-x cuda:0 \
--estimate-poly-a \
-vv \
--emit-moves \
--mm2-opts "-x lr:hq -k 15 --secondary=no" \
--min-qscore 7 \
--reference sup_drs/7.drs_reference/1.3.strg_305.agat.fa \
--modified-bases-models [email protected]_inosine_m6A@v1 \
[email protected] \
/LM_0_1+2+3/pod5_pass/ \
> 1.basecall/0_A.inosine.bam

dorado basecaller \
-x cuda:0 \
--estimate-poly-a \
--verbose \
--emit-moves \
--mm2-opts "-x lr:hq -k 15 --secondary=no" \
--min-qscore 7 \
--reference sup_drs/7.drs_reference/1.3.strg_305.agat.fa \
--modified-bases-models [email protected]_m5C@v1,[email protected]_m6A_DRACH@v1,[email protected]_pseU@v1 \
[email protected] \
/LM_0_1+2+3/pod5_pass/ \
> 1.basecall.bam/0_A.bam

However, when aligning the results to the transcriptome, I noticed that the identified m6A modification sites differed between runs. I also observed an overlap between m6A and Inosine sites when extracting modifications using Modkit. Additionally, about 5% of the detected modification sites were found on bases other than A.

To refine my analysis, I removed sites where the number of modified reads was zero and extracted methylation information into two files:

$ wc -l 05.diff.txt
420561 05.diff.txt
$ wc -l 05.match.txt
7695525 05.match.txt

$ head 05.diff.txt
TCONS_00000835_419 17596 1 1 T
TCONS_00000835_663 a 1 1 T
TCONS_00000835_668 a 1 1 T
TCONS_00000835_824 17596 1 1 T
TCONS_00000835_918 a 1 1 T
TCONS_00000835_945 a 1 1 T
TCONS_00000835_974 17596 1 1 G

I have two main questions:

  1. Could the [email protected]_inosine_m6A@v1 model still be immature or require further validation?
  2. Would it be reasonable to exclude Inosine sites that overlap with m6A sites detected using the [email protected]_m6A_DRACH@v1 model?

Any insights into why Inosine and m6A sites might overlap and whether my filtering approach is appropriate would be greatly appreciated.

Thank you for your time!

Taylorain avatar Apr 02 '25 07:04 Taylorain

Hello @Taylorain,

  1. Could the [email protected]_inosine_m6A@v1 model still be immature or require further validation?

We have recently published a blog post with synthetic ground-truth strands that you can download and use to validate the accuracy of the models.

  1. Would it be reasonable to exclude Inosine sites that overlap with m6A sites detected using the [email protected]_m6A_DRACH@v1 model?

The m6A_DRACH model will never predict Inosine, obviously, so you can "ignore" the Inosine probabilities to change the [email protected]_inosine_m6A@v1 probabilities into m6A/canonical probabilities with the modkit commands like this (also in the blog):

modkit \
    adjust-mods \
    --ignore 17596 \
    1.basecall/0_A.inosine.bam \
    1.basecall/0_A.inosine_ignored.bam

You can refer to the blog to inspect the resultant 2-class performance. Whether or not to do this really depends on what you're trying to do. The --ignore option is basically stating that you have a very strong belief that the base modification being ignored does not exist in your sample.

Another option is to subset the m6A/Inosine calls to just DRACH motifs and use the filter thresholds to only keep high confidence calls such as:

modkit adjust-mods --motif DRACH 2 \
    1.basecall/0_A.inosine.bam \
    stdout \
  | modkit call-mods stdin 1.basecall/0_A.inosine_call_mods_drach2.bam --filter-threshold 0.9

ArtRand avatar Apr 02 '25 17:04 ArtRand

Sorry, my previous description might not have been very clear.

  1. When using the [email protected]_inosine_m6A@v1 model, I observed that both inosine (I) and m6A modifications were detected at the same site.
  2. I also detected these two modifications on G, C, and T bases.
  3. The m6A sites detected by the [email protected]_m6A_DRACH@v1 model were identified as inosine (I) modifications at the same positions when using the [email protected]_inosine_m6A@v1 model.

Is this result expected?

Taylorain avatar Apr 03 '25 00:04 Taylorain

@Taylorain,

  1. When using the [email protected]_inosine_m6A@v1 model, I observed that both inosine (I) and m6A modifications were detected at the same site.

There will be predictions for both inosine and m6A at the same site but they will be normalised probabilities of each mod with any remainder indicating no-mod.

  1. I also detected these two modifications on G, C, and T bases.

These will be basecalling errors where you have a B to A substitution - the mod calls are made on all motif hits (A's) so an attempt will be made at this substitution regardless.

  1. The m6A sites detected by the [email protected]_m6A_DRACH@v1 model were identified as inosine (I) modifications at the same positions when using the [email protected]_inosine_m6A@v1 model.

As above, the m6A_DRACH model will mod call at all DRACH motifs to predict the m6A mod probability at the A. The inosine_m6A model will mod call at all A's and you will see different probabilities are they're different models.

Best regards, Rich

HalfPhoton avatar Apr 03 '25 14:04 HalfPhoton

samtools view -@ 80 -q 1 -b 0_A.bam | samtools sort -@ 80 -o 0_A.sorted.bam
samtools view -@ 80 -q 1 -b 0_A.inosine.bam | samtools sort -@ 80 -o 0_A.inosine.sorted.bam

I generated two BAM files after basecalling using the scripts mentioned above. When using the [email protected]_m6A_DRACH@v1 model with the following command:

modkit pileup -t 64 --filter-threshold C:0.8378906 --filter-threshold A:0.9277344 --filter-threshold T:0.8535156 0_A.sorted.bam 01.0_A.pileup.bed --log-filepath 01.0_A.pileup.log

The results for position 2449 of TCONS_00000753 were as follows:

TCONS_00000753  2449    2450    a       380     +       2449    2450    255,0,0 380     1.32    5       375     0       22      1236    559     385  
TCONS_00000753  2449    2450    m       6       +       2449    2450    255,0,0 6       0.00    0       6       0       22      1236    1318    0  
TCONS_00000753  2449    2450    17802   3       +       2449    2450    255,0,0 3       0.00    0       3       0       22      1236    1321    0  

However, when using the [email protected]_inosine_m6A@v1 model with the following command:

modkit pileup -t 64 --filter-threshold A:0.9277344 0_A.inosine.sorted.bam 01.0_A.pileup.bed --log-filepath 01.0_A.pileup.log  

The results for the same position were:

TCONS_00000753  2449    2450    a       1644    +       2449    2450    255,0,0 1644    0.55    9       225     1410    22      356     560     0  
TCONS_00000753  2449    2450    17596   1644    +       2449    2450    255,0,0 1644    85.77   1410    225     9       22      356     560     0  

As shown above, at position 2449 of TCONS_00000753, the total number of reads differs significantly: 380 in one case and 1644 in the other. Given that both BAM files underwent the same sorting and filtering steps, I would like to understand why the read counts are different at the same position when using these two models.

Taylorain avatar Apr 03 '25 15:04 Taylorain

I have checked these two BAM files and confirmed that the read alignment information is identical.

  • Extracting basic alignment information:

    0_A.bam | awk '{print $1, $3, $4, $6}' > isoform_reads.txt
    0_A.inosine.bam | awk '{print $1, $3, $4, $6}' > inosine_reads.txt
    
  • Sorting and removing duplicates:

    sort isoform_reads.txt | uniq > isoform_reads.sorted.txt
    sort inosine_reads.txt | uniq > inosine_reads.sorted.txt
    
  • Comparing intersections and differences:

    comm -12 isoform_reads.sorted.txt inosine_reads.sorted.txt > shared_reads.txt
    comm -23 isoform_reads.sorted.txt inosine_reads.sorted.txt > only_in_isoform.txt
    comm -13 isoform_reads.sorted.txt inosine_reads.sorted.txt > only_in_inosine.txt
    

results

    $ wc -l *
      15753126 inosine_reads.sorted.txt
      15753126 inosine_reads.txt
      15753126 isoform_reads.sorted.txt
      15753126 isoform_reads.txt
             0 nohup.out
             0 only_in_inosine.txt
             0 only_in_isoform.txt
      15753126 shared_reads.txt

Taylorain avatar Apr 04 '25 05:04 Taylorain