m6A and Inosine Overlap in Basecalling Results
Dear Dorado Developers,
I am using Dorado for basecalling and detecting RNA modifications in my DRS data. Here are the commands I used:
dorado basecaller \
-x cuda:0 \
--estimate-poly-a \
-vv \
--emit-moves \
--mm2-opts "-x lr:hq -k 15 --secondary=no" \
--min-qscore 7 \
--reference sup_drs/7.drs_reference/1.3.strg_305.agat.fa \
--modified-bases-models [email protected]_inosine_m6A@v1 \
[email protected] \
/LM_0_1+2+3/pod5_pass/ \
> 1.basecall/0_A.inosine.bam
dorado basecaller \
-x cuda:0 \
--estimate-poly-a \
--verbose \
--emit-moves \
--mm2-opts "-x lr:hq -k 15 --secondary=no" \
--min-qscore 7 \
--reference sup_drs/7.drs_reference/1.3.strg_305.agat.fa \
--modified-bases-models [email protected]_m5C@v1,[email protected]_m6A_DRACH@v1,[email protected]_pseU@v1 \
[email protected] \
/LM_0_1+2+3/pod5_pass/ \
> 1.basecall.bam/0_A.bam
However, when aligning the results to the transcriptome, I noticed that the identified m6A modification sites differed between runs. I also observed an overlap between m6A and Inosine sites when extracting modifications using Modkit. Additionally, about 5% of the detected modification sites were found on bases other than A.
To refine my analysis, I removed sites where the number of modified reads was zero and extracted methylation information into two files:
$ wc -l 05.diff.txt
420561 05.diff.txt
$ wc -l 05.match.txt
7695525 05.match.txt
$ head 05.diff.txt
TCONS_00000835_419 17596 1 1 T
TCONS_00000835_663 a 1 1 T
TCONS_00000835_668 a 1 1 T
TCONS_00000835_824 17596 1 1 T
TCONS_00000835_918 a 1 1 T
TCONS_00000835_945 a 1 1 T
TCONS_00000835_974 17596 1 1 G
I have two main questions:
- Could the
[email protected]_inosine_m6A@v1model still be immature or require further validation? - Would it be reasonable to exclude Inosine sites that overlap with m6A sites detected using the
[email protected]_m6A_DRACH@v1model?
Any insights into why Inosine and m6A sites might overlap and whether my filtering approach is appropriate would be greatly appreciated.
Thank you for your time!
Hello @Taylorain,
- Could the [email protected]_inosine_m6A@v1 model still be immature or require further validation?
We have recently published a blog post with synthetic ground-truth strands that you can download and use to validate the accuracy of the models.
- Would it be reasonable to exclude Inosine sites that overlap with m6A sites detected using the [email protected]_m6A_DRACH@v1 model?
The m6A_DRACH model will never predict Inosine, obviously, so you can "ignore" the Inosine probabilities to change the [email protected]_inosine_m6A@v1 probabilities into m6A/canonical probabilities with the modkit commands like this (also in the blog):
modkit \
adjust-mods \
--ignore 17596 \
1.basecall/0_A.inosine.bam \
1.basecall/0_A.inosine_ignored.bam
You can refer to the blog to inspect the resultant 2-class performance. Whether or not to do this really depends on what you're trying to do. The --ignore option is basically stating that you have a very strong belief that the base modification being ignored does not exist in your sample.
Another option is to subset the m6A/Inosine calls to just DRACH motifs and use the filter thresholds to only keep high confidence calls such as:
modkit adjust-mods --motif DRACH 2 \
1.basecall/0_A.inosine.bam \
stdout \
| modkit call-mods stdin 1.basecall/0_A.inosine_call_mods_drach2.bam --filter-threshold 0.9
Sorry, my previous description might not have been very clear.
- When using the
[email protected]_inosine_m6A@v1model, I observed that both inosine (I) and m6A modifications were detected at the same site. - I also detected these two modifications on G, C, and T bases.
- The m6A sites detected by the
[email protected]_m6A_DRACH@v1model were identified as inosine (I) modifications at the same positions when using the[email protected]_inosine_m6A@v1model.
Is this result expected?
@Taylorain,
- When using the [email protected]_inosine_m6A@v1 model, I observed that both inosine (I) and m6A modifications were detected at the same site.
There will be predictions for both inosine and m6A at the same site but they will be normalised probabilities of each mod with any remainder indicating no-mod.
- I also detected these two modifications on G, C, and T bases.
These will be basecalling errors where you have a B to A substitution - the mod calls are made on all motif hits (A's) so an attempt will be made at this substitution regardless.
- The m6A sites detected by the [email protected]_m6A_DRACH@v1 model were identified as inosine (I) modifications at the same positions when using the [email protected]_inosine_m6A@v1 model.
As above, the m6A_DRACH model will mod call at all DRACH motifs to predict the m6A mod probability at the A. The inosine_m6A model will mod call at all A's and you will see different probabilities are they're different models.
Best regards, Rich
samtools view -@ 80 -q 1 -b 0_A.bam | samtools sort -@ 80 -o 0_A.sorted.bam
samtools view -@ 80 -q 1 -b 0_A.inosine.bam | samtools sort -@ 80 -o 0_A.inosine.sorted.bam
I generated two BAM files after basecalling using the scripts mentioned above.
When using the [email protected]_m6A_DRACH@v1 model with the following command:
modkit pileup -t 64 --filter-threshold C:0.8378906 --filter-threshold A:0.9277344 --filter-threshold T:0.8535156 0_A.sorted.bam 01.0_A.pileup.bed --log-filepath 01.0_A.pileup.log
The results for position 2449 of TCONS_00000753 were as follows:
TCONS_00000753 2449 2450 a 380 + 2449 2450 255,0,0 380 1.32 5 375 0 22 1236 559 385
TCONS_00000753 2449 2450 m 6 + 2449 2450 255,0,0 6 0.00 0 6 0 22 1236 1318 0
TCONS_00000753 2449 2450 17802 3 + 2449 2450 255,0,0 3 0.00 0 3 0 22 1236 1321 0
However, when using the [email protected]_inosine_m6A@v1 model with the following command:
modkit pileup -t 64 --filter-threshold A:0.9277344 0_A.inosine.sorted.bam 01.0_A.pileup.bed --log-filepath 01.0_A.pileup.log
The results for the same position were:
TCONS_00000753 2449 2450 a 1644 + 2449 2450 255,0,0 1644 0.55 9 225 1410 22 356 560 0
TCONS_00000753 2449 2450 17596 1644 + 2449 2450 255,0,0 1644 85.77 1410 225 9 22 356 560 0
As shown above, at position 2449 of TCONS_00000753, the total number of reads differs significantly: 380 in one case and 1644 in the other. Given that both BAM files underwent the same sorting and filtering steps, I would like to understand why the read counts are different at the same position when using these two models.
I have checked these two BAM files and confirmed that the read alignment information is identical.
-
Extracting basic alignment information:
0_A.bam | awk '{print $1, $3, $4, $6}' > isoform_reads.txt 0_A.inosine.bam | awk '{print $1, $3, $4, $6}' > inosine_reads.txt -
Sorting and removing duplicates:
sort isoform_reads.txt | uniq > isoform_reads.sorted.txt sort inosine_reads.txt | uniq > inosine_reads.sorted.txt -
Comparing intersections and differences:
comm -12 isoform_reads.sorted.txt inosine_reads.sorted.txt > shared_reads.txt comm -23 isoform_reads.sorted.txt inosine_reads.sorted.txt > only_in_isoform.txt comm -13 isoform_reads.sorted.txt inosine_reads.sorted.txt > only_in_inosine.txt
results
$ wc -l *
15753126 inosine_reads.sorted.txt
15753126 inosine_reads.txt
15753126 isoform_reads.sorted.txt
15753126 isoform_reads.txt
0 nohup.out
0 only_in_inosine.txt
0 only_in_isoform.txt
15753126 shared_reads.txt