Best practices to filter high confidence m6A in dRNA experiment
First of all, thank you for this amazing tool! I am currently working on m⁶A analysis in a direct RNA sequencing (dRNA-seq) experiment.
I basecalled my data using Dorado v1.0.0 and used Modkit v5.0.0 for the subsequent analysis.
After sorting my modBAM file, I ran modkit summary and obtained the following:
$PATH_modkit./modkit summary $Run01_sup_all_mods_v1_0_0_bam
> sampling 10042 reads from BAM
> calculating threshold at 10(th) percentile
> calculated thresholds: C: 0.859375 G: 0.8808594 T: 0.9453125 A: 0.6738281
# bases C,G,T,A
# total_reads_used 10042
# count_reads_A 10012
# count_reads_G 9470
# count_reads_C 8298
# count_reads_T 8396
# pass_threshold_G 0.8808594
# pass_threshold_T 0.9453125
# pass_threshold_A 0.6738281
# pass_threshold_C 0.859375
base code pass_count pass_frac all_count all_frac
A a 74468 0.01548 147522 0.02767
I am currently only interested in m⁶A sites in all sequence contexts, and 74,468 sounds like a reasonable number. To obtain the positions of these sites, I ran:
$PATH_modkit/modkit pileup $Run01_sup_all_mods_v1_0_0_bam $output_bed --ref $ref --log-filepath $log --threads 48
However, the resulting BED file contains many more positions than the 74,468 m⁶A sites reported in the summary (counting only the lines where the 4th column is "a").
I'm a bit confused by this discrepancy.
To try to filter the data, I applied the following thresholds:
Valid coverage (12th column) ≥ 5
% methylation (11th column) ≥ 90.00
but I am not sure this is the correct or recommended way to filter the data.
Apologies for the long question, and thank you in advance for your help!
Hello @victorLLA,
The modkit summary command works on the read level, so it's counting the number of base modification calls on each RNA molecule sequenced. On the other hand, pileup aggregates the calls from multiple reads at each genomic/transcriptomic position. That being said, pileup will also report positions where there at least one read has a B>A (B={G,C,T}) mismatch, you probably don't want these records. One way to remove them is to pass --motif A 0 to the pileup command, then the records should only be at reference A positions. You still may get records for other mods, however (e.g. when there is a A>B mismatch and a modification call on the mismatched base). You can filter these records out by requiring that the 4th column is "a".
If you already have the pileup, I would make a BED file of the A-positions with modkit motif bed then use bedtools intersect to subset the pileup to just these positions. You may, of course, re-run the command with --motif A 0, your call.
Thanks for the quick response!
I decided to re-run the command using --motif A 0 and compare the output to my previous pileup, which included all modifications:
$PATH_modkit/modkit pileup $Run01_sup_all_mods_v1_0_0_bam $output_bed --ref $ref --log-filepath $log --threads 48 --motif A 0
After some quick filtering (keeping only rows where the 4th column is "a"):
> bed_filtered <- bed[grepl("a", V4)]
> count(bed_filtered)
n
<int>
1: 65156144
I obtain ~65 million m⁶A candidate positions. If I apply a filter for valid coverage (12th column ≥ 5), the number drops to 566,135.
In contrast, using the same command without --motif A 0:
$PATH_modkit/modkit pileup $Run01_sup_all_mods_v1_0_0_bam $output_bed --ref $ref --log-filepath $log --threads 48
And applying the same "a" filter in column 4:
> bed_filtered <- bed[grepl("a", V4)]
> count(bed_filtered)
n
<int>
1: 10752645
I get around 10.7 million m⁶A candidates, which drops to 68,069 after filtering for valid coverage ≥ 5. This number seems more reasonable to me.
My question is :
Why do I get more candidates (~65M) with --motif A 0 than without it (~10M)? I would expect fewer candidates when using --motif A 0, since that should remove mismatches.
Thanks again for your help!
Hello @victorLLA,
Could you tell me what the estimated pass threshold is for the two runs (w/ and w/o --motif A 0)? It should be near the top of the log (and stderr).
Hello @victorLLA,
Thanks for sharing the logs. My initial throught was that the thresholds would be different when using --motif but they are the same for all of the primary bases.
There might be a bug somewhere, would it be possible for you to share the input BAM with me? I can send you a secure link if you email me: art.rand[at]nanoporetech.com
Thanks