Pileup - analyzing several mods and several motifs
Hi,
Thanks for all the support with Modkit. I am using Dorado v1.0.0 and Modkit v0.5.0. Based on other issues I have looked at, I have started adding the --motif argument to my pileup commands to avoid looking at reads that have "modified" mismatches. I have 3 questions:
When analyzing all RNA modifications, I have been adding all motif arguments using the following approach:
dist_modkit_v0.5.0_5120ef7/modkit pileup -t 32 \
--mod-threshold a:0.99 --mod-threshold 17802:0.99 --mod-threshold 17596:0.99 --mod-threshold 69426:0.99 \
--mod-threshold m:0.99 --mod-threshold 19229:0.99 --mod-threshold 19227:0.99 --mod-threshold 19228:0.99 \
--motif A 0 --motif T 0 --motif C 0 --motif G 0 \
--ref genome.fa ${name}.bam pileup/${name}_pileup.bed
# Filter for sites that have NvalidCov >= 10 & at least 1 modified read
awk '$5 >= 10 && $11 > 0' pileup/${name}_pileup.bed > pileup/${name}_pileup_filtered.bed
Question 1: Is this a reasonable or correct approach to applying the motif arguments when analyzing all modifications?
With this approach, each modification gets "matched" to all 4 motifs at different frequencies. See the following figure:
Question 2: If the above command is a reasonable approach, is the following filtering logic sufficient to retain modified sites that occur on their appropriate reference base?
filter(modification == "m6A" & motif == "A" |
modification == "m5C" & motif == "C" |
modification == "pseU" & motif == "T" |
modification == "inosine" & motif == "A" |
modification == "2_Ome_A" & motif == "A" |
modification == "2_Ome_C" & motif == "C" |
modification == "2_Ome_G" & motif == "G" |
modification == "2_Ome_U" & motif == "T")
I ask because there is obviously some sort of strand/directionality information that is retained because all modifications in the bar graph have an equal number of A/T motifs and an equal number of C/G motifs. However, the complementary motifs don't seem to encode any different data for the other pileup columns, e.g., A vs T motifs for m6A sites.
Question 3: Using the filtering logic directly above, I have some very rare problematic genomic sites where modifications with complementary reference bases (e.g., 2'O-me U vs 2'O-me U) will occur on the same site and same strand. Do you have any advice on how to deal with these, or why they're popping up?
Example:
Sorry for the long question but thanks so much for the help!
Hello @kylepalos,
Sorry for the delay, I was OOO last week.
Question 1: Is this a reasonable or correct approach to applying the motif arguments when analyzing all modifications?
I don't think you need --motif X 0 where $X = \lbrace A, C, G, T\rbrace$, since this effectively retains every position. What I think you really want is an option that only reports bedMethyl records for modifications that match the reference base. I'm working on some code that prototypes this option, I'll forward it on once I have something to test (it might be tight to get it into the next release).
The modification thresholds you have here are quite high. I imagine that the subsequent filtering will remove contexts where it's unlikely that the base modification calling model will report a modified probability of 0.99. Keep in mind that there will be sequence contexts where the deflection in signal due to a modification might be small, so I would expect the base modification model will have a correspondingly low(er) base modification prediction probability.
I'm nearly done with a feature that will use a different threshold value per-genomic position, one target application is dRNA with high coverage. I'll ping here once it's ready. (That's two features I owe you).
These plots look like a kind of "error model" where you can see the frequency of substitution when there is at least one confidently modified read. Basically, when you use --motif A 0 that only restricts which genomic sites are reported. If a read has a mismatch/substitution at that position it will be reported (as you've seen). Back to what I suggested above, I think you want a --only-matches or similar.
Question 2: If the above command is a reasonable approach, is the following filtering logic sufficient to retain modified sites that occur on their appropriate reference base?
I like the intention of what you have here, but what might be easier is filtering on valid_coverage, looks like you have ~1000x coverage so removing records with valid_coverage < 100 might clean it up a lot. The records that remain with "mismatched" modification might be worth looking at.
Question 3: Using the filtering logic directly above, I have some very rare problematic genomic sites where modifications with complementary reference bases (e.g., 2'O-me U vs 2'O-me U) will occur on the same site and same strand. Do you have any advice on how to deal with these, or why they're popping up?
This looks like a bug or a join-error. Could you share a small bam that overlaps this region (and a link to the reference)?
Hi @ArtRand
Thanks as always for your excellent support and clarity!
I will take some time to better read through your response, but I wanted to provide the problematic region. Here's a GDrive link containing the region and reference. I also attached just the bam/index as a zipped file if you prefer. The reference is Arabidopsis thaliana TAIR10, you can find it here - I don't use the masked assemblies if it matters.
I used the same modkit code and awk step that I noted above.
I'm also going to include minimal R code to reproduce the "problematic" site (and two others.) I maintained the filtering logic from question 2 for better or worse. Because there are three "problematic" sites in this very short region, perhaps it is an annotation or depth issue. Please let me know if you notice any errors on my end or if I can provide any additional information. Thanks again!
library(dplyr)
library(tidyr)
# read in the filtered pileup
# rename the columns of interest
data <- read.table("pileup_filtered.bed",
sep = "\t", header = F) %>%
rename(chrom = 1,
start = 2,
end = 3,
mod_motif = 4,
cov = 5,
strand = 6)
# Parse the modification information and map codes to readable names
data <- data %>%
separate(mod_motif, into = c("code", "motif", "offset"), sep = ",") %>%
# Map modification codes to human-readable modification types
mutate(modification = case_when(code == "a" ~ "m6A",
code == "m" ~ "m5C",
code == "17802" ~ "pseU",
code == "17596" ~ "inosine",
code == "69426" ~ "2_Ome_A",
code == "19228" ~ "2_Ome_C",
code == "19229" ~ "2_Ome_G",
code == "19227" ~ "2_Ome_U")) %>%
# Here, I'm retaining the filtering logic that was in the GitHub issues post
# may not be optimal
filter(modification == "m6A" & motif == "A" |
modification == "m5C" & motif == "C" |
modification == "pseU" & motif == "T" |
modification == "inosine" & motif == "A" |
modification == "2_Ome_A" & motif == "A" |
modification == "2_Ome_C" & motif == "C" |
modification == "2_Ome_G" & motif == "G" |
modification == "2_Ome_U" & motif == "T")
# Find positions where multiple different motifs are detected
problematic <- data %>%
# Create a unique position identifier combining chromosome, start, and end
unite("position", chrom:end, remove = F) %>%
group_by(position, strand) %>%
mutate(n_motifs = n_distinct(motif)) %>%
ungroup() %>%
filter(n_motifs > 1)
@kylepalos sorry for the delay - looking into this.