High number of outliers in DMR analysis with `modkit dmr pair`
Hi everyone,
I've been using modkit dmr pair to compare methylation between a control and a mutant sample for a specific motif. Here's the command I used:
modkit dmr pair \
-a "$FILE_A" \
-b "$FILE_B" \
-o "$OUTPUT" \
--ref "$REF_GENOME" \
--base A \
--min-valid-coverage 10 \
--segment "$SEGMENTS" \
--fine-grained \
--threads 32 \
--log-filepath "$LOG" \
--header
Here's a sample of the output:
I filtered the results using map_pvalue < 0.05 to keep only statistically significant entries. Then, I plotted a_pct_modified vs b_pct_modified using boxplots to visualize the differences:
However, I'm still seeing a lot of outliers in the boxplots. I'm wondering:
-
Is this expected behavior, or could I be doing something wrong?
-
Could it be due to low coverage? The maximum coverage I have in both samples is 27. Is that considered too low for reliable DMR calling?
Any advice or suggestions would be appreciated! Thanks in advance.
Hello @CrisDAndres,
Sorry for the terribly long response time (new Modkit features are coming!).
Nothing you're doing stands out as incorrect, but here are some considerations.
You may want to perform a multiple testing correction on the results. Others have done this and suggested Benjamini–Hochberg procedure, but you have to perform this yourself. I'd like to add some functions to modkit dmr that do some of this correction post hoc.
The outliers could be positions with higher coverage, so you're seeing a significant change, but a small change. Perhaps these positions are a mapping artifact?
Low coverage would usually result in fewer significant sites.
Hi @ArtRand ,
Thank you for your feedback and helpful suggestions. I really appreciate the insights.
Regarding the multiple testing correction, I have implemented the Benjamini–Hochberg procedure as you suggested. The results seem quite robust, especially since I am comparing a wild-type strain with a strain that has had a gene deleted, and a clear difference is evident between the two. I used a threshold of FDR < 0.05, and here are the key results:
- Significant results: 18,474 out of 18,569
- Average FDR: 0.000344
- False positives: 6.33
I see that the FDR is very low, which I guess I can say that the results are quite reliable. However, I’ll keep in mind your point about potential outliers being due to higher coverage or mapping artifacts and take another look at those regions to make sure I’m not missing anything.
Besides the Benjamini–Hochberg correction, would you recommend making any adjustments to the effect size to further refine the analysis?
Thanks again for your help and insights!