modkit Single sites analysis - too many significant sites and FDR correction

Hello @ArtRand

Thank you for such a nice package. I ran this for my analysis (ignoring positions with low valid coverage (<5X) & considering only effect sizes greater than 10%). I am investigating both 5mC and 5hmC and choose to not combine strands :

modkit dmr pair \
  -a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivA2_pileup.bed.gz \
  -a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivB2_pileup.bed.gz \
  -a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivC2_pileup.bed.gz \
  -a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivD2_pileup.bed.gz \
  -b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivA1_pileup.bed.gz \
  -b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivB1_pileup.bed.gz \
  -b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivC1_pileup.bed.gz \
  -b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivD1_pileup.bed.gz \
  -o /Users/MC/Desktop/UNP_aftervsbefore_4ind_single-base_5X_delta10.txt \
  --header \
  --ref /Users/MC/Downloads/ncbi_dataset/ncbi_dataset/data/GCA_000001405.29/GCA_000001405.29_GRCh38.p14_genomic.fna \
  --delta 0.1 \
  --base C \
  --threads 16 \
  --min-valid-coverage 5 \
  --log-filepath /Users/MC/Desktop/DMC_UNP_aftervsbefore_4ind_single-base_5X_delta10.log

which returned

> reading reference FASTA at "/Users/MC/Downloads/ncbi_dataset/ncbi_dataset/data/GCA_000001405.29/GCA_000001405.29_GRCh38.p14_genomic.fna"
> running single-site analysis
> using default prior, Beta(α: 0.55, β: 0.55)
> estimating max coverages from data
> sampled 1667716 a records and 1346810 b records, calculating max coverages for 95th percentile
> calculated max coverage for a: 17 and b: 16
> min valid coverage set to 5
> running with replicates and matched samples
> finished, processed 52301679 sites successfully, 0 failed

I further filtered out p-values pv<0.05 and required 75% of samples to be used in statistical test (which corresponds to 3 out of 4 individuals in my case). I end up with 458,258 "significant" CpGs, which is huge I find (I historically worked with WGBS data and the number of significant DMC is in the magnitude of hundreds). I should add that setting up pv<=0.01 returns only 73,964 significant CpGs. My volcano plot looks normal (please see below) and when I look at the CpGs individually in IGV, they are indeed significant.

Should I consider adding extra filters to reduce the amount of significant sites?
Should I consider a greater effect size and a pv<0.01 (represented by the vertical and horizontal lines)?
Should I increase the coverage per cytosine during dmr pair? What do you think?

I read here that you found that the Benjamini-Hochberg procedure works pretty well with single site analysis. So I calculated FDR on the 458,258 significant pvalues by doing UNP$FDR <- p.adjust(UNP$map_pvalue, method = "BH") and ended up with the exact same FDR for all of them = 0.04993573. Since I did 458,258 tests and 0.04993573*458,258 = 22,883 does this mean that I should expect 22,883 false discoveries - which corresponds to 1/20th of the sites. Again, is this something I should expect with Nanopore data? I would be most grateful if you could share your best practices for FDR correction.

Mar 18 '25 10:03 Machadum

Hello @Machadum,

Sorry about the delayed response.

A couple things I may adjust:

I usually use a p-value of 0.01 (the default in the segmenter), but others have found that this might be too low.
An effect size of 10% might be too low, I would only look at effect sizes >30% or so (depending on what you're doing).

ended up with the exact same FDR for all of them = 0.04993573

This might be a calculation error, but I'm not totally familiar with that R function (or what data you passed in). Let me get a worked example together and share it.

Since I did 458,258 tests and 0.04993573*458,258 = 22,883 does this mean that I should expect 22,883 false discoveries - which corresponds to 1/20th of the sites

Yes this is exactly what this means assuming 458_258 sites are below the BH critical value.

Mar 20 '25 03:03 ArtRand

Thank you for your answer. For effect sizes > 30%, is it better to run dmr pair on single sites without any filter and then filter out column 16/effect size or to run dmr pair with --delta 0.3? Looking forward to your worked example.

Mar 20 '25 08:03 Machadum

@Machadum,

Thank you for your answer. For effect sizes > 30%, is it better to run dmr pair on single sites without any filter and then filter out column 16/effect size or to run dmr pair with --delta 0.3?

It doesn't really matter, --delta defines the region of practical equivalence and effect sizes less than this will be given a MAP-based p-value of 1. You could easily perform this transformation post hoc, really it's up to you, the intention is to save compute time if you have a lot of sites that have small changes.

Mar 20 '25 21:03 ArtRand