Single sites analysis - too many significant sites and FDR correction
Hello @ArtRand
Thank you for such a nice package. I ran this for my analysis (ignoring positions with low valid coverage (<5X) & considering only effect sizes greater than 10%). I am investigating both 5mC and 5hmC and choose to not combine strands :
modkit dmr pair \
-a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivA2_pileup.bed.gz \
-a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivB2_pileup.bed.gz \
-a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivC2_pileup.bed.gz \
-a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivD2_pileup.bed.gz \
-b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivA1_pileup.bed.gz \
-b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivB1_pileup.bed.gz \
-b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivC1_pileup.bed.gz \
-b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivD1_pileup.bed.gz \
-o /Users/MC/Desktop/UNP_aftervsbefore_4ind_single-base_5X_delta10.txt \
--header \
--ref /Users/MC/Downloads/ncbi_dataset/ncbi_dataset/data/GCA_000001405.29/GCA_000001405.29_GRCh38.p14_genomic.fna \
--delta 0.1 \
--base C \
--threads 16 \
--min-valid-coverage 5 \
--log-filepath /Users/MC/Desktop/DMC_UNP_aftervsbefore_4ind_single-base_5X_delta10.log
which returned
> reading reference FASTA at "/Users/MC/Downloads/ncbi_dataset/ncbi_dataset/data/GCA_000001405.29/GCA_000001405.29_GRCh38.p14_genomic.fna"
> running single-site analysis
> using default prior, Beta(α: 0.55, β: 0.55)
> estimating max coverages from data
> sampled 1667716 a records and 1346810 b records, calculating max coverages for 95th percentile
> calculated max coverage for a: 17 and b: 16
> min valid coverage set to 5
> running with replicates and matched samples
> finished, processed 52301679 sites successfully, 0 failed
I further filtered out p-values pv<0.05 and required 75% of samples to be used in statistical test (which corresponds to 3 out of 4 individuals in my case). I end up with 458,258 "significant" CpGs, which is huge I find (I historically worked with WGBS data and the number of significant DMC is in the magnitude of hundreds). I should add that setting up pv<=0.01 returns only 73,964 significant CpGs. My volcano plot looks normal (please see below) and when I look at the CpGs individually in IGV, they are indeed significant.
- Should I consider adding extra filters to reduce the amount of significant sites?
- Should I consider a greater effect size and a pv<0.01 (represented by the vertical and horizontal lines)?
- Should I increase the coverage per cytosine during dmr pair? What do you think?
I read here that you found that the Benjamini-Hochberg procedure works pretty well with single site analysis. So I calculated FDR on the 458,258 significant pvalues by doing UNP$FDR <- p.adjust(UNP$map_pvalue, method = "BH") and ended up with the exact same FDR for all of them = 0.04993573. Since I did 458,258 tests and 0.04993573*458,258 = 22,883 does this mean that I should expect 22,883 false discoveries - which corresponds to 1/20th of the sites. Again, is this something I should expect with Nanopore data? I would be most grateful if you could share your best practices for FDR correction.
Hello @Machadum,
Sorry about the delayed response.
A couple things I may adjust:
- I usually use a p-value of 0.01 (the default in the segmenter), but others have found that this might be too low.
- An effect size of 10% might be too low, I would only look at effect sizes >30% or so (depending on what you're doing).
ended up with the exact same FDR for all of them = 0.04993573
This might be a calculation error, but I'm not totally familiar with that R function (or what data you passed in). Let me get a worked example together and share it.
Since I did 458,258 tests and 0.04993573*458,258 = 22,883 does this mean that I should expect 22,883 false discoveries - which corresponds to 1/20th of the sites
Yes this is exactly what this means assuming 458_258 sites are below the BH critical value.
Thank you for your answer. For effect sizes > 30%, is it better to run dmr pair on single sites without any filter and then filter out column 16/effect size or to run dmr pair with --delta 0.3? Looking forward to your worked example.
@Machadum,
Thank you for your answer. For effect sizes > 30%, is it better to run dmr pair on single sites without any filter and then filter out column 16/effect size or to run dmr pair with --delta 0.3?
It doesn't really matter, --delta defines the region of practical equivalence and effect sizes less than this will be given a MAP-based p-value of 1. You could easily perform this transformation post hoc, really it's up to you, the intention is to save compute time if you have a lot of sites that have small changes.