Are all CpG reported ?
Hi I would like to know if all CpG are reported in the modkit pileup if we specify no-filtering ? Even those that are not methylated ?
Hello @OceaneMion,
Positions with $N_{\text{valid}}$ will not get bedMethyl records, so if you have set the --no-filtering flag, CpGs without any read coverage will not be reported, as documented. One thing to keep in mind depending on your experiment, assuming that it is modified or canonical when it does not have any read coverage will probably lead to incorrect results. In fact, off the top of my head I'm having a hard time coming up with a situation where this kind of imputation would make sense.
However if you want to make a BED table with all CpGs, you can use modkit motif-bed and bedtools.
$ modkit pileup ${mod_bam} ${bedmethyl_cpg} --cpg --ref ${ref} --no-filtering
$ modkit modif-bed ${ref} CG 0 > cpgs.bed
$ bedtools intersect -a cpgs.bed -b ${bedmethyl_cpg} -loj -wb
Thanks a lot ! I think it is Indeed better to keep only the cpg with coverage. What I would like to know also is how to get a table with the number of potentially methylated site with 5hmC, then the one with 5mC, and the total number of cpg that have coverage, at the level of the assembly and not reads ? Because I only see it at the level of reads
Hello @OceaneMion,
What I would like to know also is how to get a table with the number of potentially methylated site with 5hmC, then the one with 5mC, and the total number of cpg that have coverage, at the level of the assembly and not reads?
These are the steps
# step 1: create a pileup
$ modkit pileup ${mod_bam} ${bedmethyl_cpg} --cpg --ref ${ref}`
# step 2: separate 5mC from 5hmC
$ awk '$4=="m"' ${bedmethyl_cpg} > ${pileup_5mC}
$ awk '$4=="h"' ${bedmethyl_cpg} > ${pileup_5hmC}
# alternate step 2, if you want only sites with >0% modification
$ awk '($4=="m") && ($11>0.0)' ${bedmethyl_cpg} > ${pileup_5mC_only_mod}
$ awk '($4=="h") && ($11>0.0)' ${bedmethyl_cpg} > ${pileup_5hmC_only_mod}
# step 3: calculate total number of sites with coverage
$ awk '$4=="m"' ${bedmethyl_cpg} | wc -l
# this should be the same number as
$ awk '$4=="h"' ${bedmethyl_cpg} | wc -l
Thanks for your awnser yes if I do not put a threshold therefore I will have the same number of 5hmC and 5mC here, which is then more accurate to use reads for the global methylation level and not the assembly. I have another question, what kind of statistical test will you use to see if methylation at CpG sites is random or not, do you think Kolmogorov-Smirnov is appropriated ?
@OceaneMion
have another question, what kind of statistical test will you use to see if methylation at CpG sites is random or not, do you think Kolmogorov-Smirnov is appropriated ?
Do you mean you want a test to see if the methylation at a CpG is not uniform over the potential classes? I.e. P(canonical) = P(5hmC) = P(5mC) = 1/3?
Hi, I have a kind of similar question, on the bedmethyl file, I have reads that have both MC and hMC at 0. So, are all CpG (even not methylated ones) reported ? I am using nextflow human_variation workflow with the -mod flag. Many thanks.
Hello @selmapichot,
Sorry I missed this question. All CpGs that have modified base information (including canonical calls) will be included in the bedMethyl as long as they pass the pass threshold. If all of the calls fail to pass this threshold there won't be a bedMethyl record. You can bypass filtering with --no-filtering however the accuracy will be lower.