modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Are all CpG reported ?

Open OceaneMion opened this issue 1 year ago • 7 comments

Hi I would like to know if all CpG are reported in the modkit pileup if we specify no-filtering ? Even those that are not methylated ?

OceaneMion avatar May 28 '24 12:05 OceaneMion

Hello @OceaneMion,

Positions with $N_{\text{valid}}$ will not get bedMethyl records, so if you have set the --no-filtering flag, CpGs without any read coverage will not be reported, as documented. One thing to keep in mind depending on your experiment, assuming that it is modified or canonical when it does not have any read coverage will probably lead to incorrect results. In fact, off the top of my head I'm having a hard time coming up with a situation where this kind of imputation would make sense.

However if you want to make a BED table with all CpGs, you can use modkit motif-bed and bedtools.


$ modkit pileup ${mod_bam} ${bedmethyl_cpg} --cpg --ref ${ref} --no-filtering 
$ modkit modif-bed ${ref} CG 0 > cpgs.bed
$ bedtools intersect -a cpgs.bed -b ${bedmethyl_cpg} -loj -wb

ArtRand avatar May 28 '24 16:05 ArtRand

Thanks a lot ! I think it is Indeed better to keep only the cpg with coverage. What I would like to know also is how to get a table with the number of potentially methylated site with 5hmC, then the one with 5mC, and the total number of cpg that have coverage, at the level of the assembly and not reads ? Because I only see it at the level of reads

OceaneMion avatar May 28 '24 18:05 OceaneMion

Hello @OceaneMion,

What I would like to know also is how to get a table with the number of potentially methylated site with 5hmC, then the one with 5mC, and the total number of cpg that have coverage, at the level of the assembly and not reads?

These are the steps

# step 1: create a pileup
$ modkit pileup ${mod_bam} ${bedmethyl_cpg} --cpg --ref ${ref}`

# step 2: separate 5mC from 5hmC
$ awk '$4=="m"' ${bedmethyl_cpg} > ${pileup_5mC}
$ awk '$4=="h"' ${bedmethyl_cpg} > ${pileup_5hmC}

# alternate step 2, if you want only sites with >0% modification
$ awk '($4=="m") && ($11>0.0)' ${bedmethyl_cpg} > ${pileup_5mC_only_mod}
$ awk '($4=="h") && ($11>0.0)' ${bedmethyl_cpg} > ${pileup_5hmC_only_mod}

# step 3: calculate total number of sites with coverage
$ awk '$4=="m"' ${bedmethyl_cpg} | wc -l 
# this should be the same number as
$ awk '$4=="h"' ${bedmethyl_cpg} | wc -l 

ArtRand avatar May 28 '24 22:05 ArtRand

Thanks for your awnser yes if I do not put a threshold therefore I will have the same number of 5hmC and 5mC here, which is then more accurate to use reads for the global methylation level and not the assembly. I have another question, what kind of statistical test will you use to see if methylation at CpG sites is random or not, do you think Kolmogorov-Smirnov is appropriated ?

OceaneMion avatar Jun 05 '24 11:06 OceaneMion

@OceaneMion

have another question, what kind of statistical test will you use to see if methylation at CpG sites is random or not, do you think Kolmogorov-Smirnov is appropriated ?

Do you mean you want a test to see if the methylation at a CpG is not uniform over the potential classes? I.e. P(canonical) = P(5hmC) = P(5mC) = 1/3?

ArtRand avatar Jun 05 '24 14:06 ArtRand

Hi, I have a kind of similar question, on the bedmethyl file, I have reads that have both MC and hMC at 0. So, are all CpG (even not methylated ones) reported ? I am using nextflow human_variation workflow with the -mod flag. Many thanks.

selmapichot avatar Jun 20 '24 09:06 selmapichot

Hello @selmapichot,

Sorry I missed this question. All CpGs that have modified base information (including canonical calls) will be included in the bedMethyl as long as they pass the pass threshold. If all of the calls fail to pass this threshold there won't be a bedMethyl record. You can bypass filtering with --no-filtering however the accuracy will be lower.

ArtRand avatar Jun 25 '24 21:06 ArtRand