modkit icon indicating copy to clipboard operation
modkit copied to clipboard

modkit dmr function

Open bebert30 opened this issue 1 year ago • 4 comments

Hello. Thank you for creating this useful tool! I called mC and hmC bases, aligned the .bam files, sorted and indexed, and then created bed files. Then I wanted to perform modkit dmr pair on my two samples and ran into memory issues so I broke up my .bed files into individual chromosomes. However, even getting through just the first chromosome has taken over 24 hours. Do I need to also split the ref into chromosomes? Is this an expected length of time? I am worried something may have gone wrong, and I am curious to know how the modkit dmr pair function is actually performing dmr analysis? Thank you

bebert30 avatar Feb 07 '24 16:02 bebert30

Hello @bebert30,

Yes I would split the reference into chromosomes. Although this will only help the memory usage - not the processing time. What you can do to decrease processing time is oversubscribe the number of threads. For example if you have 8 threads available set --threads 40. The program is heavily IO-bound so you shouldn't thrash the machine. Another option (although not the same experiment), is to look for differentially methylated regions, like CpG islands.

I'm actively working on a refactor of the dmr module when running on individual genome positions that will speed it up dramatically as well as add some new indices. It should be coming your way shortly.

The scoring section of the docs describes the model used. Don't hesitate if you have any questions.

ArtRand avatar Feb 07 '24 18:02 ArtRand

Hello @bebert30,

The latest version of modkit (v0.2.5) should be considerably faster to calculate DMR scores and pvalues (details) at single bases. Please give it a try at your convenience.

ArtRand avatar Mar 05 '24 14:03 ArtRand

Hi @ArtRand, Thank you so much for following up with this. the version is indeed considerably faster. However, is there a way to split the DMR and DhMR calls? the output is calling m and h modifications together (see attached screenshot). Im not sure how to best use this file for subsequent analyses? In addition, I want to look at whole genome levels of methylation and hydroxymethylation. Is there a statistics function in modkit that can help me do this? Many thanks!

Screenshot 2024-02-20 at 7 19 48 PM

bebert30 avatar Mar 20 '24 15:03 bebert30

Hello @bebert30,

It looks like you used a 5hmC/5mC model, so the output of modkit dmr will consider both of these modifications. The score takes multiple modifications into account, so a higher score can mean changes in modification level (modified/not-modified) or changes in modification type (i.e. 5hmC <-> 5mC). If you don't care to keep the 5hmC calls around, you can remove them during the pileup step by specifying --ignore h (see the code snippets in the docs). Keep in mind that when you use --ignore the modification code will become C meaning "any C-mod".

To look at whole genome methylation, I recommend using the bedMethyl tables. You can load these into a data frame library like pandas or polars (run modkit pileup with the --only-tabs option, I've pasted some polars code below and Epi2Me has some pandas code). You can also just use awk, for example (taken from issue 2):

awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' $pileup_bed

Polars (python) code:

# polars requires --only-tabs flag to be set when performing modkit pileup (see above)
pl.read_csv(
        modkit_file,
        separator="\t",
        has_header=False,
        dtypes={
            "chrom": str,
            "start": int,
            "end": int,
            "name": str,
            "score": int,
            "strand": str,
            "tstart": int,
            "tend": int,
            "color": str,
            "coverage": int,
            "freq": float,
            "mod": int,
            "canon": int,
            "other": int,
            "del": int,
            "fail": int,
            "diff": int,
            "nocall": int,
        },
    )

Hope this helps.

ArtRand avatar Mar 21 '24 21:03 ArtRand