modkit Dmr pair segment misses clearly differentially methylated region

I have a sample in which I know there is a hypomethylated region (chr11:2698767-2700902) in the KCNQ1OT1 gene. In IGV I can clearly see the difference in methylation between this sample (bottom track) and a 'normal' sample (top track):

I ran dmr pair with segmentation on this sample to see if it would catch it. For the sake of simplicity I used only one normal sample (the same one shown in IGV) as the -a input but in reality we would use more. The segmented output from dmr pair consisted of some very small 'different' segments, most being just a few bps, and the KCNQ1OT1 region was included in a large 'same' segment (ie. dmr pair didn't find the region to be a differentially methylated segment).

For the command I used default settings, apart from adding --segment:

modkit dmr pair \
-a NORMAL.bed.gz \
-b SAMPLE.bed.gz \
-o SAMPLE_dmr.bed \
--segment SAMPLE_dmr_segments.bed \
--header \
--ref hg38.fasta \
--base C \
--threads 16 \
--log-filepath SAMPLE_dmrPair.log \
--force

However, when I look at the individual positions of the KCNQ1OT1 region in the non-segmented output file (also output from dmr pair) it looks pretty good. Most of the positions have quite high positive effect sizes and the methylation frequencies look correct. I don't know if it's helpful but here is an excerpt with a few positions from the non-segmented output file:

chrom	start	end	a_pct_modified	b_pct_modified	effect_size
chr11	2699187	2699188	0.71428573	0,06666667	0.6476190476190476
chr11	2699199	2699200	0.57894737	0.0952381	0.4837092731829574
chr11	2699201	2699202	0.64705884	0.1764706	0.47058823529411764
chr11	2699213	2699214	0.75	0.2222222	0.5277777777777778

Could you help me figure out what is going wrong in the segmentation part and how I can fix it? Thank you in advance :smile:

Feb 06 '25 11:02 Aarhus-kga

Hello @Aarhus-kga,

I tried to reproduce the result using the colo829 open dataset from epi2me.

Here are the steps:

Fetch just the region we care about. Currently, Modkit doesn't let you use remote/cloud modBAMs. I already started working on allowing this, but for this test it's pretty easy to just download the region with samtools.


#!/bin/bash
set -eu
mkdir -p data
region="chr11:2,687,019-2,713,072"

echo "> first"
samtools view -bh https://ont-open-data.s3.amazonaws.com/colo829_2024.03/basecalls/colo829bl/sup/PAU59807.d052sup4305mCG_5hmCGvHg38.bam ${region} > data/PAU59807.d052sup4305mCG_5hmCGvHg38_region.bam
samtools index data/PAU59807.d052sup4305mCG_5hmCGvHg38_region.bam
echo "> second"
samtools view -bh https://ont-open-data.s3.amazonaws.com/colo829_2024.03/basecalls/colo829/sup/PAU59949.d052sup4305mCG_5hmCGvHg38.bam ${region} > data/PAU59949.d052sup4305mCG_5hmCGvHg38_region.bam
samtools index data/PAU59949.d052sup4305mCG_5hmCGvHg38_region.bam

echo "> done"

Perform pileup and make a 5mC bigWig track at the same time:

#!/bin/bash

set -euo pipefail
mkdir -p output

samples=(
  PAU59807.d052sup4305mCG_5hmCGvHg38_region
  PAU59949.d052sup4305mCG_5hmCGvHg38_region
)
reference=path/to/hg38.fasta
ls ${reference} > /dev/null

region="chr11:2,687,019-2,713,072"
for sample in "${samples[@]}"; do
  echo "> ${sample}"
  modkit pileup data/${sample}.bam stdout --preset traditional --ref ${reference} -t 30 --region ${region} | \
    tee >(modkit bm tobigwig stdin output/${sample}_m.bw --mod-codes m -g ${reference}.fai --suppress-progress) \
    | bgzip -c > output/${sample}.bed.gz
  tabix output/${sample}.bed.gz
done

echo "> done"

Perform dmr with segmentation, same as you did.

set -eu

reference=again/the/path/to/hg38.fasta
normal=output/PAU59807.d052sup4305mCG_5hmCGvHg38_region.bed.gz
tumour=output/PAU59949.d052sup4305mCG_5hmCGvHg38_region.bed.gz

modkit dmr pair \
  -a ${normal} \
  -b ${tumour} \
  -o output/dmr.bed \
  --segment output/dmr_segments.bed \
  --header \
  --ref ${reference} \
  --base C \
  --threads 80 \
  --log-filepath output/modkit_dmr_pair.log \
  --force

echo "> done"

In my case I was able to detect the DMR:

Here's the segmentation file I got: (I appreciate that this format isn't the easiest to read, but it's the easiest to copy out of the web!)

chr11  2687254  2688315  same       33.313898260088536     10   m:443   m:341   m:72.98  m:93.17  0.72981876  0.931694     -0.20187521
chr11  2688349  2688727  different  80.52540975925785      6    m:177   m:215   m:49.72  m:95.98  0.497191    0.9598214    -0.4626304
chr11  2689066  2689079  same       0.5235188929015067     2    m:118   m:78    m:93.65  m:97.50  0.93650794  0.975        -0.038492084
chr11  2689255  2689800  different  64.17600099291963      4    m:99    m:141   m:39.29  m:92.76  0.39285713  0.92763156   -0.5347744
chr11  2689940  2692211  same       50.26581878578236      18   m:853   m:589   m:76.85  m:94.39  0.76846844  0.94391024   -0.1754418
chr11  2692659  2692660  different  23.012827639542337     1    m:6     m:27    m:9.84   m:77.14  0.09836066  0.7714286    -0.6730679
chr11  2692948  2694200  same       20.49628309291893      21   m:1114  m:714   m:83.20  m:92.73  0.83196414  0.92727274   -0.0953086
chr11  2694325  2694709  different  14.204092897824353     5    m:136   m:38    m:44.16  m:20.65  0.44155845  0.20652173   0.23503672
chr11  2694763  2696656  same       4.932713292807421      17   m:818   m:499   m:74.03  m:80.88  0.7402715   0.808752     -0.06848049
chr11  2696765  2697584  different  48.79662130782788      5    m:226   m:45    m:73.86  m:27.11  0.7385621   0.27108434   0.46747777
chr11  2697863  2698157  same       6.205202769009645      2    m:67    m:20    m:59.29  m:31.25  0.59292036  0.3125       0.28042036
chr11  2698550  2700889  different  2491.392671529058      191  m:4115  m:29    m:47.85  m:0.50   0.4784884   0.005038221  0.47345015
chr11  2700964  2701110  same       -0.058211622514704686  4    m:52    m:52    m:34.21  m:38.52  0.34210527  0.38518518   -0.043079913
chr11  2701127  2702259  different  149.0882987369564      16   m:431   m:602   m:57.62  m:95.40  0.5762032   0.9540412    -0.37783796
chr11  2702332  2702708  same       9.82692933763974       3    m:112   m:126   m:81.75  m:97.67  0.81751823  0.9767442    -0.15922594
chr11  2702830  2705587  different  995.545350006727       59   m:1203  m:2397  m:46.04  m:98.00  0.46039036  0.9799673    -0.5195769
chr11  2705785  2706141  same       22.874641084660425     6    m:236   m:271   m:83.99  m:98.91  0.83985764  0.9890511    -0.14919347
chr11  2706507  2708553  different  351.05309865015533     28   m:495   m:1135  m:42.64  m:91.31  0.42635658  0.9131134    -0.48675683
chr11  2708706  2708723  same       3.595605052324572      4    m:128   m:169   m:89.51  m:97.13  0.8951049   0.97126436   -0.07615948
chr11  2708827  2709210  different  119.53604966875082     14   m:284   m:576   m:51.82  m:91.00  0.5182482   0.9099526    -0.39170438
chr11  2709313  2709624  same       22.465725422638116     11   m:307   m:435   m:72.92  m:90.06  0.72921616  0.9006211    -0.17140496
chr11  2709670  2710634  different  87.58909638888849      8    m:88    m:289   m:29.24  m:79.40  0.29235882  0.79395604   -0.5015972
chr11  2710862  2710863  same       0.965917442410273      1    m:35    m:46    m:83.33  m:93.88  0.8333333   0.93877554   -0.105442226
chr11  2711295  2712946  different  315.9517399441984      25   m:547   m:1042  m:45.13  m:91.56  0.45132014  0.9156415    -0.46432135

A couple things you could try:

Check that the pileup steps are the same or similar to mine. Did you use --preset traditional?
Use the segments that I've posted and run modkit dmr pair with these as the --regions how do the effect sizes in your sample compare to mine?

You could try fiddling with the parameters in the Segmentation Options:

Segmentation Options:
      --segment <SEGMENTATION_FP>
          Run segmentation, output segmented differentially methylated regions
          to this file

      --max-gap-size <MAX_GAP_SIZE>
          Maximum number of base pairs between modified bases for them to be
          segmented together

          [default: 5000]

      --dmr-prior <DMR_PRIOR>
          Prior probability of a differentially methylated position

          [default: 0.1]

      --diff-stay <DIFF_STAY>
          Maximum probability of continuing a differentially methylated block,
          decay will be dynamic based on proximity to the next position

          [default: 0.9]

      --significance-factor <SIGNIFICANCE_FACTOR>
          Significance factor, effective p-value necessary to favor the
          "Different" state

          [default: 0.01]

      --log-transition-decay
          Use logarithmic decay for "Different" stay probability

      --decay-distance <DECAY_DISTANCE>
          After this many base pairs, the transition probability will become the
          prior probability of encountering a differentially modified position

          [default: 500]

      --fine-grained
          Preset HMM segmentation parameters for higher propensity to switch
          from "Same" to "Different" state. Results will be shorter segments,
          but potentially higher sensitivity

I would try --fine-grained first, then maybe increase the --significance-factor.

Feb 06 '25 21:02 ArtRand

Hi @ArtRand

Thank you for your swift answer :smile: I've been playing around with your suggestions so here we go!

Epi2me data I tried running your code with the colo829 data and got exactly the same result as you. So far so good 👍

My data I then tried running your code with my own data and got the same result as I did before, ie. it didn't catch the dmr.

Your segments run with dmr pair I took the 'different' segments from your output and made into a bed file used as --regions-bed in dmr pair. The effect size was not included in the output so I've calculated them myself and added them to the output:

chr11   2688349 2688727 chr11:2688349-2688727   0.9819444302766556      m:74    100     m:115   139     m:74.00 m:82.73 0.74    0.82733816	-0.08733816
chr11   2689255 2689800 chr11:2689255-2689800   2.935645158818687       m:54    60      m:67    91      m:90.00 m:73.63 0.9     0.73626375	0.16373625
chr11   2692659 2692660 chr11:2692659-2692660   0.11758862429939398     m:5     8       m:16    20      m:62.50 m:80.00 0.625   0.8	-0.175
chr11   2694325 2694709 chr11:2694325-2694709   3.7789791546573497      m:52    69      m:51    95      m:75.36 m:53.68 0.7536232       0.5368421  0.2167811
chr11   2696765 2697584 chr11:2696765-2697584   0.059082074140746954    m:76    79      m:81    87      m:96.20 m:93.10 0.96202534      0.9310345  0.03099084
chr11   2698550 2700889 chr11:2698550-2700889   775.4080315089705       m:1911  3048    m:526   3330    m:62.70 m:15.80 0.6269685       0.15795796  0.46901054
chr11   2701127 2702259 chr11:2701127-2702259   -0.3042338742745869     m:290   312     m:317   339     m:92.95 m:93.51 0.92948717      0.93510324  -0.00561607
chr11   2702830 2705587 chr11:2702830-2705587   5.795110390276932       m:1089  1334    m:1102  1446    m:81.63 m:76.21 0.8163418       0.76210237  0.05423943
chr11   2706507 2708553 chr11:2706507-2708553   -0.2531645046456106     m:539   652     m:549   657     m:82.67 m:83.56 0.8266871       0.8356164  -0.0089293
chr11   2708827 2709210 chr11:2708827-2709210   0.09976471930849584     m:262   289     m:283   305     m:90.66 m:92.79 0.90657437      0.92786884  0.00870553
chr11   2709670 2710634 chr11:2709670-2710634   -0.15470398194440804    m:124   168     m:119   155     m:73.81 m:76.77 0.7380952       0.7677419  -0.0296467
chr11   2711295 2712946 chr11:2711295-2712946   3.70825041390799        m:337   438     m:409   485     m:76.94 m:84.33 0.7694064       0.843299  -0.0738926

Looking at the dmr I'm interested in (chr11:2698550-2700889) my data has an effect size of 0.46901054 while the epi2me data has 0.47345015 which is very close. However, our data has a b_pct_modified of 0.15795796 while the epi2me data has 0.005038221. While our data has a low percentage it is not quite as overwhelmingly unmethylated as the epi2me data. Could this be the reason why the region is not being caught as a dmr segment?

Fiddling with segmentation options

Fine-grained Running segmentation with the --fine-grained option didn't make any difference. It still didn't catch the dmr.

Significance factor I next tried increasing the value of --significance-factor and this actually made a huge difference. At 0.05 and above it starts to make 'different' segments in the region but I have to increase it to 0.07 before it catches the whole region as one dmr. I've tried to show the results of the different values in IGV:

It's awesome that I can now catch the dmr, however, I don't understand the algorithm well enough to tell whether increasing the significance factor could also have some negative effects. Do you know if it will?

I also played around with changing --dmr-prior and --diff-stay but it only did strange things to the output, so I think I should stay away from those :wink:

Feb 10 '25 10:02 Aarhus-kga