Dmr pair segment misses clearly differentially methylated region
I have a sample in which I know there is a hypomethylated region (chr11:2698767-2700902) in the KCNQ1OT1 gene. In IGV I can clearly see the difference in methylation between this sample (bottom track) and a 'normal' sample (top track):
I ran dmr pair with segmentation on this sample to see if it would catch it. For the sake of simplicity I used only one normal sample (the same one shown in IGV) as the -a input but in reality we would use more. The segmented output from dmr pair consisted of some very small 'different' segments, most being just a few bps, and the KCNQ1OT1 region was included in a large 'same' segment (ie. dmr pair didn't find the region to be a differentially methylated segment).
For the command I used default settings, apart from adding --segment:
modkit dmr pair \
-a NORMAL.bed.gz \
-b SAMPLE.bed.gz \
-o SAMPLE_dmr.bed \
--segment SAMPLE_dmr_segments.bed \
--header \
--ref hg38.fasta \
--base C \
--threads 16 \
--log-filepath SAMPLE_dmrPair.log \
--force
However, when I look at the individual positions of the KCNQ1OT1 region in the non-segmented output file (also output from dmr pair) it looks pretty good. Most of the positions have quite high positive effect sizes and the methylation frequencies look correct. I don't know if it's helpful but here is an excerpt with a few positions from the non-segmented output file:
| chrom | start | end | a_pct_modified | b_pct_modified | effect_size |
|---|---|---|---|---|---|
| chr11 | 2699187 | 2699188 | 0.71428573 | 0,06666667 | 0.6476190476190476 |
| chr11 | 2699199 | 2699200 | 0.57894737 | 0.0952381 | 0.4837092731829574 |
| chr11 | 2699201 | 2699202 | 0.64705884 | 0.1764706 | 0.47058823529411764 |
| chr11 | 2699213 | 2699214 | 0.75 | 0.2222222 | 0.5277777777777778 |
Could you help me figure out what is going wrong in the segmentation part and how I can fix it? Thank you in advance :smile:
Hello @Aarhus-kga,
I tried to reproduce the result using the colo829 open dataset from epi2me.
Here are the steps:
- Fetch just the region we care about. Currently, Modkit doesn't let you use remote/cloud modBAMs. I already started working on allowing this, but for this test it's pretty easy to just download the region with
samtools.
#!/bin/bash
set -eu
mkdir -p data
region="chr11:2,687,019-2,713,072"
echo "> first"
samtools view -bh https://ont-open-data.s3.amazonaws.com/colo829_2024.03/basecalls/colo829bl/sup/PAU59807.d052sup4305mCG_5hmCGvHg38.bam ${region} > data/PAU59807.d052sup4305mCG_5hmCGvHg38_region.bam
samtools index data/PAU59807.d052sup4305mCG_5hmCGvHg38_region.bam
echo "> second"
samtools view -bh https://ont-open-data.s3.amazonaws.com/colo829_2024.03/basecalls/colo829/sup/PAU59949.d052sup4305mCG_5hmCGvHg38.bam ${region} > data/PAU59949.d052sup4305mCG_5hmCGvHg38_region.bam
samtools index data/PAU59949.d052sup4305mCG_5hmCGvHg38_region.bam
echo "> done"
- Perform pileup and make a 5mC bigWig track at the same time:
#!/bin/bash
set -euo pipefail
mkdir -p output
samples=(
PAU59807.d052sup4305mCG_5hmCGvHg38_region
PAU59949.d052sup4305mCG_5hmCGvHg38_region
)
reference=path/to/hg38.fasta
ls ${reference} > /dev/null
region="chr11:2,687,019-2,713,072"
for sample in "${samples[@]}"; do
echo "> ${sample}"
modkit pileup data/${sample}.bam stdout --preset traditional --ref ${reference} -t 30 --region ${region} | \
tee >(modkit bm tobigwig stdin output/${sample}_m.bw --mod-codes m -g ${reference}.fai --suppress-progress) \
| bgzip -c > output/${sample}.bed.gz
tabix output/${sample}.bed.gz
done
echo "> done"
- Perform dmr with segmentation, same as you did.
set -eu
reference=again/the/path/to/hg38.fasta
normal=output/PAU59807.d052sup4305mCG_5hmCGvHg38_region.bed.gz
tumour=output/PAU59949.d052sup4305mCG_5hmCGvHg38_region.bed.gz
modkit dmr pair \
-a ${normal} \
-b ${tumour} \
-o output/dmr.bed \
--segment output/dmr_segments.bed \
--header \
--ref ${reference} \
--base C \
--threads 80 \
--log-filepath output/modkit_dmr_pair.log \
--force
echo "> done"
In my case I was able to detect the DMR:
Here's the segmentation file I got: (I appreciate that this format isn't the easiest to read, but it's the easiest to copy out of the web!)
chr11 2687254 2688315 same 33.313898260088536 10 m:443 m:341 m:72.98 m:93.17 0.72981876 0.931694 -0.20187521
chr11 2688349 2688727 different 80.52540975925785 6 m:177 m:215 m:49.72 m:95.98 0.497191 0.9598214 -0.4626304
chr11 2689066 2689079 same 0.5235188929015067 2 m:118 m:78 m:93.65 m:97.50 0.93650794 0.975 -0.038492084
chr11 2689255 2689800 different 64.17600099291963 4 m:99 m:141 m:39.29 m:92.76 0.39285713 0.92763156 -0.5347744
chr11 2689940 2692211 same 50.26581878578236 18 m:853 m:589 m:76.85 m:94.39 0.76846844 0.94391024 -0.1754418
chr11 2692659 2692660 different 23.012827639542337 1 m:6 m:27 m:9.84 m:77.14 0.09836066 0.7714286 -0.6730679
chr11 2692948 2694200 same 20.49628309291893 21 m:1114 m:714 m:83.20 m:92.73 0.83196414 0.92727274 -0.0953086
chr11 2694325 2694709 different 14.204092897824353 5 m:136 m:38 m:44.16 m:20.65 0.44155845 0.20652173 0.23503672
chr11 2694763 2696656 same 4.932713292807421 17 m:818 m:499 m:74.03 m:80.88 0.7402715 0.808752 -0.06848049
chr11 2696765 2697584 different 48.79662130782788 5 m:226 m:45 m:73.86 m:27.11 0.7385621 0.27108434 0.46747777
chr11 2697863 2698157 same 6.205202769009645 2 m:67 m:20 m:59.29 m:31.25 0.59292036 0.3125 0.28042036
chr11 2698550 2700889 different 2491.392671529058 191 m:4115 m:29 m:47.85 m:0.50 0.4784884 0.005038221 0.47345015
chr11 2700964 2701110 same -0.058211622514704686 4 m:52 m:52 m:34.21 m:38.52 0.34210527 0.38518518 -0.043079913
chr11 2701127 2702259 different 149.0882987369564 16 m:431 m:602 m:57.62 m:95.40 0.5762032 0.9540412 -0.37783796
chr11 2702332 2702708 same 9.82692933763974 3 m:112 m:126 m:81.75 m:97.67 0.81751823 0.9767442 -0.15922594
chr11 2702830 2705587 different 995.545350006727 59 m:1203 m:2397 m:46.04 m:98.00 0.46039036 0.9799673 -0.5195769
chr11 2705785 2706141 same 22.874641084660425 6 m:236 m:271 m:83.99 m:98.91 0.83985764 0.9890511 -0.14919347
chr11 2706507 2708553 different 351.05309865015533 28 m:495 m:1135 m:42.64 m:91.31 0.42635658 0.9131134 -0.48675683
chr11 2708706 2708723 same 3.595605052324572 4 m:128 m:169 m:89.51 m:97.13 0.8951049 0.97126436 -0.07615948
chr11 2708827 2709210 different 119.53604966875082 14 m:284 m:576 m:51.82 m:91.00 0.5182482 0.9099526 -0.39170438
chr11 2709313 2709624 same 22.465725422638116 11 m:307 m:435 m:72.92 m:90.06 0.72921616 0.9006211 -0.17140496
chr11 2709670 2710634 different 87.58909638888849 8 m:88 m:289 m:29.24 m:79.40 0.29235882 0.79395604 -0.5015972
chr11 2710862 2710863 same 0.965917442410273 1 m:35 m:46 m:83.33 m:93.88 0.8333333 0.93877554 -0.105442226
chr11 2711295 2712946 different 315.9517399441984 25 m:547 m:1042 m:45.13 m:91.56 0.45132014 0.9156415 -0.46432135
A couple things you could try:
- Check that the
pileupsteps are the same or similar to mine. Did you use--preset traditional? - Use the segments that I've posted and run
modkit dmr pairwith these as the--regionshow do the effect sizes in your sample compare to mine?
You could try fiddling with the parameters in the Segmentation Options:
Segmentation Options:
--segment <SEGMENTATION_FP>
Run segmentation, output segmented differentially methylated regions
to this file
--max-gap-size <MAX_GAP_SIZE>
Maximum number of base pairs between modified bases for them to be
segmented together
[default: 5000]
--dmr-prior <DMR_PRIOR>
Prior probability of a differentially methylated position
[default: 0.1]
--diff-stay <DIFF_STAY>
Maximum probability of continuing a differentially methylated block,
decay will be dynamic based on proximity to the next position
[default: 0.9]
--significance-factor <SIGNIFICANCE_FACTOR>
Significance factor, effective p-value necessary to favor the
"Different" state
[default: 0.01]
--log-transition-decay
Use logarithmic decay for "Different" stay probability
--decay-distance <DECAY_DISTANCE>
After this many base pairs, the transition probability will become the
prior probability of encountering a differentially modified position
[default: 500]
--fine-grained
Preset HMM segmentation parameters for higher propensity to switch
from "Same" to "Different" state. Results will be shorter segments,
but potentially higher sensitivity
I would try --fine-grained first, then maybe increase the --significance-factor.
Hi @ArtRand
Thank you for your swift answer :smile: I've been playing around with your suggestions so here we go!
Epi2me data I tried running your code with the colo829 data and got exactly the same result as you. So far so good 👍
My data I then tried running your code with my own data and got the same result as I did before, ie. it didn't catch the dmr.
Your segments run with dmr pair
I took the 'different' segments from your output and made into a bed file used as --regions-bed in dmr pair. The effect size was not included in the output so I've calculated them myself and added them to the output:
chr11 2688349 2688727 chr11:2688349-2688727 0.9819444302766556 m:74 100 m:115 139 m:74.00 m:82.73 0.74 0.82733816 -0.08733816
chr11 2689255 2689800 chr11:2689255-2689800 2.935645158818687 m:54 60 m:67 91 m:90.00 m:73.63 0.9 0.73626375 0.16373625
chr11 2692659 2692660 chr11:2692659-2692660 0.11758862429939398 m:5 8 m:16 20 m:62.50 m:80.00 0.625 0.8 -0.175
chr11 2694325 2694709 chr11:2694325-2694709 3.7789791546573497 m:52 69 m:51 95 m:75.36 m:53.68 0.7536232 0.5368421 0.2167811
chr11 2696765 2697584 chr11:2696765-2697584 0.059082074140746954 m:76 79 m:81 87 m:96.20 m:93.10 0.96202534 0.9310345 0.03099084
chr11 2698550 2700889 chr11:2698550-2700889 775.4080315089705 m:1911 3048 m:526 3330 m:62.70 m:15.80 0.6269685 0.15795796 0.46901054
chr11 2701127 2702259 chr11:2701127-2702259 -0.3042338742745869 m:290 312 m:317 339 m:92.95 m:93.51 0.92948717 0.93510324 -0.00561607
chr11 2702830 2705587 chr11:2702830-2705587 5.795110390276932 m:1089 1334 m:1102 1446 m:81.63 m:76.21 0.8163418 0.76210237 0.05423943
chr11 2706507 2708553 chr11:2706507-2708553 -0.2531645046456106 m:539 652 m:549 657 m:82.67 m:83.56 0.8266871 0.8356164 -0.0089293
chr11 2708827 2709210 chr11:2708827-2709210 0.09976471930849584 m:262 289 m:283 305 m:90.66 m:92.79 0.90657437 0.92786884 0.00870553
chr11 2709670 2710634 chr11:2709670-2710634 -0.15470398194440804 m:124 168 m:119 155 m:73.81 m:76.77 0.7380952 0.7677419 -0.0296467
chr11 2711295 2712946 chr11:2711295-2712946 3.70825041390799 m:337 438 m:409 485 m:76.94 m:84.33 0.7694064 0.843299 -0.0738926
Looking at the dmr I'm interested in (chr11:2698550-2700889) my data has an effect size of 0.46901054 while the epi2me data has 0.47345015 which is very close. However, our data has a b_pct_modified of 0.15795796 while the epi2me data has 0.005038221. While our data has a low percentage it is not quite as overwhelmingly unmethylated as the epi2me data. Could this be the reason why the region is not being caught as a dmr segment?
Fiddling with segmentation options
Fine-grained
Running segmentation with the --fine-grained option didn't make any difference. It still didn't catch the dmr.
Significance factor
I next tried increasing the value of --significance-factor and this actually made a huge difference. At 0.05 and above it starts to make 'different' segments in the region but I have to increase it to 0.07 before it catches the whole region as one dmr. I've tried to show the results of the different values in IGV:
It's awesome that I can now catch the dmr, however, I don't understand the algorithm well enough to tell whether increasing the significance factor could also have some negative effects. Do you know if it will?
I also played around with changing --dmr-prior and --diff-stay but it only did strange things to the output, so I think I should stay away from those :wink: