modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Strand count asymmetry in modkit motif compared to find-motifs

Open angelluigi opened this issue 1 month ago • 2 comments

Hi,

I’ve observed a possible inconsistency between modkit motif and modkit find-motifs regarding how strand orientation is handled.

When using find-motifs, the motif counts correctly reflect the total number of genomic sites (combining both forward and reverse-complement strands). However, when I use modkit motif with a known sequence, the total number of detected sites across both strands is approximately correct, but the distribution between the forward and reverse strands is uneven.

For example, if the motif occurs ~300 times on the forward strand and ~300 on the reverse complement in the reference, the output from modkit motif might report something like 120 sites on the “+” strand and 180 on the “−” strand, instead of the expected 300 + 300.

This pattern is reproducible across samples and references. The total number of motif hits is close to what I expect, but the strand-level split is systematically unbalanced.

Could you clarify whether modkit motif applies any strand filtering or alignment-based orientation rule when assigning strand labels to motif hits?

It might help to document this strand handling behavior or provide a flag to ensure symmetric motif detection and reporting across both strands.

Thanks again for your time and for developing such a powerful tool — Modkit has been extremely helpful for methylation motif analysis.

angelluigi avatar Nov 04 '25 15:11 angelluigi

Hello @angelluigi,

To make things a little easier, could you tell me the output of modkit --version and show me the exact commands you're running?

For modkit motif do you mean modkit motif bed?

For example, if the motif occurs ~300 times on the forward strand and ~300 on the reverse complement in the reference, the output from modkit motif might report something like 120 sites on the “+” strand and 180 on the “−” strand, instead of the expected 300 + 300.

Sounds like modkit motif bed is missing about half of the sites? Am I understanding you correctly?

Is there a publicly available reference I can test this on?

Could you clarify whether modkit motif applies any strand filtering or alignment-based orientation rule when assigning strand labels to motif hits?

I must not be fully understanding which command you're using, could you clarify?

It might help to document this strand handling behavior or provide a flag to ensure symmetric motif detection and reporting across both strands.

I agree - I just need an example that reproduces the problem.

Thanks again for your time and for developing such a powerful tool — Modkit has been extremely helpful for methylation motif analysis.

:heart:

ArtRand avatar Nov 05 '25 17:11 ArtRand

Hello @ArtRand,

Thanks again for your reply.

The version of modkit that I used is this:

CALENDULA[ um_caid_1_1@frontend11 dist_modkit_v0.5.1_8fa79e3]$ ./modkit --version
modkit 0.5.1

Below is the set of commands I ran:


MOTIF_SEQ="AGAGNNNNNNGTG"
`MOTIF_OFFSET=2`

  # 1️⃣ PILEUP global
  "$MODKIT" pileup \
    "$BAM" \
    "$BED_GLOBAL" \
    -r "$REF" \
    --filter-threshold A:0.9 \
    --max-depth 50000 \
    --threads "$THREADS" \
    --log-filepath "$LOG_GLOBAL"

  # 2️⃣ PILEUP específico del motivo
  "$MODKIT" pileup \
    "$BAM" \
    "$BED_MOTIF" \
    -r "$REF" \
    --motif "$MOTIF_SEQ" "$MOTIF_OFFSET" \
    --filter-threshold A:0.8 \
    --max-depth 50000 \
    --threads "$THREADS" \
    --log-filepath "$LOG_MOTIF"

  # 3️⃣ FIND-MOTIFS
  "$MODKIT" find-motifs \
    --in-bedmethyl "$BED_GLOBAL" \
    --ref "$REF" \
    --min-sites 50 \
    --threads "$THREADS" \
    --out-table "$TSV_MOTIFS" \
--log-filepath "$LOG_FIND"

I cannot send you the global.bed file, but after checking it carefully, I noticed something interesting at least for me...

When I run the motif-specific pileup (command #2) with the motif AGAGNNNNNNGTG (offset = 2), I get around 300 sites in total. However, these sites are distributed between both strands (+ and –), which I don’t fully understand — roughly 180 on the + strand and 120 on the – strand.

Then, if I run the same command but with the reverse complement motif CACNNNNNNCTCT (offset = 1), I get another ~300 sites, but with the opposite distribution — around 180 on the – strand and 120 on the + strand.

Given that this motif should theoretically have methylation on both orientations (the motif and its reverse complement), it seems that each run only captures half of the expected sites — as if modkit evaluates the forward and reverse motifs separately rather than both together.

I’m trying to understand whether this strand asymmetry is expected — for example, if modkit only counts motifs in the strand orientation in which they appear in the reference genome — or if there might be something in the internal handling of strand information during the pileup --motif step that explains this behaviour.

Would it be correct to assume that find-motifs handles both orientations globally, while pileup --motif only processes the strand where the motif is found in the reference?

Thanks in advanced,

Angel

angelluigi avatar Nov 06 '25 07:11 angelluigi