Understanding motif search
Hi
I'm working with a bacterial genome and I'm trying to figure out secondary specificity of a methyltransferase. I think the motif search is the correct tool to use here. However it's returning a rather unusual result. I know for a fact that only GCGC 1 sites are modified, but such motif doesn't even show up here. I assume it's the G[m]NC one, though not sure why it shows up in that way. For context, the .bam was generated with a rather robust CG remora model and as far as I understand modified bases were called only in the CG context so I'm unsure how [m]N is even possible. How exactly are the motifs searched for? I tried with some control samples and CG sites, the returned motif is a single [m] and not [m]G, so I assume the [m] is strictly CG which is taken from the bedmethyl file? In this case, what is the recommended usage when the bedmethyl contains information only on CG sites? Any help is appreciated!
edit: just to mention that this testing data is of 1-5x coverage only, but I do know that all GCGC 1 m sites are modified almost fully.
INFO loaded 1 sequence(s)
INFO parsed 789462 bedmethyl records, discarded 0 for low coverage, 789462 total
INFO loaded 516349 low-frequency, 105800 middle-frequency and 148963 high-frequency contexts, 1 modification codes, discarded 0 contexts.
INFO there are 4034 seeds to search at 1 for m
INFO Found 14 motifs:
+-----------+------------+------------+-----------+-----------+
| motif | frac_mod | high_count | low_count | mid_count |
+-----------+------------+------------+-----------+-----------+
| GCGC[m] | 0.91933763 | 5163 | 453 | 840 |
| G[m]NC | 0.91843927 | 59412 | 5276 | 9659 |
| [m]NCGC | 0.91428727 | 16747 | 1570 | 1551 |
| GCG[m] | 0.8821921 | 15164 | 2025 | 3483 |
| [m]TGCY | 0.8801653 | 852 | 116 | 7 |
| [m]NNGCGC | 0.8800615 | 3434 | 468 | 689 |
| CN[m]TNC | 0.87589496 | 734 | 104 | 6 |
| [m]NGCGC | 0.87542814 | 4856 | 691 | 875 |
| GCGCN[m] | 0.8751079 | 3041 | 434 | 779 |
| [m]TGCND | 0.87136084 | 1287 | 190 | 15 |
| CG[m]T | 0.8666036 | 916 | 141 | 4 |
| G[m]TNC | 0.86478543 | 1471 | 230 | 18 |
| [m]TGCNNA | 0.86206895 | 375 | 60 | 4 |
| GCG[m]AA | 0.85746604 | 379 | 63 | 3 |
+-----------+------------+------------+-----------+-----------+
Hello @jorisbalc,
Sorry about the delay getting back.
I assume it's the G[m]NC one, though not sure why it shows up in that way.
This is possible, what you could try is using modkit motif evaluate and/or modkit motif refine with --known-motif GCGC 1 m (docs). If you run these, use --log then you can follow what the algorithm is doing using the structured logging schema (docs).
For context, the .bam was generated with a rather robust CG remora model and as far as I understand modified bases were called only in the CG context so I'm unsure how [m]N is even possible.
So you're not using one of the base modification models released by Nanopore? The sequence motifs are from the reference, so if you had a CG base modification model where the reads are aligned with a very consistent mismatch like this:
query GACGATA
ref GACHATA
You could get motifs at non-CG locations. Basecall errors could also trigger the 5mC call, but it's hard to say without knowing which model you're using. Getting non CG motifs with a CG model context is strange.
How exactly are the motifs searched for?
The algorithm is described here.
I tried with some control samples and CG sites, the returned motif is a single [m] and not [m]G, so I assume the [m] is strictly CG which is taken from the bedmethyl file? In this case, what is the recommended usage when the bedmethyl contains information only on CG sites?
Did you run modkit pileup with --cpg?
just to mention that this testing data is of 1-5x coverage only, but I do know that all GCGC 1 m sites are modified almost fully.
With 5x you're likely going to get more false positive locations (like you're seeing here) since one or two reads could move a location between the "high mod bin" and the "low mod bin". You may try increasing the --high-thresh and decreasing --low-thresh.
Thanks for the reply
I'm using both a finetuned [email protected] canonical basecall model (finetuned in bonito) which manages to call the modified bases correctly (with even greater accuracy due to the modification, ironically). The average accuracy of the model is around 95-96% which seems sufficient enough for this task. Then on top of that, I'm using a modified base model exported from remora, which was prepared as by default parameters in the remora repository. Please note that [m] in this workflow is not 5mC, I'm just using the m label for testing purposes as it is read more easily with some tools.
Whether I use modkit pileup --cpg --ref <.bam> <.out> or just modkit pileup <.bam> <.out> results in the same thing.
Even when I run modkit motif search on the ground truth training data I used, the output looks strange? I'm not sure how to interpret it as the context should be [m]G.
INFO loaded 1 sequence(s)
INFO parsed 712453 bedmethyl records, discarded 150 for low coverage, 712603 total
INFO loaded 0 low-frequency, 0 middle-frequency and 676277 high-frequency contexts, 1 modification codes, discarded 0 contexts.
INFO Found 1 motifs:
+-------+----------+------------+-----------+-----------+
| motif | frac_mod | high_count | low_count | mid_count |
+-------+----------+------------+-----------+-----------+
| [m] | 1 | 712453 | 0 | 0 |
+-------+----------+------------+-----------+-----------+
INFO finished
It makes sense that it is detected since this is ground truth data, but the motif does not make sense to me. Any idea why this might be happening? The .bam are clearly read by other tools like IGV and the mods are labeled correctly at CG sites.
EDIT: I'll check the above solutions in a few days with motif evaluate etc. But I think there's some other underlying issue here where the [m] motif is representing CG as a whole.
Hi @ArtRand ,
> loaded 1 sequence(s)
> parsed 789299 bedmethyl records, discarded 0 for low coverage, 789299 total
> loaded 513297 low-frequency, 107098 middle-frequency and 150649 high-frequency contexts, 1 modification codes, discarded 0 contexts.
> have 1 motifs to evaluate
> evaluated motifs:
+--------+----------+------------+-----------+-----------+-----------+
| motif | frac_mod | high_count | low_count | mid_count | log_odds |
+========+==========+============+===========+===========+===========+
| G[m]NC | 0.919065 | 59333 | 5225 | 9780 | 6.2678194 |
+--------+----------+------------+-----------+-----------+-----------+
The GCGC evaluation:
> loaded 1 sequence(s)
> parsed 789299 bedmethyl records, discarded 0 for low coverage, 789299 total
> loaded 513297 low-frequency, 107098 middle-frequency and 150649 high-frequency contexts, 1 modification codes, discarded 0 contexts.
> have 1 motifs to evaluate
> evaluated motifs:
+--------+-----------+------------+-----------+-----------+-----------+
| motif | frac_mod | high_count | low_count | mid_count | log_odds |
+========+===========+============+===========+===========+===========+
| G[m]GC | 0.9537563 | 56635 | 2746 | 9590 | 7.0840983 |
+--------+-----------+------------+-----------+-----------+-----------+
GCAC evaluation:
> loaded 1 sequence(s)
> parsed 789299 bedmethyl records, discarded 0 for low coverage, 789299 total
> loaded 513297 low-frequency, 107098 middle-frequency and 150649 high-frequency contexts, 1 modification codes, discarded 0 contexts.
> have 1 motifs to evaluate
> evaluated motifs:
+--------+------------+------------+-----------+-----------+----------+
| motif | frac_mod | high_count | low_count | mid_count | log_odds |
+========+============+============+===========+===========+==========+
| G[m]AC | 0.52033466 | 2239 | 2064 | 182 | 2.062237 |
+--------+------------+------------+-----------+-----------+----------+
Thanks for sharing the details.
For your ground truth data, it looks like every reference C passes the high percent modified threshold. This suggests that even C bases outside of a CG context (but where the basecalled sequence is a CG) are being marked as highly modified. In your ground truth strands, are all C bases converted to the modified base of interest? If so, this would explain why the model is calling every site as modified. Training from a fully modified sample is generally not recommended, since it limits the model’s ability to learn the distinction between modified and unmodified sites in other settings. Such data may work for specific test cases, but performance on native or treated samples (where the modification is sparse) is likely to be worse than what you see in this training data. If helpful, we can dig deeper into this in a separate Remora issue.
On the second set of results, I think two things are happening:
- Low coverage and spurious motifs
The CA calls you’re seeing in the
G[m]ACmotif amount to 4,485 calls, compared with 68,971 calls in theC[m]CGmotif. Since you set the motif coverage filter to 1X minimum coverage, even a single basecalling error can cause non-CG sites to appear in the pileup. This is why the default is 5X coverage; lowering it makes the analysis much more sensitive to noise. I realize coverage is limited in your sample, but this is likely the main reason for the extra motifs. Running an error analysis may help confirm this. - Motif expansion during search
The
C[m]NGmotif comes from how the modkit motif search works (see docs). After seeding a motif, the algorithm expands and contracts it as long as the motif passes the coverage and frequency thresholds. Because the non-CG sites occur at such low frequency, the expansion step is still allowed, even though those added positions look weak in isolation. This behavior is intentional; it often helps capture biologically relevant motif ambiguity, so it isn’t something we would likely change.
Recommended fixes:
- Increase the coverage filter slightly (2X or 3X instead of 1X). Even this modest bump should cut down most of the spurious motifs.
- Restrict pileup command to reference CG sites (
--motif CG 0option to pileup). This will exclude sites created by sequencing errors or SNPs (e.g. H→G creating a new CG), but for your use case this is likely acceptable.
These two adjustments should largely resolve what you’re seeing. At a broader level, it would help to know more about the downstream goals of this project so we can recommend the most appropriate analysis strategy.
Thanks for the comment, @marcus1487!
I’ve noticed the lower model efficiency across different samples as well, though it’s not too bad for the task at hand. I’ll post some of my questions about sample prep in the Remora repo later on.
Regarding the motif search, using --motif CG 0 doesn’t change anything as the output still comes out as G[m]NC. It’s not a major issue, but I was confused about why it appears that way. I had assumed a CG 0 model would only call CG sites. Your explanation cleared that up for me, though! I should mention that the basecall accuracy is fairly low since the fine-tuned Bonito model isn’t very optimized yet so that would explain the GCNC calls.
Another question: many of the contexts show up in CG-rich motifs, which makes sense, but could this also be due to a combination of the relatively small genome (E. coli) and the modified base model itself? I’ve noticed that closely spaced CG sites tend to be more difficult to basecall. This is rather an exploratory question, but would you say the other motifs show up due to the model mistakes? Some of them, such as GCG[m] are still hard to understand for me as as the model should not be able to call the GCGC 3 as m, since it's not in a CG context, but the high_count says otherwise... Contexts such as [m]NNNGCGC seem to make sense as valid offtargets though.
The "other" CG rich motifs in question (with ~20X coverage):
+------------+------------+------------+-----------+-----------+
| motif | frac_mod | high_count | low_count | mid_count |
+------------+------------+------------+-----------+-----------+
| [m]NCGC | 0.99828076 | 15097 | 26 | 622 |
| GCGC[m] | 0.99613684 | 3610 | 14 | 429 |
| G[m]NC | 0.9926922 | 54879 | 404 | 5790 |
| [m]NGCGC | 0.9919379 | 3322 | 27 | 573 |
| GCGCN[m] | 0.9823336 | 2391 | 43 | 634 |
| [m]NNGCGC | 0.98066854 | 2435 | 48 | 454 |
| GCG[m] | 0.97085077 | 12290 | 369 | 2888 |
| [m]NNNGCGC | 0.9333547 | 2913 | 208 | 1362 |
| GCGCYN[m] | 0.8606195 | 778 | 126 | 805 |
+------------+------------+------------+-----------+-----------+
I recently just noticed that even though in my new data with coverage of ~20X, that while evaluating the G[m]NC motif, none of the non GC[G]C kmers have any counts found. A follow up question would be, can this simply be caused by deletions? My finetuned bonito model has a tendency to pass some deletions at modified sites...
Evaluate returns the same counts with G[m]GC as with G[m]NC (i.e. high 54879 for both)
Actually, even the ONT 5mCG_5hmCG model returns G[m]NC as the sole motif in control data, with nothing but a single hit at G[m]AC.
modkit motif search --in-bedmethyl barcode09.5mcG_5hmCG.bed --ref /home/joris/ref-seqs/gm119_ref.fasta --threads 20
INFO performing full search for each modification code
INFO loading references from "/home/joris/ref-seqs/gm119_ref.fasta"
INFO loaded 1 sequence(s)
INFO parsed 793340 bedmethyl records, discarded 709240 for low coverage, 1502580 total
INFO loaded 389262 low-frequency, 13595 middle-frequency and 34522 high-frequency contexts, 2 modification codes, discarded 0 contexts.
INFO there are 4264 seeds to search at 1 for h
INFO there are 335 seeds to search at 1 for m
INFO Found 1 motifs:
+--------+-----------+------------+-----------+-----------+
| motif | frac_mod | high_count | low_count | mid_count |
+--------+-----------+------------+-----------+-----------+
| G[m]NC | 0.9987951 | 33157 | 40 | 423 |
+--------+-----------+------------+-----------+-----------+
INFO finished
modkit motif evaluate --known-motif GCGC 1 m --in-bedmethyl barcode09.sup_ody_CG.bed --ref /home/joris/ref-seqs/gm119_ref.fasta
> loading references from "/home/joris/ref-seqs/gm119_ref.fasta"
> loaded 1 sequence(s)
> parsed 793340 bedmethyl records, discarded 709240 for low coverage, 1502580 total
> loaded 389262 low-frequency, 13595 middle-frequency and 34522 high-frequency contexts, 2 modification codes, discarded 0 contexts.
> have 1 motifs to evaluate
> evaluated motifs:
+--------+------------+------------+-----------+-----------+----------+
| motif | frac_mod | high_count | low_count | mid_count | log_odds |
+========+============+============+===========+===========+==========+
| G[m]GC | 0.99882513 | 33156 | 39 | 423 | 17.37176 |
+--------+------------+------------+-----------+-----------+----------+
Thanks for the follow-up and for the detailed results.
Regarding your question about whether the small E. coli genome and the research modified base model could be contributing to the observed motifs, the short answer is yes, model biases can definitely show up in motif search outputs. This is one of the reasons we aim to minimize false positive rates and context bias in production release models, though some level of bias is still expected in fine-tuned or experimental models.
Looking at your listed motifs, all of them still allow for a CG context at the modified base, so the CG calls are very likely driving each of these discovered motifs. Restricting to CG reference bases in your pileup should largely eliminate the extra motif variants. This is also having me consider if there is a nice filter to reduce these outputs by using the fraction of each context contributing to the motif and removing restricted bases within the motif represented by very few observations. I have a feeling this will resolve the core issues found here.
For motifs like GCG[m], these can still occur within a CG context — the modified base is the C, and the next base in the reference could be a G, creating the GCGCG pattern that drives this motif. You could confirm this by running modkit motif evaluate for GCGC followed by each possible next base (A, C, G, T). I’d expect coverage to drop off sharply for all but G, but those low-coverage variants may still persist because they don’t fall below the statistical thresholds for motif pruning.
The last motif (G[m]NC vs. G[m]GC) is particularly interesting. From your evaluation, it looks like only two sites differ between the two motifs — one high and one low. Could you try running modkit pileup using the GCNG 1 motif and inspecting those specific sites in the reference (with the motif option the bedmethly should show the reference motif for each record)? That should clarify whether the expansion is caused by a small number of non-CG alignments or genuine basecalling artifacts.
If those sites don’t explain it, sharing the relevant test data would be very helpful for reproducing and debugging the motif expansion behavior more precisely.
I've concluded that most of the variable motifs are due to the model efficiency, due to it being trained on fully modified ground truth data. This is, as you mentioned, expected. So this is somewhat clear now, thanks!
The G[m]NC issue, I'm still unsure of. While the GCNG 1 pileup does return CG pileups isolated from GCGC sites, there's no reference column in the bedmethyl file. This issue is however, somewhat just a minor inconvenience.