modkit Request: additional output columns for extract

Hi @ArtRand,

Per our recent conversation at the T2T consortium meeting, I was hoping for the option to add a few extra columns to the extract output in order to improve downstream processing in our new dimelo package, which uses modkit as part of the backend. The fields I currently would like to see are the following (my naming could probably be improved):

read_start_in_ref: 
    the start of the read in the reference genome coordinates. Currently this can be roughly inferred by the 
    ref_position and read_position columns, but any indels in the read relative to the reference can mean 
    the approximated reference coordinate start for the read is a little bit off, and off by different amounts 
    for different modification motifs.

read_end_in_ref: 
    the end of the read in the reference genome coordinates, for similar reasons to read_start

motif: 
    to make it easy to extract multiple potentially overlapping motifs into the same .txt file, then identify them 
    properly later, it would be beneficial if in addition to the kmer context we could get the specific motif that 
    the line in question was queried with. Currently, I would need to re-implement a motif finder in the 5mer, 
    including handling ambiguous cases like H or D, which you've clearly already got under the hood, so I'd
    prefer to not replicate existing logic and instead get a readout I can parse.

Let me know if there's anything I can clarify! Best, Oberon

Sep 28 '24 23:09 OberonDixon

Hello @OberonDixon,

Nice to hear from you. These all seem like good suggestions. I'll send you a test build once I have it. Thanks!

Sep 29 '24 16:09 ArtRand

@ArtRand just checking in on this - I'm doing more software package dev work this month and it could be good to test this change.

Nov 07 '24 00:11 OberonDixon

Hello @OberonDixon, sorry for the delay. Within a week or so - sorry for the delay.

Nov 08 '24 23:11 ArtRand

Hi @ArtRand, it's been a while since I've checked back on this but I do not believe this feature is currently implemented. Let me know if you can make a version in the next few weeks, as I am working on a software package preprint right now and if this feature were implemented it would simplify and speed up some of my single read work.

Mar 15 '25 19:03 OberonDixon

Hello @OberonDixon You'll have it this week. Looking forward to the preprint.

Mar 23 '25 03:03 ArtRand

Hello @OberonDixon,

Here is a build with the columns you've asked for. Specifically

column	name	description	type
10	alignment_start	leftmost reference position the read is aligned to	int
11	alignment_end	rightmost reference position the read is aligned to	int
22	motifs	motifs that this base is aligned to	str

10 and 11 are pretty self-explanatory, but keep in mind that for (-)-strand mapped reads the alignment_start will be the end of the read. For motifs (22), the output is similar to what pileup will give you when you have multiple motifs: CCGG,1;CG0 means that the position matches both CCGG at position 1 and CG at position 0.

I've also added a flag called --annotate-motifs that will direct Modkit to only annotate which positions match which motifs, but keep all positions. In the released version, when you use --motif it will narrow the output to only matching positions.

Could you tell me if these extra columns satisfy your use case? I still have a little engineering work to do to speed things up, but it would be good to make sure the schema is acceptable.

Here are some easy commands to run on the test data, you can run these from within the repo directory.

modkit extract calls \
    tests/resources/bc_anchored_10_reads.sorted.bam \
    ./extract_calls.tsv \
    --reference tests/resources/CGI_ladder_3.6kb_ref.fa \
    --filter-threshold 0.8 \
    --motif CCGG 1 --force --annotate-motifs

modkit extract full \
    tests/resources/bc_anchored_10_reads.sorted.bam \
    ./extract_full.tsv \
    --reference tests/resources/CGI_ladder_3.6kb_ref.fa \
    --motif CCGG 1 --force --annotate-motifs

modkit_u16_x86_64.tar.gz

Mar 27 '25 17:03 ArtRand

Hi @ArtRand, I tested it out and I think these new columns do exactly what I need! When can these get into a conda release so I can include them in my installable visualization and processing package?

Apr 02 '25 23:04 OberonDixon

@OberonDixon very soon!

Apr 08 '25 15:04 ArtRand