Recommended way to include modkit as a dependency for a multi-platform python package
I am building a Python package for processing and visualizing DiMeLo-seq and other directed- or endogenous-methylation long-read sequencing data genome-wide. I have a working prototype that uses modkit as its core pileup and mods-on-reads extraction infrastructure, replacing our old custom .bam parsing and an earlier pysam-based prototype. Modkit is great for pileups, and a bit slow but usable for getting single-read-level data out.
However, given that there is no python wrapper on pip or conda (that I can find) nor are there currently released executables for non-Linux platforms (even though e.g. my M1 Mac build runs just fine), do you have any recommendations or preferences for how I should include this dependency?
Currently we do not plan to release on PyPI quite yet; we are just providing a repo for collaborators to clone and install as a package. However, making our beta users install modkit themselves (or even build it from source) and provide a path to the executable, or bundling modkit executables in our repo, both seem less than optimal. Automatically downloading a specific version from https://github.com/nanoporetech/modkit/releases as part of setup.py seems perhaps acceptable for Linux, but I'm not confident that would be future-proof against changes you might make to the modkit repo in the future.
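For reference, the kind of thing I had in mind is roughly the sketch below. The release tag, tarball name, and unpacked layout are guesses based on the current releases page, not anything I've confirmed is stable:

```python
# Hypothetical sketch of fetching a pinned modkit release at install time.
# The asset naming convention below is an assumption; check the actual
# file names on the modkit releases page before relying on it.
import os
import stat
import tarfile
import urllib.request

MODKIT_VERSION = "v0.2.4"  # pin a release so installs are reproducible

def modkit_release_url(version: str) -> str:
    """Build the download URL for the Linux release tarball (name assumed)."""
    return ("https://github.com/nanoporetech/modkit/releases/download/"
            f"{version}/modkit_{version}_centos7_x86_64.tar.gz")

def fetch_modkit(dest_dir: str = "bin") -> str:
    """Download and unpack the pinned modkit binary; return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    archive = os.path.join(dest_dir, "modkit.tar.gz")
    urllib.request.urlretrieve(modkit_release_url(MODKIT_VERSION), archive)
    with tarfile.open(archive) as tf:
        tf.extractall(dest_dir)
    exe = os.path.join(dest_dir, "dist", "modkit")  # unpacked layout assumed
    os.chmod(exe, os.stat(exe).st_mode | stat.S_IXUSR)
    return exe
```

The fragility of this is exactly my worry: any rename of the release assets would break installs for everyone.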
A bit of an open-ended question, but interested to hear your thoughts.
Hello @OberonDixon,
We've actually considered making a python wrapper/package around modkit, similar to what modbam2bed had, but with extensions to all or most of the modkit functionality. This is probably a little ways off, however, so depending on your timeline you may need to roll your own. There is a conda package for the modkit executable by the way - so you can have your conda env grab that.
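For example, a minimal environment spec might look like the following (the bioconda package name here is from memory; double-check it against the channel):

```yaml
# environment.yml -- assumes the package is published on bioconda as
# `ont-modkit`; verify the exact name on the bioconda channel before use.
name: dimelo-env
channels:
  - bioconda
  - conda-forge
dependencies:
  - python>=3.9
  - ont-modkit
```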
Could you elaborate on how modkit is "a bit slow but usable for getting single-read-level data out"? We're working on some performance improvements right now, so I'm curious what use case you have, since extract was improved markedly recently.
We want a workable prototype to distribute asap, but I'll keep an eye out for the python wrappers.
I didn't find that conda package when searching this repo or Google (I googled "conda modkit" and "pip modkit" and nothing came up), but it is probably sufficient for us for the time being; thank you very much for the link. One quick question: do you intend to add Windows executables at some point? We can move forward without Windows support for now, but ideally we want our users to be able to use any platform. Also, maybe the conda package should be findable by Google or mentioned in the documentation; that would probably have saved me asking the question! Fantastic that it exists.
Our use case for extract is mostly pulling out per-read modification calls for many regions across the genome, from sequencing runs with 30x+ read depth. One example is looking at CTCF ChIP-seq peaks, ~41k sites genome-wide. The modkit pileup operation for this is pretty quick, and my old code, which saved context-checked read modifications directly to vectors in an .h5 file, was fairly fast, but modkit extract takes a long time (I've been needing to subset to a smaller number of regions for testing, and even for only a thousand kb-scale regions extraction is on the order of an hour). It won't prevent us from using the tool, but it is interesting to me that modkit pileup is maybe 3x faster than my pysam/numpy version of similar functionality (built before modkit gained arbitrary motif capabilities, which we need), whereas modkit extract is perhaps actually slower than my version, even for a single pass. (I haven't benchmarked them head-to-head because I have already decided modkit is fast enough, and we don't want to be in the business of building a competing pileup/extract tool when this one is well built and well maintained.)
It doesn't help that I currently have modkit doing several passes. I want modkit to handle motif ambiguity checking for e.g. CG,0 vs GCH,1 so that I am not duplicating functionality. However, the modkit extract .txt output doesn't seem to offer a way to tell apart 5mC in a CpG context from 5mC in a GpCpG context without re-checking the 5mer context, which I am trying to avoid for simplicity. This generalizes to other base modification contexts we want to look at in the future; my prototype outputs each motif into a separate file, generating them one by one with modkit.
Fundamentally, what I want out of the end of my pipeline is an indexed file of reads by position (chrom/start/end), with vectors containing valid-context-for-modification and modification-calls-by-threshold, so I can quickly and easily make plots of wide areas of the genome or pass the vectors on for further processing. I can build this with modkit extract plus additional processing, but it isn't something extract is optimized for, I think.
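As a concrete sketch of the record I'm after (field names are from my old .h5 schema, not anything modkit emits, and the threshold is illustrative):

```python
# Hypothetical per-read record for the indexed output file; field names
# mirror my old .h5 schema ('read_name', 'valid_vector', 'mod_vector', ...).
from dataclasses import dataclass

@dataclass
class ReadRecord:
    read_name: str
    chromosome: str
    start_coordinate: int
    length: int
    valid_vector: list[int]  # 1 where the position is a valid motif context
    mod_vector: list[int]    # 1 where the mod call exceeds the threshold

def call_mods(mod_quals: list[float], valid: list[int],
              threshold: float = 0.5) -> list[int]:
    """Threshold per-position mod probabilities into binary calls,
    zeroing positions outside valid motif context."""
    return [1 if (v and q >= threshold) else 0
            for q, v in zip(mod_quals, valid)]
```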
Hello @OberonDixon,
As for your extract workflow: make sure you have v0.2.4 (or at least v0.2.3). There was a performance regression in v0.2.2 that made extract slow. In v0.2.4 I added some additional performance improvements for single-base motifs (and the next release will contain even more optimizations).
When running extract, I'm assuming you're using --include-bed to get reads aligned to the CTCF regions (correct?). I haven't tested with tens of thousands of regions, but I'll do that. You may try something like
$ samtools view -L <CTCF_BED> -bh <bam> | modkit extract - <extract_out>
in case the filtering in modkit is the slow part - but I'd be surprised if it is. In my experience the one potential bottleneck with extract is actually writing to storage. It's on the roadmap to output more efficient formats in general.
For multiple contexts, in pileup you can use multiple --motif arguments, so you can do something like
$ modkit pileup <bam> <bed> --motif CG 0 --motif GCG 1 --motif GCH 1
Your pileup output will have rows for each motif when they overlap and they'll be labeled accordingly. You can find an example in the repo.
If you give me a better idea of the schema of the table you're trying to make, maybe I can help a little more (no worries if this is more information than you're willing to share). One thing to potentially leverage is that pileup and extract output rows in order: pileup output is sorted by [contig, start, stop], and extract by [read_id, ref_pos], so your downstream step could consume the output stream directly to make your "read rows".
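A rough sketch of that streaming consumption, assuming the standard extract column names (read_id, ref_position, chrom, mod_qual):

```python
# Sketch: stream modkit extract TSV output and group consecutive rows into
# per-read records. Because extract output is sorted by [read_id, ref_pos],
# groupby never needs to buffer more than one read's rows at a time.
import csv
from itertools import groupby
from operator import itemgetter

def read_rows(stream):
    """Yield one dict per read from an open extract TSV stream."""
    reader = csv.DictReader(stream, delimiter="\t")
    for read_id, rows in groupby(reader, key=itemgetter("read_id")):
        rows = list(rows)
        yield {
            "read_id": read_id,
            "chrom": rows[0]["chrom"],
            "positions": [int(r["ref_position"]) for r in rows],
            "mod_quals": [float(r["mod_qual"]) for r in rows],
        }
```

You could point this at the extract output file (or a pipe) and build your vectors without ever materializing the full table in memory.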
I'll add a link to the conda package in the README.
Thanks for the clarifications. My latest testing has all been with v0.2.4.
I suspect the writing-to-disk limitation is a big factor in my extract speed issues; my old code writing an .h5 file with 'read_name', 'chromosome', 'start_coordinate', 'length', 'valid_vector', and 'mod_vector' was probably faster because of the output format. I may try the samtools filtering at some point though, to see if that makes a difference.
For motifs, the issue is that while modkit pileup input.bam regions.bed --motif XX Y --motif AA B works great, because I can tell the different modifications apart in the bedmethyl file, modkit extract input.bam regions.bed --motif XX Y --motif AA B seems to offer no way to tell which motif produced which row in the output .txt: there is only a field for the single canonical base and one for the 5mer context, nothing saying what the motif was.
Here is what I mean.
A pileup bedmethyl might look like this:
chr1 990004 990005 m,CG,0 2 + 990004 990005 255,0,0 2 50.00 1 1 0 0 0 0 0
chr1 990008 990009 m,GCH,1 2 + 990008 990009 255,0,0 2 0.00 0 2 0 0 0 0 0
chr1 990009 990010 a,A,0 2 + 990009 990010 255,0,0 2 0.00 0 2 0 0 0 0 0
While an extract table looks like this:
read_id forward_read_position ref_position chrom mod_strand ref_strand ref_mod_strand fw_soft_clipped_start fw_soft_clipped_end read_length mod_qual mod_code base_qual ref_kmer query_kmer canonical_base modified_primary_base inferred flag
...
0e0f6157-64fe-44a1-a8bb-da895718aea4 9748 989986 chr1 + + + 0 0 31128 0.041015625 m 255 GGCAT GGCAT C C false 0
0e0f6157-64fe-44a1-a8bb-da895718aea4 9770 990008 chr1 + + + 0 0 31128 0.45117188 m 255 TGCAT TGCAT C C false 0
0e0f6157-64fe-44a1-a8bb-da895718aea4 9790 990028 chr1 + + + 0 0 31128 0.056640625 m 255 AGCCG AGCCG C C false 0
The issue is that there is no column with e.g. m,CG,0 or m,GCH,1 to disambiguate motifs in the extract txt, as I can easily do with the pileup bedmethyl output (which I am currently using to make various plots, with all of my different motifs in a single file). If there were a motif string like m,CG,0 or m,GCH,1 after modified_primary_base, I could disambiguate in this case too.
Adding a check against the 5mer in my code is possible too, but then the txt output isn't really human-readable anymore, and I'd be duplicating functionality that I'm trying to offload to modkit. I want it to be easy for my users to look at the raw data, and when distinguishing modifications that differ by motif context, it is easier if there is a dedicated column, as with the bedmethyl, or separate files, as in my current extract pipeline.
Hope this helps clarify!
@OberonDixon
Ah, I see what you mean; the motif isn't a field in extract, sorry I missed that before. The motif is technically in the ref_kmer field. We could add a column with the motif(s) at a reference position; let me think about it. You could use modkit motif-bed and join the extract tables on reference position, although it might be easier to just run a regex on each output kmer.
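A quick sketch of the regex approach, assuming forward-strand 5mers with the modified base at the center (minus-strand rows would need reverse-complement handling first):

```python
# Sketch: classify each extract row's 5mer context by motif. A motif whose
# modified position is at offset k must start at index (2 - k) of the 5mer,
# since the modified base sits at the kmer's center (index 2). The motif
# specs mirror --motif CG 0 / --motif GCG 1 / --motif GCH 1.
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "H": "[ACT]", "N": "[ACGT]"}

def motif_regex(motif: str, offset: int) -> re.Pattern:
    """Anchor the motif so its modified position lands at the kmer center."""
    pattern = "".join(IUPAC[base] for base in motif)
    start = 2 - offset  # number of wildcard bases before the motif begins
    return re.compile("^" + "." * start + pattern)

MOTIFS = [("CG", 0), ("GCG", 1), ("GCH", 1)]

def classify_kmer(kmer: str) -> list[str]:
    """Return every motif label whose pattern matches this 5mer context."""
    return [m for (m, off) in MOTIFS if motif_regex(m, off).match(kmer)]
```

Note that overlapping motifs (e.g. GCG sites, which satisfy both CG,0 and GCG,1) will return multiple labels, which is the same ambiguity the pileup rows surface explicitly.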
Also, you may be interested in using the --read-calls option. You can pass the string "null" and discard the normal extract output (if you like). The read calls output can be smaller if you have 5hmC and 5mC.