modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Modifications per read summary

Open loipf opened this issue 8 months ago • 3 comments

Hi, first, great tool!

Is there some way to get a summary per read how many modifications are present? it could be done with modkit extract full and then sum up the modifications per read, but is there a faster way?

the expected output would look something like this:

read_id,number_m,number_a,number_17802,read_length
read1,0,8,100,1500
read2,50,0,10,1200

thanks in advance

loipf avatar Apr 10 '25 13:04 loipf

Hello @loipf,

No there isn't a way to do that with Modkit currently, but I made you a little program called reads_mods_summary that does. Here's how you use it:

rs=/path/to/reads_mods_summary
modkit extract full ${bam} - --ignore-index | ${rs} > output.csv

You can build it with cargo:

cd reads_mods_summary
cargo build --release

There is a linux executable in the zip also, but it may or may not work on your system, I'd recommend building with Cargo if you can. This program shows how to use mod_kit as a library (crate) but the crate hasn't been optimized for this usage yet so the compile time is longer than necessary. Ideally, I'd have some of the "non-Core" functionality behind feature flags. Soon!

The help looks like this:

Usage: reads_mods_summary [OPTIONS]

Options:
      --filter-threshold <FILTER_THRESHOLD>
          Specify the filter threshold globally or per-base. Global filter threshold can be specified
          with by a decimal number (e.g. 0.75). Per-base thresholds can be specified by colon-separated
          values, for example C:0.75 specifies a threshold value of 0.75 for cytosine modification calls.
          Additional per-base thresholds can be specified by repeating the option: for example
          --filter-threshold C:0.75 --filter-threshold A:0.70 or specify a single base option and a
          default for all other bases with: --filter-threshold A:0.70 --filter-threshold 0.9 will specify
          a threshold value of 0.70 for adenine and 0.9 for all other base modification calls

      --mod-thresholds <MOD_THRESHOLDS>
          Specify a passing threshold to use for a base modification, independent of the threshold for
          the primary sequence base or the default. For example, to set the pass threshold for 5hmC to
          0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as usual and used
          for canonical cytosine and other modifications unless the `--filter-threshold` option is also
          passed

      --head <HEAD_RECORD_SIZE>
          Comsume this many records to determine the set of modification codes in the input. If a new
          modification code arises after this many records, it will not be used

          [default: 5]

  -h, --help
          Print help (see a summary with '-h')

Basically, stream the output of modkit extract full into this program and it'll give you the CSV you're looking for. You can use the --filter-threshold and --mod-thresholds the same way as in other Modkit commands - but it doesn't have the automatic estimation functionality in there. If you want to get the thresholds via the estimation algorithm, use modkit sample-probs first.

If others find this useful, feel free to use the code for now. Ping this thread if you'd like to see it integrated as a proper Modkit command.

reads_mods_summary.zip

ArtRand avatar Apr 11 '25 17:04 ArtRand

thank you so much for the quick response and solution! using the provided Linux executable, everything is working perfectly. Really appreciate it

loipf avatar Apr 14 '25 12:04 loipf

Thanks for building this program! I'm trying to run reads_mods_summary via the piping method and linux executable, butit doesn't seem to work. Is there a way I can run extract first and then take that tsv file manually into the reads_mods_summary program? Thanks!

esh97 avatar Sep 01 '25 15:09 esh97