Modifications per read summary
Hi, first, great tool!
Is there some way to get a summary per read how many modifications are present?
it could be done with modkit extract full and then sum up the modifications per read, but is there a faster way?
the expected output would look something like this:
read_id,number_m,number_a,number_17802,read_length
read1,0,8,100,1500
read2,50,0,10,1200
thanks in advance
Hello @loipf,
No there isn't a way to do that with Modkit currently, but I made you a little program called reads_mods_summary that does. Here's how you use it:
rs=/path/to/reads_mods_summary
modkit extract full ${bam} - --ignore-index | ${rs} > output.csv
You can build it with cargo:
cd reads_mods_summary
cargo build --release
There is a linux executable in the zip also, but it may or may not work on your system, I'd recommend building with Cargo if you can. This program shows how to use mod_kit as a library (crate) but the crate hasn't been optimized for this usage yet so the compile time is longer than necessary. Ideally, I'd have some of the "non-Core" functionality behind feature flags. Soon!
The help looks like this:
Usage: reads_mods_summary [OPTIONS]
Options:
--filter-threshold <FILTER_THRESHOLD>
Specify the filter threshold globally or per-base. Global filter threshold can be specified
with by a decimal number (e.g. 0.75). Per-base thresholds can be specified by colon-separated
values, for example C:0.75 specifies a threshold value of 0.75 for cytosine modification calls.
Additional per-base thresholds can be specified by repeating the option: for example
--filter-threshold C:0.75 --filter-threshold A:0.70 or specify a single base option and a
default for all other bases with: --filter-threshold A:0.70 --filter-threshold 0.9 will specify
a threshold value of 0.70 for adenine and 0.9 for all other base modification calls
--mod-thresholds <MOD_THRESHOLDS>
Specify a passing threshold to use for a base modification, independent of the threshold for
the primary sequence base or the default. For example, to set the pass threshold for 5hmC to
0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as usual and used
for canonical cytosine and other modifications unless the `--filter-threshold` option is also
passed
--head <HEAD_RECORD_SIZE>
Comsume this many records to determine the set of modification codes in the input. If a new
modification code arises after this many records, it will not be used
[default: 5]
-h, --help
Print help (see a summary with '-h')
Basically, stream the output of modkit extract full into this program and it'll give you the CSV you're looking for. You can use the --filter-threshold and --mod-thresholds the same way as in other Modkit commands - but it doesn't have the automatic estimation functionality in there. If you want to get the thresholds via the estimation algorithm, use modkit sample-probs first.
If others find this useful, feel free to use the code for now. Ping this thread if you'd like to see it integrated as a proper Modkit command.
thank you so much for the quick response and solution! using the provided Linux executable, everything is working perfectly. Really appreciate it
Thanks for building this program! I'm trying to run reads_mods_summary via the piping method and linux executable, butit doesn't seem to work. Is there a way I can run extract first and then take that tsv file manually into the reads_mods_summary program? Thanks!