modkit summary (number of reads containing modifications)
Hi,
i wanted to get the number (fraction) of reads in my samples that contain at least one methylated C.
For that I tried using modkit summary on a sample where I don't expect any reads carrying methylation at any position:
modkit summary --only-mapped -t 16 --no-sampling --mod-thresholds m:0.85 <bam>
The output was this:
# bases C
# total_reads_used 11985181
# count_reads_C 11985181
# pass_threshold_C 0.74609375
base code pass_count pass_frac all_count all_frac
C - 38837294 0.99435806 42415950 0.97769004
C h 137290 0.0035150598 626020 0.014429796
C m 83070 0.0021268558 341873 0.007880191
As expected the last line confirms my expectation (number of methylated bases), but the line "count_reads_C" confuses me.
There I was expecting to find the number of reads carrying at least one methylation but it just shows all the reads.
The documentation says: 3+ count_reads_{base} total number of reads that contained base modifications for {base} int
Is there a bug in modkit summary or is it just a false description in the doc?
What would be the most efficient way to get the number/fraction of reads in a BAM that carry at least one methylation?
Thanks in advance, Paul
Hello @pkerbs,
A couple things.
For that I tried using modkit summary on a sample where I don't expect any reads carrying methylation at any position:
All of the modified base models have a small, but non-zero, false positive rate, so with long enough reads you'll nearly always get a false positive modification call. I think you already know this, but just stating the obvious.
As expected the last line confirms my expectation (number of methylated bases), but the line "count_reads_C" confuses me. There I was expecting to find the number of reads carrying at least one methylation but it just shows all the reads.
I should revise the documentation to say:
total number of reads that contained base modification _probabilities_ for {base} int
So in other words, this is the count of reads that have base modification information of any kind for this base.
For your use case, try the program I've made for this issue. The output gives you the number of base modifications per read which can easily be filtered. If this ends up being used more broadly, I'll incorporate it into Modkit.
Hi Arthur, thank you for the clarification.
I figured I could maybe achieve this by doing:
modkit extract calls \
--filter-threshold 0.85 \
--mod-thresholds m:0.85 \
--threads 16 \
--mapped-only \
--ignore-index \
<bam> \
- | awk '$14=="m" {print}' | cut -f1 | sort | uniq | wc -l
--> result: 288261 reads
But I also tried your reads_mods_summary tool like this:
#!/bin/bash
eval "$(conda shell.bash hook)"
# Set path to binary
reads_mod_summary_bin=<path to reads_mods_summary bin>
# Define Input/Output
bam=${1}
outdir=$(dirname $bam)
sample_id=$(basename "$bam" .${bam##*.})
# Activate Modkit environment (modkit v0.5.0)
conda activate modkit_env
# Execute
modkit extract full \
--ignore-index \
--mapped-only \
--threads 16 \
${bam} \
- | ${reads_mod_summary_bin} \
--filter-threshold 0.85 \
--mod-thresholds m:0.85 \
> ${outdir}/${sample_id}.modsummary.csv
But unfortunately the output file did not contain the column "number_m".
I then removed the --mapped-only parameter from the modkit extract full command in the aforementioned script. After re-running the script, the column "number_m" was present in the output.
Counting all reads with at least one modification in the resulting output file (awk -F "," '$3 > 0 {print}' <modsummary.csv> | wc -l`), resulted to: 245919 reads
- Shouldn't be the number from the latter approach be at least as high as from the first approach, since I am also including unmapped bases in the latter approach? (if my first approach is correct)
- Why is the column "number_m" missing when using
--mapped-onlywithmodkit extract fullandreads_mods_summary?
Sorry, for bugging you with such minor functionality stuff :)
Best, Paul
Hello @pkerbs,
No problem at all. More than happy to help.
This command:
modkit extract calls \
--filter-threshold 0.85 \
--mod-thresholds m:0.85 \
--threads 16 \
--mapped-only \
--ignore-index \
<bam> \
- | awk '$14=="m" {print}' | cut -f1 | sort | uniq | wc -l
Will count the reads that have a 5mC call and will include calls failing the threshold requirement. The modkit extract calls output column 20 indicates whether or not the call has a probability >= the pass threshold specified for that modification state. So you need to filter on that column as well. Otherwise I think you command should work so long as sort doesn't explode in memory.
But unfortunately the output file did not contain the column "number_m".
What output to you get? I imagine what is happening is all of the calls are failing to pass the threshold values. Do you get a header that looks like this?
read_id,number_unmodified,number_fail,read_length
If you only care about reads with mapped (i.e. aligned) cytosines, you should keep the --mapped-only flag. Maybe consider lowering the thresholds or exploring a threshold to pick with modkit sample-probs --hist --out-dir docs here. If you keep using the reads_mods_summary tool increase the --head parameter to something like 5000. That tool isn't really "production grade", but it's sounding like the functionality should be put into Modkit properly.
Hi Arthur,
The modkit extract calls output column 20 indicates whether or not the call has a probability >= the pass threshold specified for that modification state.
I see, so the --filter-threshold param does not exclude lower prob calls from the output, but rather marks them as failed. Got it.
So I filtered now like this awk '$14=="m" && $20=="false" {print $1}' | sort | uniq | wc -l which should give me the number of reads having at least one Methylation that passed the threshold (>=0.85).
Result was 68130
What output to you get? I imagine what is happening is all of the calls are failing to pass the threshold values. Do you get a header that looks like this?
Yes, the header looks like in your example.
The column number_m only appears in the output when I remove --filter-threshold from the reads_mods_summary command or --mapped-only from the modkit extract full command.
Might this be a bug in reads_mods_summary?
Just to confirm that I have 5mC passing the threshold in the output generated by modkit extract full, I did this:
modkit extract full \
--ignore-index
--mapped-only\
--threads 32 \
<bam> \
- | awk '$14=="m" && $13>=0.85 {print $1}' | sort | uniq | wc -l
Result was 68130
Best, Paul
Hello @pkerbs,
So I filtered now like this awk '$14=="m" && $20=="false" {print $1}' | sort | uniq | wc -l which should give me the number of reads having at least one Methylation that passed the threshold (>=0.85).
Correct.
Yes, the header looks like in your example. The column number_m only appears in the output when I remove --filter-threshold from the reads_mods_summary command or --mapped-only from the modkit extract full command. Might this be a bug in reads_mods_summary?
It's likely a bug. I'll get a proper implementation in the next Modkit release. The logic in your approach seems fine to me, if it's working - stick with it. I'll probably use the awk command to test the Modkit implementation.
Thank you again for your support. In the context of one of our projects, it might also be of interest for others to have the functionality of read-filtering from a BAM file, separating reads with and without modifications.
Best, Paul