Median modification probabilities with sample-probs
Hi,
I'm using modkit 0.5.0 on ONT synthetic oligos dataset s3://ont-open-data/modbase-validation_2024.10/ I basecalled their 6mA pod5 from modbase-validation_2024.10/subset folder with [email protected]_6mA@v3 (--modified-bases-threshold 0), then I run:
modkit sample-probs \
-p 0.5 \
--no-sampling \
--only-mapped \
--suppress-progress \
--region "$contig" \
"$sorted_bam" 2>/dev/null | \
# Get just the first line and extract the median (last column)
grep '50' | tr -s ' ' | \
# Use tab as separator and extract the last field
awk -v contig="$contig" '
NF > 0 {
median = $NF
print contig "\t" median
}' >> "$temp_results"
I want to get median modification probabilities per contig. I expect that this code will output median modification probabilities across all As for 6mA model, and as there are only ~25% of As modified in this data, it should not be too high, but I get smth like:
5mers_rand_ref_adapter_01 0.9746094
5mers_rand_ref_adapter_02 0.9824219
5mers_rand_ref_adapter_03 0.9746094
5mers_rand_ref_adapter_04 0.9667969
5mers_rand_ref_adapter_05 0.9863281
5mers_rand_ref_adapter_06 0.9824219
5mers_rand_ref_adapter_07 0.9824219
5mers_rand_ref_adapter_08 0.9746094
5mers_rand_ref_adapter_09 0.9863281
I guess I don't understand smth in modkit sample-probs, so I would appreciate any help or clarification.
Thank you, shaperones
Hello @shaperones
It is a common misconception that the probabilities from sample-probs are the probability of modification. It's actually the probability of each modification call. For each A in a read the base modification model predicts the probability of methylation, say for example 0.05. Then the probability of the base not being modified is 1 - 0.05 = 0.95. When sample-probs runs and encounters this position it will take max(p_unmodified, p_modified) or the 0.95 value.
Oh now I get it, thank you! As far as I understand, there is no common way to get the median probability of modification from modkit? I could imagine calculating it from modkit extract full, as it outputs mod_qual - which is probability of modification, right?
Thank you, shaperones
Hello @shaperones,
You could use modkit extract full and calculate the median of the mod_qual column (optionally filtering on the mod_code column). If you're using a A/6mA model you could interpret this number as the median predicted probability of 6mA per site, per read.