modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Median modification probabilities with sample-probs

Open shaperones opened this issue 3 months ago • 3 comments

Hi,

I'm using modkit 0.5.0 on ONT synthetic oligos dataset s3://ont-open-data/modbase-validation_2024.10/ I basecalled their 6mA pod5 from modbase-validation_2024.10/subset folder with [email protected]_6mA@v3 (--modified-bases-threshold 0), then I run:

modkit sample-probs \
			-p 0.5 \
			--no-sampling \
			--only-mapped \
			--suppress-progress \
           --region "$contig" \
           "$sorted_bam" 2>/dev/null | \
# Get just the first line and extract the median (last column)
       grep '50' | tr -s ' ' | \
       # Use tab as separator and extract the last field
       awk -v contig="$contig" '
       NF > 0 {
           median = $NF
           print contig "\t" median
       }' >> "$temp_results"

I want to get median modification probabilities per contig. I expect that this code will output median modification probabilities across all As for 6mA model, and as there are only ~25% of As modified in this data, it should not be too high, but I get smth like:

5mers_rand_ref_adapter_01	0.9746094
5mers_rand_ref_adapter_02	0.9824219
5mers_rand_ref_adapter_03	0.9746094
5mers_rand_ref_adapter_04	0.9667969
5mers_rand_ref_adapter_05	0.9863281
5mers_rand_ref_adapter_06	0.9824219
5mers_rand_ref_adapter_07	0.9824219
5mers_rand_ref_adapter_08	0.9746094
5mers_rand_ref_adapter_09	0.9863281

I guess I don't understand smth in modkit sample-probs, so I would appreciate any help or clarification.

Thank you, shaperones

shaperones avatar Sep 23 '25 14:09 shaperones

Hello @shaperones

It is a common misconception that the probabilities from sample-probs are the probability of modification. It's actually the probability of each modification call. For each A in a read the base modification model predicts the probability of methylation, say for example 0.05. Then the probability of the base not being modified is 1 - 0.05 = 0.95. When sample-probs runs and encounters this position it will take max(p_unmodified, p_modified) or the 0.95 value.

ArtRand avatar Sep 24 '25 23:09 ArtRand

Oh now I get it, thank you! As far as I understand, there is no common way to get the median probability of modification from modkit? I could imagine calculating it from modkit extract full, as it outputs mod_qual - which is probability of modification, right?

Thank you, shaperones

shaperones avatar Sep 25 '25 05:09 shaperones

Hello @shaperones,

You could use modkit extract full and calculate the median of the mod_qual column (optionally filtering on the mod_code column). If you're using a A/6mA model you could interpret this number as the median predicted probability of 6mA per site, per read.

ArtRand avatar Sep 26 '25 00:09 ArtRand