assess the precision of the 4mC ratio

Open hannan666666 opened this issue 11 months ago • 2 comments

Hello, I am working on quantifying the fraction of 4mC in mouse samples, but I have run into a problem. According to published papers, 4mC is very rare in mammals. Could you provide some guidance on how I can assess the precision of the 4mC fraction reported by modkit? Additionally, do you have any strategies to improve its precision, such as setting a higher threshold for the analysis? Thank you very much!

bases             C
total_reads_used  10042
count_reads_C     10042
pass_threshold_C  0.640625

base  code   pass_count  pass_frac  all_count  all_frac
C     -      33096024    0.9303225  35700393   0.905164
C     m      1632598     0.045892   2118287    0.053708013
C     21839  846164      0.0237855  1622119    0.041127943

hannan666666, Jan 15 '25 11:01

Hello @hannan666666,

We recommend testing base modification models on synthetic strands. We've recently published a blog post describing how we derive the model performance metrics. Unfortunately, the 4mC validation data hasn't been released publicly yet.

I ran a test on the validation data I have, using the latest models ([email protected]_4mC_5mC@v3) and attached the pass confusion matrix from modkit validate.

> Call probability threshold: 0.6836
> Percent of modified base calls removed: 9.98%
> Filtered accuracy: 96.85%
> Filtered modified base calls contingency table
                  Called Base
         ┌───────┬────────┬────────┬────────┐
         │       │ C      │ 21839  │ m      │
         ├───────┼────────┼────────┼────────┤
 Ground  │ C     │ 97.83% │  1.75% │  0.42% │
 Truth   │ 21839 │  1.10% │ 98.78% │  0.12% │
         │ m     │  0.45% │  0.02% │ 99.52% │
         └───────┴────────┴────────┴────────┘
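As a rough illustration of how to read the table (a sketch added for clarity, not something modkit computes), the rows can be used to estimate which called fractions to expect from a sample of known composition. The composition below is hypothetical, the function name is made up for this example, and the calculation assumes that the error rates measured on the validation strands carry over to a real sample, which may not hold.

```python
# Rows are ground truth, columns are the called base.
# Values are copied from the filtered contingency table above (21839 = 4mC, m = 5mC).
CONFUSION = {
    "C":     {"C": 0.9783, "21839": 0.0175, "m": 0.0042},
    "21839": {"C": 0.0110, "21839": 0.9878, "m": 0.0012},
    "m":     {"C": 0.0045, "21839": 0.0002, "m": 0.9952},
}


def expected_called_fractions(true_fractions):
    """Mix the confusion-matrix rows according to the true composition."""
    called = {code: 0.0 for code in CONFUSION}
    for truth, frac in true_fractions.items():
        for code, rate in CONFUSION[truth].items():
            called[code] += frac * rate
    return called


# Hypothetical sample with no true 4mC at all (an assumption for illustration).
hypothetical = {"C": 0.95, "21839": 0.0, "m": 0.05}
called = expected_called_fractions(hypothetical)

for code, frac in called.items():
    print(f"{code}\t{frac:.6f}")
```

Under these assumptions, a sample containing no 4mC would still show a nonzero called 4mC fraction, driven almost entirely by the 1.75% of canonical C called as 4mC. That gives a baseline against which an observed pass_frac can be compared.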

The threshold value I'm getting (0.6836) isn't much higher than what you're getting (0.640625). There will always be a trade-off between increasing the --filter-threshold and the sensitivity of the model. What I would do is look at the output from modkit sample-probs and pick a threshold value for 4mC that corresponds to roughly the 15th to 20th percentile.
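To make the percentile idea concrete, here is a minimal sketch of picking a cutoff from a list of call probabilities. The probabilities are invented for illustration and `percentile_threshold` is a hypothetical helper, not part of modkit; in practice the values would be read off the modkit sample-probs output.

```python
import math


def percentile_threshold(probs, pct):
    """Return the nearest-rank pct-th percentile of the call probabilities."""
    if not probs:
        raise ValueError("no probabilities supplied")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(probs)
    # 1-based nearest rank: smallest rank covering at least pct percent of calls.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical 4mC call probabilities (made up for this example).
probs_4mc = [0.52, 0.55, 0.61, 0.66, 0.70, 0.74, 0.81, 0.88, 0.93, 0.97]

for pct in (15, 20, 30):
    print(f"{pct}th percentile threshold: {percentile_threshold(probs_4mc, pct)}")
```

Calls with a probability below the chosen value would be filtered out, so a higher percentile removes more low-confidence 4mC calls at the cost of sensitivity.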

ArtRand, Jan 16 '25 00:01

Thank you very much for your kind and informative reply! If possible, could you share the species and the 4mC fraction of your validation sample? My sample is from a mouse, and the 4mC fraction I observed is 0.041127943. Based on your experience, do you think this value is unusually high for mammals? I would greatly appreciate any insights you could provide.

Thank you again for your time and support!

hannan666666, Jan 16 '25 02:01