modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Determining Ml score threshold for quantitative comparison of methylation levels across two datasets at the same loci

Open ngamarra opened this issue 10 months ago • 2 comments

Hello!

I am attempting to compare dna adenine methylation across two datasets at the same loci from r10.4 data we have generated in two experimental conditions. We expect the methylation levels to differ substantially between the two datasets, but we want to determine a decently accurate quantitative estimate of the difference. In our analysis we apply an automatic threshold to determine "true" methylation calls.

I have been told that determining the optimal threshold is not trivial and is highly sensitive to sequencing run quality. I have been recommended to use modkit's auto threshold function. However I am worried that this thresholding may be sensitive to the total signal in the dataset and I would be worried that it would introduce distortions in comparisons across the dataset. I guess we are wondering if it would be more appropriate to threshold data using a fixed threshold or a data-informed threshold (and specifically modkits function) especially if we expect big differences between datasets?

ngamarra avatar Feb 10 '25 05:02 ngamarra

Hello @ngamarra,

Do you expect that one of the samples will have very few methylated adenine residues?

What you could do is determine the values on a sampling of both conditions combined. That way I expect you would see enough true 6mA sites to estimate either a single threshold value automatically or manually with modkit sample-probs. Sampling the reads can be somewhat non-trivial, however. I think that samtools view does a pretty good job, you could sample from both conditions, merge, and pipe through modkit summary and/or modkit sample-probs. Happy to advise if you can share some plots.

ArtRand avatar Feb 11 '25 16:02 ArtRand

Thanks @ArtRand !

Estimating on a merged sample of both conditions makes sense in terms of finding something that could work well across both samples. I just wondered though if a constant threshold could distort things in the event that one condition has very few methylated adenines. We don't know for sure how low the signal will be in an absolute sense, but we expect it to be quite low. I expected that the threshold might be determined by the population distribution of mL scores in the bam file by modkit, in which case I feel like the noise could dominate the choice for the low mod adenine condition. Testing on a merged sample would seemingly help overcome this issue.

ngamarra avatar Feb 12 '25 01:02 ngamarra