tombo icon indicating copy to clipboard operation
tombo copied to clipboard

Details regarding modified fraction calculation from per-read statistic files

Open drivenbyentropy opened this issue 3 years ago • 3 comments

Hi,

I am trying to reproduce the calculation of the (dampened) modified fraction provided by tombo using the per-read statistic files produced by tombo detect_modifications model_sample_compare for a subset (clusters) of the input data and I was hoping you could let me know if I am doing this correctly.

Based on the documentation and reading through several issues, it is my understanding that for each genomic position, this fraction is calculated based on the number of reads containing a p-value smaller than the single-read-threshold lower limit (0.15 by default for DNA), divided by the number of reads with a p-value above the upper limit (0.5 by default for DNA). Is this correct or are there any additional statistics applied to calculate the final fraction value?

Thanks!

drivenbyentropy avatar Jun 04 '21 16:06 drivenbyentropy

Hi @drivenbyentropy . I am not an ONT employee but I'll do my best to answer. I think the quantity you have described is the "fraction of modified reads" which is not the same as the "dampened fraction of modified reads".

Let d stand for the dampened fraction of modified reads, let f stand for the fraction of modified reads, and let n stand for the total number of reads (all at the same position). Then d ≈ fn/(n+2). This equation is the same for DNA and RNA samples. The equality is only approximate because some rounding is performed (more than just regular floating-point errors).

SycamoreLeaf avatar Jun 04 '21 17:06 SycamoreLeaf

Thank you @Chris-Kimmel for this.

You are correct, I did not apply the pseudo-counts in my initial description. Based on your formula, n corresponds to the total number of reads covering a particular position and not only those above the 0.5 threshold. If this is the case, do you have a sense for what the upper threshold is used for?

drivenbyentropy avatar Jun 04 '21 17:06 drivenbyentropy

Based on your formula, n corresponds to the total number of reads covering a particular position and not only those above the 0.5 threshold.

Yep.

If this is the case, do you have a sense for what the upper threshold is used for?

You have a distribution of p-values; you want a statistic on this distribution that you can use as an estimator of the true fraction of reads modified. I suspect the statistic you're looking at just had a better bias than naive statistics like "divide the number of model_sample_compare p-values less than 0.01 by the total number of reads." Dr. Stoiber would know.

You may be interested in this issue, which gives an idea how Tombo plotting commands can inform choices of the lower and upper thresholds. They're using log-likelihood ratios from tombo detect_modifications alternative_model instead of p-values from tombo detect_modifications model_sample_compare, but you get the gist.

SycamoreLeaf avatar Jun 09 '21 12:06 SycamoreLeaf