Perform mean of probability for each read position with modkit extract
Hi all, I would like to know if it is possible to write some bash code after performing a modkit extract, to filter based on ref_position and chrom. For example if I have a base that have the same chromosome and ref_position multiple time, let's say contig_1 for chrom associated with ref position 2 appear 4 times, how can I do a new table which will not contain the four line but only one with the mean average of the call_prob ? Also if the call code is different it should not do the mean so it should also verify this criteria. I'm not interested in the other column of the table as I only want to represent the call_prob vs ref_position. I am kinda new to bioinformatic so any help would be appreciated ! Thanks in advance
Hello @OceaneMion,
I think what you want is to use modkit pileup, find the documentation online. If you need to measure the correspondence between two reference positions this table is what you want.
Maybe I'm not quite understanding what you need, could you provide a concrete example?
Thank you for your awnser, yes I think pileup is more suitable indeed. But I don't exactly understand how pileup works ? This is an example of the first lines of my output :
contig_1 2 3 h 411 . 2 3 255,0,0 411 12.90 53 16 342 0 0 0 3 contig_1 2 3 m 411 . 2 3 255,0,0 411 83.21 342 16 53 0 0 0 3 contig_1 4 5 h 433 . 4 5 255,0,0 433 13.16 57 14 362 1 0 5 11 contig_1 4 5 m 433 . 4 5 255,0,0 433 83.60 362 14 57 1 0 5 11
Is the column 11 : fraction modified, the percentage to have a methylation mark at this position ? Also I don't really understand the link with the mod_qual from modkit extract. I know that modkit extract will give me the probability of a base at a specific position in a read, but sometime I have reads that have different id but same ref position and chromosome so I would like to perform a mean on those.
I may have another question, is the bedgraph 4th column giving me the probability of the methylation at a specific position it look like this : contig_1 2 3 0.8321168 411 contig_1 4 5 0.83602774 433 contig_1 47 48 0.5102041 441 contig_1 94 95 0.75 132 And on what is based this probability ? Does it uses all the reads that had the same positions for a specific base to calculate it ?
Hello @OceaneMion,
During pileup, each read's base modification probability (mod_qual) is converted into a base modification call, i.e. a decision as to the state of the base as modified or canonical. Those calls are then aggregated at each genome position and grouped by the modification code. That's why you'll have a row for h and m, each has the percentage of reads calling this code. The details on the columns are in the documentation. The bedgraph output has the following schema:
| column | name | description | type |
|---|---|---|---|
| 1 | chrom | name of reference sequence from BAM header | str |
| 2 | start position | 0-based start position | int |
| 3 | end position | 0-based exclusive end position | int |
| 4 | fraction modified | Nmod / Nvalid_cov | float |
| 5 | Nvalid_cov | the valid coverage. Nvalid_cov = Nmod + Nother_mod + Ncanonical | int |
These values are the same as the corresponding ones in the bedMethyl output.
There are details and examples on how the pass thresholds effect the base modification calls in the documentation as well. I believe what you want is the fraction_modified value.
Thanks a lot ! So if I want to plot the probability distribution of each modified bases at each genomic position, I will need to use the fraction_modified values right ? or do I need something else?
Hello @OceaneMion,
Basically, yes. If you have a model that predicts more than one modification at a base (e.g. 5hmC/5mC at cytosine bases), you'll have a categorical distribution where the empirical $p_{\text{mod}}$ is equal to the fraction of modification for that mod and $p_{\text{canonical}} = 1 - \sum_{\text{mod}}p_{\text{mod}}$. So taking your previous example:
contig_1 2 3 h 411 . 2 3 255,0,0 411 12.90 53 16 342 0 0 0 3
contig_1 2 3 m 411 . 2 3 255,0,0 411 83.21 342 16 53 0 0 0 3`
You'd have:
$$ p_{\text{h}} = 0.129 $$
$$ p_{\text{m}} = 0.8321 $$
$$ p_{\text{canonical}} = 1 - (0.8321 + 0.129) = 0.0389 $$
These probabilities define the categorical distribution at that site. There are, of course, a multitude of fancier things you can do, but this is a good place to start.
Hello, As far as I understand, despite the name of this issue here you are discussing how to retreive the frequency of a modification rather than its probability, right? I would like to extract for each modification the average modification probability in a certain position. i.e. out of 100 reads, 25 are modified with 80% probability, 25 with 90% probability (the probability threshold for modkit being 70%) --> I would have in the bedmethyl output 50% frequency and I would like to output also the mean modification probability (85%). This is something that for example I can visualise in IGV but I could not find a command for modkit to have it written in the ouput (it rather uses the probability to assign a base as modified or not and from that proportion it only calculates the frequency).
I am talking about RNA modifications here, and about probability at per read level. Any help would be greatly appreciated.
Thanks!
Hello @soniacruciani,
Currently to get the information you want you'll need to use modkit extract calls then group by the reference position and modification code and aggregate. Something like this:
import polars as pl
X_extract = pl.read_csv("extract_calls.tsv", separator="\t")
X_prob_agg = X.group_by(["chrom", "ref_position", "call_code"]).agg([
pl.col("call_prob").mean().alias("mean_prob")
])
This will probably be somewhat inefficient unless you break the reference into segments and extract the calls from each segment. What is your use case for these probabilities? Maybe I can add the column to pileup where the data is easier to access.
Hi, Thanks for the quick response! The idea is to have some "confidence score" associated to a site besides the predicted frequency (i.e. I have observed high stoichiometry sites with overall lower probability and viceversa), so keeping the info in the pileup output table would allow to keep this information in mind when analysing the results. It would be great if you could add the column! I guess modkit calculates the info anyway during the process for decision making so it would just have to write it. Not sure if it calculates based on mean or median probability, possibly the latter would be more informative. Thanks,
Sonia
Hello @soniacruciani,
That is an interesting idea, I'll give it some thought. Do you have some examples of these two cases? My intuition would be that the modified base model is less condifent in some sequence contests and/or when there are neighboring modifications. The pass-threshold is meant to take care of this, but I could see where there is some nuance. Feel free to email me art.rand [at] nanoporetech.com
hello @ArtRand , My case was hypothetical but I've seen it often in IGV. It would be helpful to have the probability measure in order to evaluate the "confidence" of the predictions, as for RNA mods the low frequency of a modification is possible and does not imply lower confidence of the presence of that mod. If you can add an option to output the average likelihood cumulating the likelihood of each read in the pileup output I think it would be the easiest to play with.
Hi all,
I would also be interested in extracting the (mean or median) modification probability for each position in addition to the already reported modification percentage, when running modkit pileup. Currently there doesn't seem to be a flag for that, although it is likely computed internally by modkit.
Thanks,
Gregor
Hello @GregorD96,
You're correct, there isn't a way to get this information from pileup right now. I'm working on adding some additional data to the pileup output and considering a few options. I'll keep this thread updated.
One thing you may want to keep an eye on is the number of failed modification calls at a position.
I have the same question. Could the output of sample-probs help with this in some way? For example, if we have the percentile ranks of each of the probability scores at a particular position? Also, for some reason, I have not yet figured out how to restrict the output of modkit commands (pileup, sample-probs) to methylated A's. An answer to this will be also be very helpful, thanks!
Hello @psidhwani,
I have the same question. Could the output of sample-probs help with this in some way? For example, if we have the percentile ranks of each of the probability scores at a particular position?
I have to give this some thought, as of right now I don't have a solution other than to use the sample-probs output to manually set threshold values per modification.
Also, for some reason, I have not yet figured out how to restrict the output of modkit commands (pileup, sample-probs) to methylated A's. An answer to this will be also be very helpful, thanks!
For pileup you can use --motif A 0. The sample-probs command doesn't have a motif option, but all of the modifications are output individually. You could remove the modifications with adjust and --motif, but honestly I'm not sure it's worth making the extra files when you could just remove/ignore the outputs for non-A mods.