modkit pileup filter-threshold results in different calls
Hi,
As I understand it modkit pileup defaults to calculating a threshold for calling methylation based on the input data.
In some circumstances this can lead to different methylation calls for the same read in a dataset. Even relatively small shifts in the filter-threshold can change methylation calls.
Is there a sensbile default that can be used in the general case?
If looking for reads with specific methylation patterns and they are fluctuating over the course of an analysis depending on data elsewhere in the genome it quickly becomes very complex!
Thanks.
Hello @mattloose,
In some circumstances this can lead to different methylation calls for the same read in a dataset.
I would expect that when the pass threshold varies slightly some reads' calls are filtered out or included. I would not expect that the calls actually change, is that correct?
Is there a sensible default that can be used in the general case?
To get the best performance, we recommend estimating the threshold from the data. However, this is an estimate of the value because modkit samples a subset of the reads when calculating the value. There are a couple things you can do to help your analysis:
-
You could set
--sampling-frac 1.0when runningpileup. This will use all of the reads in the modBAM to estimate the threshold. However, if you have a large number of reads this is likely not practical. In my experience after you use 100K reads, the values don't change (set this value with--num-reads). -
Use
modkit sample-probs <modBAM> [--no-sampling] [--hist] [-o]to estimate (or calculate with--no-sampling) the value one time and use for that dataset, instead of re-estimating it every time. -
Use
--sample-regionwithpileupor--regionwithsample-probsto specify a region with good coverage and "well behaving" reads. By using the same region, I would expect that the estimated value would be more consistent.
If you're reads are right on the boundary, you're already in the realm of these calls likely being the least reliable, so I would be cautious. Further, I would resist the temptation to estimate one threshold from a dataset and use it for all of them, we've seen that the mass of the output distributions can move slightly. If you're really trying to tease apart the calls, I would run modkit sample-probs --hist, inspect the distributions, and set --filter-threshold and --mod-threshold accordingly. Lastly, if you have data where one kind of modification never occurs (or canonical never occurs), you will likely want to merge that dataset with another one containing the missing base modification state. For example, when we're performing validation of the models, we have strands with 5mC or canonical cytosine, I will run samtools merge $5mC_BAM $canonical_bam -o - | modkit sample-probs - to get the threshold value - but I don't think that's the problem you're having here. Happy to help if you have any more questions.
Thanks for coming back to me.
The issue here is that I want to look at a bam as it comes off the sequencer and I don't want to have to go back and look at the data again later (at least - not during the run).
So I was taking a toy case of looking at two BAM files fresh out of minKNOW - 4000 reads in each. If you run modkit pileup independently on them then merge the bedmethyl outputs in a sensible way it is not the same as running modkit pileup on the two files merged together.
In this example (it was human data) there were around 1500 discordant sites out of those 8000 reads.
Those 1500 sites are either called as methylated in one or other file individually but not in the combined data set - OR they are not called as methylated in either file individually but are in the combined data set. From observation I know that the former is more likely than the latter, but the threshold can go up or down.
I can come up with some workaround where I estimate this using options 2 or 3.
The problem I have with all of this is that a single position in a single read can be called as methylated or not methylated dependent on the rest of the data set and this is most unsatisfying!
The problem I have with all of this is that a single position in a single read can be called as methylated or not methylated dependent on the rest of the data set and this is most unsatisfying!
I apologize ahead of time, maybe I'm not understanding something obvious. A single position at in a single read should have a probability of being methylated, let's take a concrete example:
p_m = 0.9 // prob of 5mC
p_C = 0.1 // prob of canonical cytosine
If the threshold is estimated to be 0.89, this position will be called as 5mC. If the threshold is estimated to be 0.91, the call will be filtered out. In the latter case, the call will not be canonical. Just want to confirm we have the same understanding.
What I can see, however, is that when you perform a pileup you'll get different %-methylated at a given site as the threshold moves around.
If you need an online algorithm, I could imagine a method where you estimate the threshold over a time-batch of reads with a prior from previous experiments with the same model. I can try and come up with something like that if it would help your use case.
Yes - a single position in a single read has a specific probability of being methylated and this does not change.
As you say, when you perform a pileup you will get a different %-methylated at a given site as the threshold moves around.
For a single read that spans a site you can therefore change methylation call at that site in that read.
You have an option (--filter-threshold) for providing a fixed value.
If you were looking at human data what would you think is a sensible value for that threshold?
Given the example values you are giving above I am guessing you would set it quite high (i.e around 0.9)?
Thanks
For a single read that spans a site you can therefore change methylation call at that site in that read.
Are you determining the methylation state of a position in the genome with a pileup, then assign it to reads spanning that site? I can see why you don't want the pileup-based methylation assignment to change if that's the case.
If you mean that in one case you may have a read where the modification call is used and another where it is not, I suppose that is the effect of when you incorporate additional data the threshold value will change.
You may consider only using sites with a given valid coverage so that reads right on the boundary are less likely to flip the call at the site. You may also decrease the --filter-percentile.
When you run pileup you've probably seen that there is a warning message when the estimated threshold is low, <0.6 is considered very low and <0.7 is considered low. Anything above 0.7 (and probably 0.8 for 5mC), is probably OK, but I'm hesitant to give a hard number since it can vary depending on the run, basecaller, etc.
From what I can think of off the bat, the most robust solution will probably be an estimate from the data with a prior - but that's not a solution I can provide for you at the moment. So I would look at the output probabilities from some data you've already collected and pick a number from there given the guidelines above.
@mattloose let me know if you have additional questions.
To add to what Art mentioned, we don't supply globally applicable threshold values, though we do suspect that there are reasonable values for a range of samples. I would echo Arts suggestion that you pick a representative sample or samples and run modkit summary to get a filter threshold that you can apply to all of your runs going forward.
The reason that we do not do this by default is that for example when many calls (>90%) of calls are filtered out the results can look very odd and tracing this back to the large percentage of filtered calls can be a pain. Thus the 10% of calls filtered out was chosen as a more robust default for the majority of use cases. Thus when applying this global threshold you will want to be quite mindful of the percentage of calls filtered from a particular dataset (reported from modkit output). Higher filter percentages can indicate overall lower quality pores, out of spec (higher) modified base content, presence of unmodeled modified bases and any number of other things that might happen in a native biological sample.
Note that if you change the basecalling/modified base calling models I would suggest investigating the threshold value with this new setting.
I hope this helps clarify the ideas around setting this threshold value.