modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Question about the appropriate threshold for mod_qual

Open gyuhee0928 opened this issue 1 year ago • 5 comments

I am trying to extract read-level methylation information for 5mC and 5hmC. I proceeded with the steps below, but I'm not confident if this is the correct approach, so I would like to ask for clarification.

  1. I used modkit extract to generate the read-level methylation output file and extracted confident calls using mod_qual.

  2. For the threshold of mod_qual, I ran modkit summary for each bam file and used the values below "# pass_threshold_C 0.773437". In other words, I used the pass_threshold_C value from the modkit summary file as the threshold for both 5hmC and 5mC mod_qual. Since this threshold appears to be for base calls, I wanted to ask for right threshold.

How do I set an appropriate threshold for each base modification? Or, is it recommended to not filter modification when they have mod_qual above 0? Also, am I extracting the read-level methylation information correctly using this approach?

gyuhee0928 avatar Nov 25 '24 13:11 gyuhee0928

Hello @gyuhee0928,

You're following the recommended strategy to extract read-level base modification calls. I would use modkit extract calls, which will produce an output that's easier to parse if you care about the calls at each position in each read. It will also perform the automatic threshold estimation so you don't have to do it ahead of time.

How do I set an appropriate threshold for each base modification?

I would recommend you use modkit sample-probs --hist --out-dir ${out_dir} ${bam} then look at the generated plots and tables. You can pick a threshold value for each modification class that corresponds to the 10% lowest confidence calls.

Or, is it recommended to not filter modification when they have mod_qual above 0?

We generally recommend filtering out the lowest confidence calls. There will always be a trade-off between accuracy and retaining more base modification calls, however.

Also, am I extracting the read-level methylation information correctly using this approach?

Generally yes, but I would recommend using modkit extract calls to get the base modification calls.

ArtRand avatar Nov 26 '24 12:11 ArtRand

Thank you for the detailed response to my previous question.

I have a question regarding using modkit extract calls.

When using --pass-only and --include-bed, I want to filter modification calls based on a threshold calculated by sampling from the entire BAM file, and then extract results for several specific regions. My question is: does using the --include-bed option limit the threshold calculation to the regions specified in the BED file, or is the threshold still calculated from the entire BAM?

If the threshold is calculated on the specific regions , I would like to know if there is a way to perform the threshold calculation on the entire BAM file, while extracting read information only from specific regions.

Thank you for your help!

gyuhee0928 avatar Dec 05 '24 07:12 gyuhee0928

Hello @gyuhee0928,

When you pass --include-bed to modkit extract calls the program will only use calls that fall within the regions provided. To do what you're looking for, first run modkit sample-probs ${bam} without any specification of which regions to use. Then grab the threshold value for each primary base (or, as discussed above, determine a value for each canonical and modified base). Then pass these values to modkit extract calls. Keep in mind that the default behavior of modkit extract calls is to output all of the calls and annotate the failing ones in the fail column of the schema. If you want to output only the passing calls (this will reduce the size of the output file) make sure to pass the --pass-only flag.

ArtRand avatar Dec 06 '24 19:12 ArtRand

Thank you so much for your kind response.

To calculate the threshold for the 10% lowest confidence calls for each modification, should I proceed with "modkit sample-probs" and then use the probabilities.tsv file output? Specifically, would it be correct to select the value between range_start and range_end of the row where the percentile_rank is closest to 10 for each modification code?

gyuhee0928 avatar Dec 09 '24 04:12 gyuhee0928

Hello @gyuhee0928,

Yes, use modkit sample-probs then you can find the bin where the percentile rank is ~10 (or just below 10). Then use the range_start value. Due to how the probabilities are stored in the BAM, all values in the interval [range_start, range_end) are stored as the same value so you don't have to find the midpoint.

ArtRand avatar Dec 11 '24 02:12 ArtRand