modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Filtering criteria used in Modkit extract calls output

Open CrisDAndres opened this issue 3 months ago • 1 comments

Hello everyone,

code used: ./modkit extract calls calls.bam calls.tsv --reference ref.fasta

I am analyzing modification calls with Modkit. When I view my aligned BAM file in IGV, I see good coverage at certain reference positions. However, when I run modkit extract calls and examine the resulting calls.tsv file, there are far fewer reads reported for those same coordinates

Can anyone explain what filters or criteria Modkit applies when generating the calls.tsv file? Why is the number of reads with modification calls much lower than the total coverage shown in the BAM?

Specifically, I suspect that the reads missing from calls.tsv are those that did not exceed the --modified-bases-threshold (default 0.05) set in Dorado basecalling. Is that correct? Are these reads excluded because they do not reach the modification probability threshold?

If that is the case, shouldn’t those bases still appear in the calls.tsv file as canonical (unmodified) bases?

I would appreciate any clarification on the filtering steps applied and how to interpret the differences in coverage.

Thank you in advance for your help!

Cristina

CrisDAndres avatar Oct 01 '25 15:10 CrisDAndres

Hello @CrisDAndres, sorry for the delay in getting back to you.

Can anyone explain what filters or criteria Modkit applies when generating the calls.tsv file? Why is the number of reads with modification calls much lower than the total coverage shown in the BAM?

It is possible that IGV is showing you non-primary alignments. If you filter the BAM to primary alignments, does the coverage match?

Modkit extract will work with non-primary alignments, but you need to make sure that the correct settings were used during alignment. There are some instructions here the flag to include these records is --allow-non-primary.

Specifically, I suspect that the reads missing from calls.tsv are those that did not exceed the --modified-bases-threshold (default 0.05) set in Dorado basecalling. Is that correct? Are these reads excluded because they do not reach the modification probability threshold?

This option to Dorado will not change which records are included in modkit extract. You can find some documentation about that option here.

ArtRand avatar Oct 07 '25 16:10 ArtRand

Hi @ArtRand

I have a related question: how can I recover per-read calls, all of them, as they appear in the BAM MM/ML tag, e.g. to estimate occupancy = # m6A calls above certain threshold / # all explicit A/m6A calls (ignoring skipped calls) ?

In the docs, you mention that modkit extract calls uses the same thresholding algorithm employed by modkit pileup, but options like --filter-threshold 0 --mod-threshold a:0 --mod-threshold 17802:0 seem to have no effect... and e.g. the call_prob seem to be always above ~0.5 (I'm just trying it on a few reads to test). Or maybe I don't quite understand what's the output of extract calls...?! Maybe related to https://github.com/nanoporetech/modkit/issues/423

I can get what I want with modkit extract full, but --ignore-implicit doesn't seem to work, I have to pipe the output and filter e.g. awk '$20 ~ "inferred|false"'. Am I missing something?

I'm using v6.0.0 and I have not compared with previous versions of modkit. Thanks for your help.

eboileau avatar Dec 18 '25 09:12 eboileau

Hello @eboileau,

I think what you want is a command like this:

modkit extract calls ${bam} stdout \
  --filter-threshold 0.7 \
  --ignore-index \
  | awk -v OFS="\t" '(NR==1)||(($20=="false")&&($21=="false"))'

You can use any combination of --filter-threshold or --mod-thresholds as you need. $20=="false" means that the call passes the threshold value for that base/modification and $21=="false" will skip the inferred(/skipped) calls. I would use --ignore-index for memory and speed. The extract functions are near the top of the list for a perf-refactor.

A little explanation, modkit extract full gives you all of the information in the MM/ML tags, all the mods and their probability. Thing is, if you only care about the call, which is the argmax of the modification probabilities and the canonical probabilities. This is what extract calls does, it flattens each position to the call (modified or unmodified) per-read, per-position. The --filter-threshold apparatus is mostly there for convenience since it doesn't really change the number of records in the output. The filtering threshold only changes the value of the fail column.

ArtRand avatar Dec 19 '25 02:12 ArtRand

Hi @ArtRand Thanks for your quick feedback.

The filtering threshold only changes the value of the fail column.

I see... and the fail column would combine the effect of both --filter-threshold AND --mod-threshold, the result depending on whether we look at a modified base or not... (I'm still not sure how both filters combine here?!)

modkit extract calls ${bam} stdout
--filter-threshold 0.7
--ignore-index
| awk -v OFS="\t" '(NR==1)||(($20=="false")&&($21=="false"))'

... so if I'm interested e.g. in m6A occupancy, given a certain threshold, then the number of (explicit) A calls per read would be given by canonical_base=="A", no matter if these calls were made for m6A, Am, or I, is that right? i.e. if the number of calls differ between modifications, this would be the total number of A calls for that read, and I can't really know how many e.g. m6A calls were made.

Also, how does modkit infer modified_primary_base? Even for inosine, it's always, as far as I could see, equal to canonical_base == A. Are there any other cases where this could be different?

Finally, if I filter for within_alignment=="true", this would remove S and I (CIGAR) operations, right?

Thanks for clarifying.

P.S. Indeed, reducing --queue-size and using --ignore-index is essential otherwise I keep having oom-kill, I already found that out from #306 and #508.

eboileau avatar Dec 19 '25 12:12 eboileau

Hello @eboileau,

the fail column would combine the effect of both --filter-threshold AND --mod-threshold, the result depending on whether we look at a modified base or not

Correct. The threshold values are hierarchical. An option --filter-threshold 0.7 sets a threshold value for all calls (modified, unmodified, any primary sequence base). Then --filter-threshold A:0.8 sets a threshold value for any call on a primary sequence "A". Finally, --mod-threshold a:0.9 sets a threshold value for a (m6A) base modification calls. For example:

call = m6A
p = 0.95 
  => PASS since 0.95 >= 0.9, the threshold for m6A

call = Inosine
p = 0.75
  => FAIL since 0.75 is < 0.8, the threshold for A-calls

call = m5C
p = 0.75
  => PASS since 0.75 is >= 0.7, the "root" threshold for anything that's not A or m6A.

if I'm interested e.g. in m6A occupancy, given a certain threshold, then the number of (explicit) A calls per read would be given by canonical_base=="A"

Correct. Alternatively, you can use the filter modified_primary_base=="A", which is slightly more correct, but for dRNA it won't make a difference.

i.e. if the number of calls differ between modifications, this would be the total number of A calls for that read, and I can't really know how many e.g. m6A calls were made.

If I'm understanding you correctly, to calculate the occupancy per modification, you will need to keep a counter with the number of occurrences of each modification. If you only care about the fraction $\frac{\text{modified}}{\text{A}}$ then the numerator is the number of rows where call_code $\neq$ - and modified_primary_base $=$ A.

Also, how does modkit infer modified_primary_base? Even for inosine, it's always, as far as I could see, equal to canonical_base == A. Are there any other cases where this could be different?

modified_primary_base is the base on which the base modification is attached. canonical_base is the base in the MM tag. modified_primary_base $\neq$ canonical_base on duplex reads where the base modification is on the (-)-strand.

Finally, if I filter for within_alignment=="true", this would remove S and I (CIGAR) operations, right?

Correct.

ArtRand avatar Dec 19 '25 16:12 ArtRand