modkit modkit extract m6A tsv problems

Hi, @ArtRand I encountered some issues while using modkit extract. Here is my code: ${modkitDir}/modkit extract ${bamfile}/input_merge_sup_m6A_pseU.mod.sorted.bam null --read-calls ${bamfile}/input_merge_sup.pass.m6A.reads.tsv --reference ${index_dir}/gencode.vM33.normal.transcripts.fa --motif DRACH 2 --log-filepath ${bamfile}/input_merge_sup_m6A.reads.log --mapped-only -q 10000 --sample-num-reads 20084 -t 40 I have a table with the same transcript positons but different read id. I want to know modkit determines if this position is m6A modified?

In here, https://nanoporetech.github.io/modkit/intro_extract.html#note-on-implicit-base-modification-calls you said that fail column "true if the base modification call fell below the pass threshold". But there are still some fail values that are TRUE, but the call_code column is not "a". Why is this?

Best wishes, Kirito

Aug 09 '24 07:08 kir1to455

Hello @kir1to455,

There is a common misconception that the call_prob is the "probability of modification", it's not quite that. It's the probability of the call in call_code, there is a brief description in this issue. If the call_code is - (canonical) and the fail is TRUE it means that the call_prob is below the filter threshold (essentially that the model isn't very confident). Hope this answers your question.

Aug 10 '24 20:08 ArtRand

Hi, @ArtRand Thank you for your quick reply. So how do I determine if the position of this read is m6A? Is it enough as long as call_code is ' a '?

If that's the case, the results I obtained using modkit pileup and extract seem to be different. In modkit pileup, I got 3 modification m6A reads in 1139 position.

However, in modkit extract, I got 5 modication m6A reads in 1139 position.

My modkit pileup code is ${modkitDir}/modkit pileup ${bamfile}/ip_merge_sup_m6A_pseU.mod.sorted.bam ${bamfile}/ip_merge_sup.pass.m6A.bed --ref ${index_dir}/gencode.vM33.normal.transcripts.fa --motif DRACH 2 --log-filepath ${bamfile}/ip_merge_sup_m6A.log --num-reads 20084 --max-depth 20000 -t 40

My modkit extract code is ${modkitDir}/modkit extract ${bamfile}/ip_merge_sup_m6A_pseU.mod.sorted.bam null --read-calls ${bamfile}/ip_merge_sup.pass.m6A.reads.tsv --reference ${index_dir}/gencode.vM33.normal.transcripts.fa --motif DRACH 2 --log-filepath ${bamfile}/ip_merge_sup_m6A.reads.log --sample-num-reads 20084 --mapped-only -q 10000 -t 40

Best wishes, Kirito

Aug 11 '24 07:08 kir1to455

Hello @kir1to455,

Could you look in the log files for the two commands and tell me what the estimated threshold was? The log lines should look like [src/command_utils.rs::121][2024-08-11 10:59:50][DEBUG] estimated pass threshold .... From the screen shot of the extract table, all of the a calls you have highlighted have fail set to TRUE, so all of these would be counted in the $\text{N}_{\text{fail}}$. My guess is that the threshold estimated in the extract command is unusually high.

Aug 11 '24 18:08 ArtRand

Hi @ArtRand,

Here is the modkit extract threshold estimated.

Here is the modkit pileup threshold estimated.

As you said, the threshold estimated in the modkit extract command is higher than modkit pileup. How should I solve this problem?

Best wishes, Kirito

Aug 12 '24 01:08 kir1to455

Hello @kir1to455,

Thanks, could you check one more thing for me. In the extract and pileup logs there should be some log lines that describe how the sampling they should look like this:

[src/reads_sampler/sampling_schedule.rs::233][2024-08-13 18:07:27][DEBUG] removed 0 contigs from schedule with <= 1 reads
[src/reads_sampler/sampling_schedule.rs::127][2024-08-13 18:07:27][DEBUG] derived sampling schedule, sampling total 5987 reads from 1 contig, 0 unmapped reads
[src/reads_sampler/sampling_schedule.rs::140][2024-08-13 18:07:27][DEBUG] schedule: SQ: 0, 5987 reads

Could you tell me what these lines are in your runs?

Aug 14 '24 01:08 ArtRand

Hi, @ArtRand

I have uploaded the extract and pileup log file. Please check the file. modkit_pileup.log modkit_extract.log

Best wishes, Kirito

Aug 14 '24 02:08 kir1to455

Hello @kir1to455,

Sorry for the delayed response. The thresholds are ending up different because in modkit extract the pass threshold is estimated at only the third position DRACH motifs (the center A). On the other hand, in pileup all mapped As are used. The modified base model is more confident in general at DRACH-2 motifs, so the pass threshold estimated in extract is higher. Thanks for exposing this inconsistent behavior. You can force pileup to use only DRACH sites by first making a "motif-bed" then passing that to the --include-bed option like this:

ref=/home/hjliang/genomes/GRCm39_mm10/gencode.vM33.normal.transcripts.fa

modkit motif-bed ${ref} DRACH 2 > ./drach_2_transcripts.bed
modkit pileup \
  /public2/hjliang/Nanopore/modkit/ip_merge_sup_m6A_pseU.mod.sorted.bam \
  /public2/hjliang/Nanopore/modkit/ip_merge_sup.pass.m6A.bed \
  --ref /home/hjliang/genomes/GRCm39_mm10/gencode.vM33.normal.transcripts.fa \
  --include-bed ./drach_2_transcripts.bed \
  --motif DRACH 2 \
  --log-filepath /public2/hjliang/Nanopore/modkit/ip_merge_sup_m6A.log \
  --num-reads 20084 --max-depth 20000 -t 40

I'll add the ability to estimate the threshold based only on reference motif sequences in the next release.

Aug 19 '24 00:08 ArtRand

Hi @ArtRand , Another question is that the modkit extract will consume more and more memory over time. According to my understanding, modkit extract does not seem to release memory after outputting these reads tsv?

I'm using ${modkitDir}/modkit extract ${bamfile}/ip_merge_sup_m6A_pseU.mod.sorted.bam null --read-calls ${bamfile}/ip_merge_sup.pass.pseU.reads.tsv --reference ${index_dir}/gencode.vM33.normal.transcripts.fa --motif T 0 --log-filepath ${bamfile}/ip_merge_sup_pseU.reads.log --sample-num-reads 20084 --mapped-only -q 10000 -t 40 to get PseU modification.

It takes nearly 200G when I using get DRACH motif modifaction. c9203b792fc8b0a06beb69f032caf51

Best wishes, Kirito

Aug 19 '24 13:08 kir1to455

Hello @kir1to455,

The extract tables can get rather large and buffering them in memory as you've discovered can balloon. Here are three things to try:

Reduce the argument to -q to -q 5000.
Pass the --ignore-implicit flag. This reduces the number of records to be written by omitting the highly confident canonical calls.
Try selecting just a subset of the genome/transcriptome at a time with --region. Depending on what you're trying to do, you may not need the whole extract table, so just grab the reads you need.

For what it's worth, we're actively improving the extract function and will have a more efficient version in the next major release.

Aug 27 '24 14:08 ArtRand