modkit extract m6A tsv problems
Hi, @ArtRand
I encountered some issues while using modkit extract.
Here is my code:
```shell
${modkitDir}/modkit extract ${bamfile}/input_merge_sup_m6A_pseU.mod.sorted.bam null \
    --read-calls ${bamfile}/input_merge_sup.pass.m6A.reads.tsv \
    --reference ${index_dir}/gencode.vM33.normal.transcripts.fa \
    --motif DRACH 2 \
    --log-filepath ${bamfile}/input_merge_sup_m6A.reads.log \
    --mapped-only -q 10000 --sample-num-reads 20084 -t 40
```
I have a table with the same transcript positions but different read IDs.
I want to know how modkit determines whether a position is m6A-modified.
Here, https://nanoporetech.github.io/modkit/intro_extract.html#note-on-implicit-base-modification-calls, you say the `fail` column is "true if the base modification call fell below the pass threshold". But there are rows where `fail` is TRUE and yet `call_code` is not "a". Why is this?
Best wishes, Kirito
Hello @kir1to455,
There is a common misconception that `call_prob` is the "probability of modification"; it's not quite that. It's the probability of the call in `call_code`; there is a brief description in this issue. If `call_code` is `-` (canonical) and `fail` is TRUE, it means that the `call_prob` is below the filter threshold (essentially, that the model isn't very confident). Hope this answers your question.
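To make the distinction concrete, here is a minimal sketch in Python of how one might read the extract table. The rows are invented for illustration, and the exact casing of the `fail` values may differ from a real table; only the column names `call_code`, `call_prob`, and `fail` are taken from the discussion above.

```python
import csv
import io

# Hypothetical extract-table rows (only the relevant columns shown).
# call_prob is the confidence in the call named by call_code, NOT the
# probability of modification.
tsv = """read_id\tcall_code\tcall_prob\tfail
read_1\ta\t0.95\tfalse
read_2\ta\t0.60\ttrue
read_3\t-\t0.55\ttrue
read_4\t-\t0.98\tfalse
"""

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# A confident m6A call: call_code == 'a' AND the call passed the filter.
pass_m6a = [r["read_id"] for r in rows
            if r["call_code"] == "a" and r["fail"] == "false"]

# read_3 has fail == 'true' with call_code == '-': the *canonical* call
# itself fell below the pass threshold -- it is not an m6A call.
print(pass_m6a)  # -> ['read_1']
```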
Hi @ArtRand, thank you for your quick reply. So how do I determine whether a position on a read is m6A-modified? Is it enough that `call_code` is "a"?
If that's the case, the results I obtained from modkit pileup and modkit extract seem to differ.
In modkit pileup, I got 3 m6A-modified reads at position 1139.
However, in modkit extract, I got 5 m6A-modified reads at the same position.
My modkit pileup command is:
```shell
${modkitDir}/modkit pileup ${bamfile}/ip_merge_sup_m6A_pseU.mod.sorted.bam ${bamfile}/ip_merge_sup.pass.m6A.bed \
    --ref ${index_dir}/gencode.vM33.normal.transcripts.fa \
    --motif DRACH 2 \
    --log-filepath ${bamfile}/ip_merge_sup_m6A.log \
    --num-reads 20084 --max-depth 20000 -t 40
```
My modkit extract command is:
```shell
${modkitDir}/modkit extract ${bamfile}/ip_merge_sup_m6A_pseU.mod.sorted.bam null \
    --read-calls ${bamfile}/ip_merge_sup.pass.m6A.reads.tsv \
    --reference ${index_dir}/gencode.vM33.normal.transcripts.fa \
    --motif DRACH 2 \
    --log-filepath ${bamfile}/ip_merge_sup_m6A.reads.log \
    --sample-num-reads 20084 --mapped-only -q 10000 -t 40
```
Best wishes, Kirito
Hello @kir1to455,
Could you look in the log files for the two commands and tell me what the estimated thresholds were? The log lines should look like `[src/command_utils.rs::121][2024-08-11 10:59:50][DEBUG] estimated pass threshold ...`. From the screenshot of the extract table, all of the `a` calls you have highlighted have `fail` set to TRUE, so all of these would be counted in $\text{N}_{\text{fail}}$. My guess is that the threshold estimated in the extract command is unusually high.
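The effect of the threshold can be sketched with a few lines of Python. The per-read probabilities below are made up for illustration, as are the two threshold values; the point is only that calls below the pass threshold move from the modified count into $\text{N}_{\text{fail}}$, which is one way a position could show 5 `a` calls in the raw extract table but only 3 passing reads in pileup.

```python
# Hypothetical call_prob values for five 'a' calls at one position.
call_probs = [0.97, 0.95, 0.92, 0.80, 0.75]

def count_mod(probs, threshold):
    """Split calls into (n_pass, n_fail) around the pass threshold."""
    n_pass = sum(p >= threshold for p in probs)
    n_fail = len(probs) - n_pass
    return n_pass, n_fail

# With a lower estimated threshold, all five calls pass:
print(count_mod(call_probs, 0.70))  # -> (5, 0)
# With a higher estimated threshold, two calls drop into N_fail:
print(count_mod(call_probs, 0.90))  # -> (3, 2)
```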
Hi @ArtRand,
Here is the threshold estimated by modkit extract.
Here is the threshold estimated by modkit pileup.
As you said, the threshold estimated by the modkit extract command is higher than the one from modkit pileup. How should I solve this problem?
Best wishes, Kirito
Hello @kir1to455,
Thanks, could you check one more thing for me? In the extract and pileup logs there should be some log lines that describe the sampling; they should look like this:
```
[src/reads_sampler/sampling_schedule.rs::233][2024-08-13 18:07:27][DEBUG] removed 0 contigs from schedule with <= 1 reads
[src/reads_sampler/sampling_schedule.rs::127][2024-08-13 18:07:27][DEBUG] derived sampling schedule, sampling total 5987 reads from 1 contig, 0 unmapped reads
[src/reads_sampler/sampling_schedule.rs::140][2024-08-13 18:07:27][DEBUG] schedule: SQ: 0, 5987 reads
```
Could you tell me what these lines are in your runs?
Hi, @ArtRand
I have uploaded the extract and pileup log files; please check them: modkit_pileup.log, modkit_extract.log
Best wishes, Kirito
Hello @kir1to455,
Sorry for the delayed response. The thresholds end up different because in modkit extract the pass threshold is estimated only at the third position of DRACH motifs (the central A), whereas in pileup all mapped As are used. The modified-base model is generally more confident at DRACH motifs, so the pass threshold estimated in extract is higher. Thanks for exposing this inconsistent behavior. You can force pileup to use only DRACH sites by first making a "motif bed" and then passing it to the --include-bed option, like this:
```shell
ref=/home/hjliang/genomes/GRCm39_mm10/gencode.vM33.normal.transcripts.fa
modkit motif-bed ${ref} DRACH 2 > ./drach_2_transcripts.bed
modkit pileup \
    /public2/hjliang/Nanopore/modkit/ip_merge_sup_m6A_pseU.mod.sorted.bam \
    /public2/hjliang/Nanopore/modkit/ip_merge_sup.pass.m6A.bed \
    --ref /home/hjliang/genomes/GRCm39_mm10/gencode.vM33.normal.transcripts.fa \
    --include-bed ./drach_2_transcripts.bed \
    --motif DRACH 2 \
    --log-filepath /public2/hjliang/Nanopore/modkit/ip_merge_sup_m6A.log \
    --num-reads 20084 --max-depth 20000 -t 40
```
I'll add the ability to estimate the threshold based only on reference motif sequences in the next release.
Hi @ArtRand, another question: modkit extract consumes more and more memory over time. As I understand it, modkit extract does not seem to release memory after writing out the read TSVs?
To get pseU modification calls, I'm using:
```shell
${modkitDir}/modkit extract ${bamfile}/ip_merge_sup_m6A_pseU.mod.sorted.bam null \
    --read-calls ${bamfile}/ip_merge_sup.pass.pseU.reads.tsv \
    --reference ${index_dir}/gencode.vM33.normal.transcripts.fa \
    --motif T 0 \
    --log-filepath ${bamfile}/ip_merge_sup_pseU.reads.log \
    --sample-num-reads 20084 --mapped-only -q 10000 -t 40
```
It took nearly 200 GB of memory when I extracted DRACH-motif modifications.
Best wishes, Kirito
Hello @kir1to455,
The extract tables can get rather large, and, as you've discovered, buffering them in memory can balloon. Here are three things to try:
- Reduce the argument to `-q`, e.g. to `-q 5000`.
- Pass the `--ignore-implicit` flag. This reduces the number of records to be written by omitting the highly confident canonical calls.
- Try selecting just a subset of the genome/transcriptome at a time with `--region`. Depending on what you're trying to do, you may not need the whole extract table, so just grab the reads you need.
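The third suggestion can be sketched as a loop over contigs. The contig names and file paths below are placeholders, and the loop only echoes the commands it would run instead of invoking modkit, so you can check the region splits before committing to a real run:

```shell
#!/bin/sh
# Sketch: run extract one contig at a time with --region so that only
# one contig's records need to be buffered at once.
# Paths and contig names here are placeholders.
bam=ip_merge_sup_m6A_pseU.mod.sorted.bam
ref=gencode.vM33.normal.transcripts.fa
regions="ENSMUST00000000001.5 ENSMUST00000000028.14"

for region in ${regions}; do
    # Dry run: echo the command rather than executing it.
    echo modkit extract "${bam}" null \
        --read-calls "reads.${region}.tsv" \
        --reference "${ref}" \
        --region "${region}" \
        --motif DRACH 2 --mapped-only -q 5000 -t 40
done
```

Dropping the `echo` turns the dry run into the real per-region extraction, with one TSV written per contig.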
For what it's worth, we're actively improving the extract function and will have a more efficient version in the next major release.