How does modkit pileup decide which sites to include compared to extract calls
Hi,
So I'm currently looking at m6A RNA methylation, and required more information than is given in the pileup (especially concerning call probability). Using modkit-extract-call, I tried to recreate a table similar to modkit pileup while using the following filters on extract calls before grouping it by chrom + ref_pos:
calls_df_filter <- calls_df[calls_df$ref_position != -1,]
calls_df_filter <- calls_df_filter[calls_df_filter$fail == "false",]
## Creating a summary table and adapting call probability for m6a calls
m6a_call_filtered_df <- calls_df_filter
m6a_call_filtered_df$call_prob <- ifelse(m6a_calling, calls_df_filter$call_prob, 1 - calls_df_filter$call_prob)
m6a_call_filtered_df$call_code <- ifelse(m6a_calling, "a", "a")
m6a_call_filtered_df$modified <- ifelse(m6a_calling, "true", "false")
## Collapsing per chrom and per ref_pos, while adding some extra values:
combine_probs <- function(p) {
p <- p[!is.na(p)]
site_probability <- 1 - prod(1 - p)
return(site_probability)
}
transcript_level_df <- m6a_call_filtered_df %>%
group_by(chrom, ref_position, mod_strand) %>%
summarize(
n_reads = n(),
Prob_mod = combine_probs(call_prob),
# query_kmer = names(which.max(table(query_kmer))),
# multiple_kmers = n_distinct(query_kmer) > 1,
count_modified = sum(modified=="true"),
.groups = 'drop'
)
However, although the resulting dataframe contains all the sites from modkit-pileup, it also includes some (~25000) extra sites that passed these filters, despite having similar features in extract-calls than sites which were included in pileup:
read_id forward_read_position ref_position chrom mod_strand ref_strand ref_mod_strand fw_soft_clipped_start fw_soft_clipped_end read_length call_prob call_code base_qual ref_kmer query_kmer canonical_base modified_primary_base fail inferred within_alignment flag
## Not in pileup
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981 540 2669 ENST00000000233 + + + 422 2 1029 0.6748047 - 35 . GCAGC A A false false true 0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981 616 2869 ENST00000000233 + + + 422 2 1029 0.70996094 a 35 . CCACG A A false false true 0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981 639 2895 ENST00000000233 + + + 422 2 1029 0.6591797 - 35 . TCAGG A A false false true 0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981 887 3144 ENST00000000233 + + + 422 2 1029 0.71972656 - 42 . CCAGA A A false false true 0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981 967 3226 ENST00000000233 + + + 422 2 1029 0.6533203 - 29 . TGAGG A A false false true 0
59a6acde-f821-490b-852c-90b69e700354 3 8519 ENST00000001008 + + + 3 0 451 0.67285156 a 16 . AAATG A A false false true 0
## In Pileup
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981 546 2675 ENST00000000233 + + + 422 2 1029 0.89941406 - 43 . GCACG A A false false true 0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981 553 2806 ENST00000000233 + + + 422 2 1029 1 - 33 . GTATG A A false true true 0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981 559 2812 ENST00000000233 + + + 422 2 1029 0.73339844 a 48 . CCAGG A A false false true 0
Do you know why, and how could I reach the results from pileup with this approach? I thought that the fail column ued the same threshold as pileup...
Thank you for your help!
Hello @HenrydMl,
Could you send me the modkit pileup command you're using as well as the calculated pass threshold that it reports (or the debug log)? You might want to group_by the ref_strand not the mod_strand. If you send me a small modBAM I can try and replicate the result and get back to you.
Hi @ArtRand , thank you for looking into this!
Indeed grouping by ref_strand helps a bit, but there are still positions not found in the pileup which pass my filters. Also, I should precise that I do not filter for the Phred score from the extract-calls otherwise my resulting table excludes some sites from the pileup.
The command I'm using for modkit pileup is the following (in Nextflow):
modkit pileup --ignore 17596 \\
${sorted_bam} \\
${sample}.bed --with-header
and the extract calls is:
modkit extract calls --ignore 17596 ${sorted_bam} ${sample}_extract_calls.bed
The command log for pileup is:
> calculated chunk size: 6, interval size 100000, processing 600000 positions concurrently
> ignoring mod code 17596
> attempting to sample 10042 reads
> Using filter threshold 0.72558594 for A.
> Done, processed 718635 rows. Processed ~7798 reads and skipped zero reads.
And the log for extract calls, it is:
> found BAM index, processing reads in 100000 base pair chunks
> attempting to sample 10042 reads
> processed 7991 reads, 1771567 rows, skipped ~4977 reads, failed ~0 reads
It does look that for some sites, the individual call probabilities are under the pileup threshold, but still pass for the extract threshold, like the following read, whose position is not found in pileup:
read_id forward_read_position ref_position chrom mod_strand ref_strand ref_mod_strand fw_soft_clipped_start fw_soft_clipped_end read_length call_prob call_code base_qual ref_kmer query_kmer canonical_base modified_primary_base fail inferred within_alignment flag
49a24e61-c8e8-47e8-a3e8-38ed94b0ccb7 473 19392 ENST00000686077 + - - 11 0 476 0.66503906 - 5 . AAATC A A false false true 16
The sorted bam_file from dorado can be found here:
https://github.com/HenrydMl/toShare/tree/main/issue_423_modkit