modkit icon indicating copy to clipboard operation
modkit copied to clipboard

How does modkit pileup decide which sites to include compared to extract calls

Open HenrydMl opened this issue 8 months ago • 2 comments

Hi,

So I'm currently looking at m6A RNA methylation, and required more information than is given in the pileup (especially concerning call probability). Using modkit-extract-call, I tried to recreate a table similar to modkit pileup while using the following filters on extract calls before grouping it by chrom + ref_pos:

calls_df_filter <- calls_df[calls_df$ref_position != -1,]
calls_df_filter <- calls_df_filter[calls_df_filter$fail == "false",]

## Creating a summary table and adapting call probability for m6a calls
m6a_call_filtered_df <- calls_df_filter
m6a_call_filtered_df$call_prob <- ifelse(m6a_calling, calls_df_filter$call_prob, 1 - calls_df_filter$call_prob)
m6a_call_filtered_df$call_code <- ifelse(m6a_calling, "a", "a")
m6a_call_filtered_df$modified <- ifelse(m6a_calling, "true", "false")

## Collapsing per chrom and per ref_pos, while adding some extra values:
combine_probs <- function(p) {
        p <- p[!is.na(p)]
        site_probability <- 1 - prod(1 - p)
        return(site_probability)
    }


transcript_level_df <- m6a_call_filtered_df %>%
    group_by(chrom, ref_position, mod_strand) %>%
    summarize(
        n_reads = n(),
        Prob_mod = combine_probs(call_prob),
        # query_kmer = names(which.max(table(query_kmer))),
        # multiple_kmers = n_distinct(query_kmer) > 1,
        count_modified = sum(modified=="true"),
        .groups = 'drop'
    )

However, although the resulting dataframe contains all the sites from modkit-pileup, it also includes some (~25000) extra sites that passed these filters, despite having similar features in extract-calls than sites which were included in pileup:

read_id	forward_read_position	ref_position	chrom	mod_strand	ref_strand	ref_mod_strand	fw_soft_clipped_start	fw_soft_clipped_end	read_length	call_prob	call_code	base_qual	ref_kmer	query_kmer	canonical_base	modified_primary_base	fail	inferred	within_alignment	flag
## Not in pileup
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981	540	2669	ENST00000000233	+	+	+	422	2	1029	0.6748047	-	35	.	GCAGC	A	A	false	false	true	0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981	616	2869	ENST00000000233	+	+	+	422	2	1029	0.70996094	a	35	.	CCACG	A	A	false	false	true	0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981	639	2895	ENST00000000233	+	+	+	422	2	1029	0.6591797	-	35	.	TCAGG	A	A	false	false	true	0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981	887	3144	ENST00000000233	+	+	+	422	2	1029	0.71972656	-	42	.	CCAGA	A	A	false	false	true	0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981	967	3226	ENST00000000233	+	+	+	422	2	1029	0.6533203	-	29	.	TGAGG	A	A	false	false	true	0
59a6acde-f821-490b-852c-90b69e700354	3	8519	ENST00000001008	+	+	+	3	0	451	0.67285156	a	16	.	AAATG	A	A	false	false	true	0

## In Pileup
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981	546	2675	ENST00000000233	+	+	+	422	2	1029	0.89941406	-	43	.	GCACG	A	A	false	false	true	0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981	553	2806	ENST00000000233	+	+	+	422	2	1029	1	-	33	.	GTATG	A	A	false	true	true	0
e6cfd1a2-fb8b-4e1d-ad8d-37d8c8ee0981	559	2812	ENST00000000233	+	+	+	422	2	1029	0.73339844	a	48	.	CCAGG	A	A	false	false	true	0


Do you know why, and how could I reach the results from pileup with this approach? I thought that the fail column ued the same threshold as pileup...

Thank you for your help!

HenrydMl avatar Apr 25 '25 14:04 HenrydMl

Hello @HenrydMl,

Could you send me the modkit pileup command you're using as well as the calculated pass threshold that it reports (or the debug log)? You might want to group_by the ref_strand not the mod_strand. If you send me a small modBAM I can try and replicate the result and get back to you.

ArtRand avatar Apr 28 '25 16:04 ArtRand

Hi @ArtRand , thank you for looking into this!

Indeed grouping by ref_strand helps a bit, but there are still positions not found in the pileup which pass my filters. Also, I should precise that I do not filter for the Phred score from the extract-calls otherwise my resulting table excludes some sites from the pileup.

The command I'm using for modkit pileup is the following (in Nextflow):

modkit pileup --ignore 17596 \\
        ${sorted_bam} \\
        ${sample}.bed --with-header

and the extract calls is:

modkit extract calls --ignore 17596 ${sorted_bam} ${sample}_extract_calls.bed

The command log for pileup is:

> calculated chunk size: 6, interval size 100000, processing 600000 positions concurrently
> ignoring mod code 17596
> attempting to sample 10042 reads
> Using filter threshold 0.72558594 for A.
> Done, processed 718635 rows. Processed ~7798 reads and skipped zero reads.

And the log for extract calls, it is:

> found BAM index, processing reads in 100000 base pair chunks
> attempting to sample 10042 reads
> processed 7991 reads, 1771567 rows, skipped ~4977 reads, failed ~0 reads

It does look that for some sites, the individual call probabilities are under the pileup threshold, but still pass for the extract threshold, like the following read, whose position is not found in pileup:

read_id	forward_read_position	ref_position	chrom	mod_strand	ref_strand	ref_mod_strand	fw_soft_clipped_start	fw_soft_clipped_end	read_length	call_prob	call_code	base_qual	ref_kmer	query_kmer	canonical_base	modified_primary_base	fail	inferred	within_alignment	flag

49a24e61-c8e8-47e8-a3e8-38ed94b0ccb7	473	19392	ENST00000686077	+	-	-	11	0	476	0.66503906	-	5	.	AAATC	A	A	false	false	true	16

The sorted bam_file from dorado can be found here:

https://github.com/HenrydMl/toShare/tree/main/issue_423_modkit

HenrydMl avatar Apr 29 '25 09:04 HenrydMl