Question about "skipped reads"
Hi, @ArtRand
First, thank you for developing such a useful tool.
I have a question about the definition and behavior of "skipped reads" when using the modkit extract full command. I've noticed a significant difference in the ratio of skipped reads when using the --include-bed flag.
Here are the two commands I ran and their respective outputs:
- Command without --include-bed:
modkit extract full \
--force \
--mapped-only \
--log-filepath "output.full.wo_bed.log" \
"input.bam" \
"output.full.txt"
Result: processed 3,491,586 reads, skipped ~601,161 reads, failed ~535,375 reads (Total number of reads in fastq.gz with ML/MM tags used to produced this input bam file is 3,491,641 reads, almost same as number of processed reads)
- Command with --include-bed:
modkit extract full \
--force \
--mapped-only \
--include-bed "my_regions.bed" \
--log-filepath "output.full.with_bed.log" \
"input.bam" \
"output.full.with_bed.txt"
Result: processed 1,407,735 reads, skipped ~1,600,198 reads, failed ~217,237 reads
As you can see, the ratio of skipped reads increased dramatically when using --include-bed. I also checked the log file specified by --log-filepath, and while it contained detailed information on failed reads, there was no information about skipped reads.
I would appreciate it if you could clarify the following:
- What is the exact definition of a "skipped read" in modkit? (e.g., secondary/supplementary alignments, unmapped reads, etc.)
- Why does the count of skipped reads increase so much with the --include-bed flag? Is my assumption correct that all reads not overlapping the regions in the BED file are counted as "skipped"?
- Is it the intended behavior for the log file to contain no information on skipped reads?
Thank you for your time and help. Best regards, Jen
Hello @jennyp76
Sorry for the delay getting back to you.
tl;dr: The number of skipped reads increases for the reason you guessed, reads that don't overlap the regions in the BED file passed to --include-bed are counted as "skipped".
What is the exact definition of a "skipped read" in modkit? (e.g., secondary/supplementary alignments, unmapped reads, etc.)
It depends on the context. Most of the time skipped reads are non-primary alignment records since without careful alignment parameters it is not possible to discern the base modification positions from these records. There are more details on how to extract information from these records here. Given that it can change, if you want to get a good count of which how many records have valid base modification tags, I'd recommend using modkit modbam check-tags.
Why does the count of skipped reads increase so much with the --include-bed flag? Is my assumption correct that all reads not overlapping the regions in the BED file are counted as "skipped"?
Yes, reads that don't overlap the regions are counted as "skipped".
Is it the intended behavior for the log file to contain no information on skipped reads?
Yes, usually there is nothing wrong with these records, they are discarded for valid reasons. As mentioned above, to get a good QC-check on the MM/ML (modified base) tag information use modkit modbam check-tags.