modkit pileup --interval-size 1000000 --chunk-size 100
Hello
When running modkit pileup on the same BAM and reference genome, I observed output differences depending on whether interval and chunk parameters were set.
Commands
modkit pileup <input.bam> <output_default.bedMethyl>
modkit pileup --interval-size 1000000 --chunk-size 100 <input.bam> <output_chunked.bedMethyl>
Data
- Same sample (129.4M reads)
- Same basecalling, alignment, and input BAM
- Same modkit version and compute environment
Results
- bedMethyl lines by default = 2,397,050
- bedMethyl lines with --interval-size 1000000 --chunk-size 100 = 2,397,256
- Difference: +206 lines
Even with identical inputs, enabling --interval-size and --chunk-size produces a slightly larger bedMethyl output. This suggests these parameters may affect determinism or completeness in CpG methylation calls. Could the team please confirm if this behavior is expected, or if it should be flagged for further investigation in a future release?
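To narrow down where the 206 extra lines come from, the two outputs can be compared record by record. The following is a minimal sketch, not part of modkit: it assumes the first six whitespace-separated bedMethyl columns are chrom, start, end, modification code, score, and strand, and the function names `load_keys`, `compare_bedmethyl`, and `duplicated_keys` are hypothetical helpers introduced here for illustration.

```python
from collections import Counter


def load_keys(path):
    """Count records per (chrom, start, end, mod code, strand) key.

    Assumes the first six columns are chrom, start, end, modification
    code, score, strand. Lines that do not fit (e.g. a header) are skipped.
    """
    keys = Counter()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip short lines and anything whose coordinates are not integers.
            if len(fields) < 6 or not (fields[1].isdigit() and fields[2].isdigit()):
                continue
            chrom, start, end, code, _score, strand = fields[:6]
            keys[(chrom, int(start), int(end), code, strand)] += 1
    return keys


def compare_bedmethyl(path_a, path_b):
    """Return (keys only in A, keys only in B), each as a sorted list."""
    a, b = load_keys(path_a), load_keys(path_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    return only_a, only_b


def duplicated_keys(path):
    """Return keys that occur more than once within a single file."""
    return sorted(key for key, count in load_keys(path).items() if count > 1)
```

Calling `compare_bedmethyl` on the default and chunked outputs lists the positions present in only one of them, and `duplicated_keys` shows whether either file reports the same position, code, and strand more than once. Both files are loaded fully into memory, which should be manageable at roughly 2.4M lines each.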
Let's stay in touch. Many thanks in advance. Best, Boris
Hello @blipinskiaima,
I've sent you an email with this same message, but for folks reading this issue board I'll report the salient points here.
One thing to check first is that the estimated pass threshold value is the same between runs that you are comparing. This value is output to the console and to the log file specified with --log. There is a new version of pileup coming very soon which has a completely re-worked mechanism. If you'd be willing to test this, since you're taking a hard look at your data, I would appreciate early feedback.
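For the threshold check mentioned above, a small helper can pull the relevant lines out of each run's log file for side-by-side reading. This is only a sketch: the exact wording of the log message is not given here, so it assumes the relevant lines contain the word "threshold", and `threshold_lines` is a hypothetical helper name.

```python
def threshold_lines(log_path):
    """Return log lines that mention 'threshold' (case-insensitive).

    Assumption: the estimated pass threshold is reported on a line that
    contains the word 'threshold'; adjust the match if the wording differs.
    """
    with open(log_path, errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if "threshold" in line.lower()]
```

Running it on the log written via --log for each of the two runs and comparing the returned lines by eye shows whether the estimated value differs between them.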