modkit dmr pair fails with thread '<unnamed>' panicked at modkit-core/src/parsing_utils.rs:34:75:
Hi,
I am running modkit dmr pair on two shallow sequencing runs, and for a while, everything runs without a problem. However, at somepoint after processing millions of sites without a problem the run quits with a rather cryptic error message:
> reading reference FASTA at "/src/hg38/sequence/primary/GRCh38.primary_assembly.genome.fa"
> 11 common sequence(s) between FASTA and both samples
> running single-site analysis
> using default prior, Beta(α: 0.55, β: 0.55)
> estimating max coverages from data
> sampled 574344 a records and 391946 b records, calculating max coverages for 95th percentile
> calculated max coverage for a: 3 and b: 2
> 1278 batches processed
> 1296 batches processed
> 1326 batches processed
> 72718915 sites failed
> 19283149 sites processed successfully
thread '<unnamed>' panicked at modkit-core/src/parsing_utils.rs:34:75:
index out of bounds: the len is 0 but the index is 0
stack backtrace:
./dmr_modkit.sh: line 21: 3716 Abort trap: 6 RUST_BACKTRACE=1 modkit dmr pair -a "${h}" -b "${t}" -o "${dmr_result}" --ref "${ref}" --force --header --base C --threads "8" --log-filepath "dmr.s1.log"
The logs don't contain any more information.
Any help would be appreciated!
Hello @liscruk,
Sorry about the delay in getting back to you.
Could you show me the modkit pileup command that you used to generate the bedmethyl files? Is is possible that the files have been corrupted in some way? Unless the files are huge could you try
$ awk '{print NF}' $file | uniq
You should get only one line that says 18. This error comes from Modkit trying to parse the "mod code / name" (column 4) in the bedMethyl records. It is possible for Modkit to produce records where the modification code and the motif are present, such as C,CG,0 - although there are tests for these cases it is possible that you've managed to find an edge case that confuses the parser.
Thanks for the response. Here is the command. Rather standard, I would guess:
modkit pileup --cpg --ref {params.ref} -t {threads} --prefix {params.pfx} {input.bam} {output.modbed}
awk returns 18 as you expected.
I have recently run modkit pileup in different contexts and it seems the probability of this error increases with size of the output so it might truely be an edge case.
I restricted pileup to a narrower definition of regions of interest and this has worked far better and reliable although it would be great to get the total output even though huge.
Hello @liscruk,
These are the best kinds of bugs.. happens only occasionally and you have to wait in order to observe the symptom :).
My next guess is that the Tabix-handling machinery isn't working correctly and you're getting a corrupted line. There are a couple ways we could debug this:
- I can push a change to github and you can build it. It should still have the problem, but there will be more logging around what the input was that caused the panic. 1.1 I can give you a Linux build that has the same changes (this might take a little longer).
- You can send me the file that exposes the problem via email or I can set up a S3 (AWS) share.
I'm definitely keen to get to the bottom of this.