Preseq failing most of the time
Anyone running the pipeline will be familiar with this log message:
terminated with an error status (1) -- Error is ignored.
Preseq has a history of failing, especially on small or low-complexity files. But it now seems to fail far more often, maybe on every run. This needs investigating.
Phil
At the very least, setting errorStrategy 'ignore' on this process will stop a preseq failure from killing your whole run.
/*
 * STEP 9 - preseq
 */
process preseq {
    errorStrategy 'ignore'
Yup! The pipeline already sets that by default, so you shouldn't need to set it in any additional configs:
https://github.com/nf-core/methylseq/blob/03972a686bedeb2920803cd575f4d671e9135af0/conf/base.config#L59-L61
It would be nice to try to get it to fail a little less though 😅
Oh good, I didn't notice that. I'm not sure why my whole run failed then.
Later preseq versions received some updates to fail more gracefully, so if you upgrade the preseq version a bit, you should be fine 👍🏻
I think we're already on 3.1.2 which is quite recent. Do you know when those versions went out? I still see the same failures on every test run.
It's tempting to update the config to allow the error exit code, so that we don't always get the pipeline report saying that the pipeline completed with errors (which always worries me / others).
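One way to allow the error exit code, sketched here as a shell wrapper that could sit inside the process script block. `run_tolerant` is a hypothetical helper name, not something the pipeline defines. This also assumes the process outputs are declared optional, since Nextflow would otherwise still fail the task on a missing output file.

```shell
# Hypothetical helper: run a command, report a non-zero exit on stderr,
# and always return 0 so the surrounding task is not marked as failed.
run_tolerant() {
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "WARNING: '$1' exited with status $status; continuing" >&2
    fi
    return 0
}

# Intended use inside the script block (not executed here):
#   run_tolerant preseq lc_extrap -v -B "$bam" -o "${sid}.lc.preseq.txt"
```

The trade-off is that real failures become warnings in the task log rather than errors in the pipeline report, so they are easier to miss.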
Using BED files instead of BAM, as suggested in https://github.com/nf-core/methylseq/issues/96#issuecomment-716986814, could also help.
I tried a BED file as input, but it still fails:
gatk MarkDuplicatesSpark -I ${bam} -O ${sid}.dedup.bam -M ${sid}_markdup_metrics.txt --tmp-dir . -OBI
gatk EstimateLibraryComplexity -I ${bam} -O ${sid}_est_lib_complex_metrics.txt
# convert to BED file with paired-ends (BEDPE format)
bamToBed -i ${sid}.dedup.bam -bedpe > ${sid}.sorted.bed
preseq lc_extrap -v -P ${sid}.sorted.bed -o ${sid}.lc.preseq.txt
preseq c_curve -v -P ${sid}.sorted.bed -o ${sid}.c.preseq.txt
PAIRED_END_BED_INPUT
TOTAL READS = 14155
DISTINCT READS = 14006
DISTINCT COUNTS = 5
MAX COUNT = 94
COUNTS OF 1 = 13956
MAX TERMS = 2
OBSERVED COUNTS (95)
1 13956
2 46
3 1
5 2
94 1
ERROR: max count before zero is less than min required count (4) duplicates removed
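For what it's worth, the error seems to follow directly from the histogram above: counts 1, 2 and 3 are observed, but no read occurs exactly 4 times, so the largest count before the first gap is 3, which is below the required 4. A small sketch of that check follows. It is my reading of the error message, not code taken from preseq's source, and `max_count_before_zero` is a hypothetical helper name.

```shell
# Hypothetical helper: read a "count frequency" histogram on stdin and print
# the largest count j such that every count 1..j was observed at least once.
max_count_before_zero() {
    awk '
        $2 > 0 { seen[$1] = 1 }
        END {
            j = 1
            while (j in seen) j++
            print j - 1
        }
    '
}

# The OBSERVED COUNTS table from the log above.
max_count_before_zero <<'EOF'
1 13956
2 46
3 1
5 2
94 1
EOF
# prints 3
```

So with only ~14k reads and almost no duplicates, this input looks too sparse for lc_extrap whether it comes from BAM or BED.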
Could the pipeline run preseq in defect mode, as suggested by the preseq developer, when the number of reads is >50M?
ERROR: too many defects in the approximation, consider running in defect mode
https://github.com/smithlabcode/preseq/issues/29
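A minimal sketch of how such a fallback might look: try the normal run first and retry with `-D` (the defect-mode flag listed in the `preseq lc_extrap` help, as far as I know) only if it fails. `lc_extrap_with_fallback` and `PRESEQ_BIN` are hypothetical names for illustration, and the other flags simply mirror the commands posted above.

```shell
# Hypothetical: lets the preseq executable be overridden, e.g. for testing.
PRESEQ_BIN="${PRESEQ_BIN:-preseq}"

# Run lc_extrap normally; on failure, retry once in defect mode (-D).
# Returns the exit status of the last attempt.
lc_extrap_with_fallback() {
    input=$1
    output=$2
    if "$PRESEQ_BIN" lc_extrap -v -P "$input" -o "$output"; then
        return 0
    fi
    echo "lc_extrap failed; retrying in defect mode (-D)" >&2
    "$PRESEQ_BIN" lc_extrap -v -D -P "$input" -o "$output"
}

# Intended use (not executed here):
#   lc_extrap_with_fallback "${sid}.sorted.bed" "${sid}.lc.preseq.txt"
```

Retrying on any failure is simpler than checking the read count, but it is aimed at the "too many defects" error only. Since defect mode skips the defect testing, the resulting curve should probably be treated with more caution.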