methylseq Preseq failing most of the time

Anyone running the pipeline will be familiar with this log message:

terminated with an error status (1) -- Error is ignored.

Preseq has a history of failing often, especially on small or low-complexity files. It now seems to fail far more often, maybe every time. This needs investigating.

Phil

May 29 '20 10:05 ewels

At the very least, setting errorStrategy 'ignore' on this process will stop a preseq failure from killing your whole run.

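/*
 * STEP 9 - preseq
 */
process preseq {
    // A failed preseq task is logged but does not stop the run
    errorStrategy 'ignore'
    // ... rest of the process definition unchanged ...
}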

Aug 25 '21 23:08 bsiranosian

Yup! The pipeline already has that set as default so you shouldn't need to set that in any additional configs:

https://github.com/nf-core/methylseq/blob/03972a686bedeb2920803cd575f4d671e9135af0/conf/base.config#L59-L61

It would be nice to try to get it to fail a little less though 😅

Sep 04 '21 05:09 ewels

Oh good, I didn't notice that. I'm not sure why my whole run failed then.

Sep 04 '21 14:09 bsiranosian

Later preseq versions received some updates to fail more gracefully, so if you upgrade the preseq version a bit, you should be fine 👍🏻

Nov 03 '22 12:11 apeltzer

I think we're already on 3.1.2 which is quite recent. Do you know when those versions went out? I still see the same failures on every test run.

Nov 03 '22 23:11 ewels

It's tempting to update the config to allow the error exit code, so that the pipeline report doesn't always say that the pipeline completed with errors (which always worries me and others).
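One way to get a similar effect without touching the config is to tolerate the exit code inside the task script itself. The following is only a minimal shell sketch of that idea, not what the pipeline does: `run_tolerant` is a hypothetical helper name, and the preseq call in the comment reuses flags seen elsewhere in this thread.

```shell
#!/usr/bin/env bash
# Run a command; if it exits non-zero, log a warning and return 0 so that
# the calling task is still recorded as successful.
# `run_tolerant` is a hypothetical helper name, not part of the pipeline.
run_tolerant() {
    local status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "WARNING: '$*' exited with status ${status}; continuing" >&2
    fi
    return 0
}

# Example use inside a task script (commented out because it needs preseq):
# run_tolerant preseq lc_extrap -v -P sample.sorted.bed -o sample.lc.preseq.txt
```

The catch is that the output file may be missing after a tolerated failure, so anything downstream would have to treat it as optional.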

Nov 03 '22 23:11 ewels

Using BED files instead of BAM, as suggested in https://github.com/nf-core/methylseq/issues/96#issuecomment-716986814, could also help.

Nov 03 '22 23:11 ewels

I tried a BED file as input, but it still fails:

gatk MarkDuplicatesSpark -I ${bam} -O ${sid}.dedup.bam -M ${sid}_markdup_metrics.txt --tmp-dir . -OBI
gatk EstimateLibraryComplexity -I ${bam} -O ${sid}_est_lib_complex_metrics.txt
# convert to BED file with paired-ends (BEDPE format)
bamToBed -i ${sid}.dedup.bam -bedpe > ${sid}.sorted.bed
preseq lc_extrap -v -P ${sid}.sorted.bed -o ${sid}.lc.preseq.txt
preseq c_curve -v -P ${sid}.sorted.bed -o ${sid}.c.preseq.txt

PAIRED_END_BED_INPUT
  TOTAL READS     = 14155
  DISTINCT READS  = 14006
  DISTINCT COUNTS = 5
  MAX COUNT       = 94
  COUNTS OF 1     = 13956
  MAX TERMS       = 2
  OBSERVED COUNTS (95)
  1	13956
  2	46
  3	1
  5	2
  94	1
  
  ERROR:	max count before zero is less than min required count (4) duplicates removed
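
Reading the histogram above, counts 1, 2 and 3 are observed but 4 is not, so the largest count before the first gap is 3, which is below the minimum of 4 named in the error. That reading of "max count before zero" is an assumption based on the verbose output, not on the preseq source. A minimal shell sketch of the same check, which could be run on a histogram before calling preseq (`max_count_before_zero` is a hypothetical helper, not a preseq command):

```shell
#!/usr/bin/env bash
# Read a counts histogram on stdin (two whitespace-separated columns:
# duplication count, number of distinct reads with that count) and print
# the largest count reached before the first gap, counting up from 1.
# `max_count_before_zero` is a hypothetical helper, not a preseq command.
max_count_before_zero() {
    awk '
        NF >= 2 && $2 > 0 { seen[$1 + 0] = 1 }   # remember every observed count
        END {
            n = 0
            while ((n + 1) in seen) n++          # stop at the first missing count
            print n
        }
    '
}

# The histogram from the log above:
printf '1\t13956\n2\t46\n3\t1\n5\t2\n94\t1\n' | max_count_before_zero   # prints 3
```

With deduplicated input almost every read has a count of 1, so the histogram runs out of consecutive counts very early, which fits the "duplicates removed" hint in the message.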

Jan 10 '23 08:01 Rohit-Satyam

May I suggest running preseq in defect mode, as suggested by the preseq developer, when the number of reads is >50M?

ERROR: too many defects in the approximation, consider running in defect mode

https://github.com/smithlabcode/preseq/issues/29
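
A rough shell sketch of how this could look, assuming that "defect mode" corresponds to the -D (-defects) option of lc_extrap; please check the lc_extrap help of your preseq version before relying on it. Instead of switching on a read-count threshold, this variant simply retries in defect mode after a failure. `lc_extrap_with_defect_fallback` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Run `preseq lc_extrap` normally; if it fails, retry once with -D
# (assumed here to be defect mode).
# `lc_extrap_with_defect_fallback` is a hypothetical helper name.
# Usage: lc_extrap_with_defect_fallback <output file> <other lc_extrap args...>
lc_extrap_with_defect_fallback() {
    local out=$1
    shift
    if preseq lc_extrap -o "$out" "$@"; then
        return 0
    fi
    echo "WARNING: lc_extrap failed; retrying in defect mode (-D)" >&2
    # The exit status of this retry becomes the status of the function
    preseq lc_extrap -D -o "$out" "$@"
}

# Example (commented out because it needs preseq and an input file):
# lc_extrap_with_defect_fallback sample.lc.preseq.txt -v -P sample.sorted.bed
```

If the retry also fails, the function returns that failure, so the existing errorStrategy handling still applies.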

Oct 04 '23 16:10 bounlu