
TIDDIT_SV produces `*.tiddit.ploidies.tab` but not `*.tiddit.vcf`, timing out and failing to complete

SpikyClip opened this issue 1 year ago · 5 comments

Description of the bug

I've spent the last few weeks trying to troubleshoot a bug where TIDDIT consistently fails to complete for certain samples but not others. Within the process work folder, `*.tiddit.ploidies.tab` is produced, but the process then sits there until it times out with exit status 140 (up to 8 hours later).

Coinciding with these issues is unusually high memory use for ENSEMBLVEP_VEP and FREEBAYES, which I have had to raise to 144 GB and 65 GB respectively to avoid OOM issues; I don't know if this is related. I am working on a cluster, running the nextflow head job in an interactive smux session.

I run pretty much all the tools sarek offers on my tumour-only WGS data, and apart from the three processes above, everything else runs fine.

Things I've tried:

  1. Checking md5s for fastq files and double checking for mismatched samples.
  2. Downgrading nextflow (23.10.1 -> 23.04.5)
  3. Downgrading sarek (3.4.0 -> 3.3.2)
  4. Downgrading tiddit (3.6.1 -> 3.3.2)
  5. Redownloading igenomes and the snpEff and VEP caches (unrelated, but I was also debugging the memory issues at the same time)
  6. Providing more memory to the head job (4GB -> 8GB).
  7. Providing more memory/time/cores to tiddit.
  8. Running sarek -profile test_full (tiddit completes with no issues)

The two aspects of this problem that confuse me are:

  • The outcome is deterministic, i.e. TIDDIT fails consistently on certain samples only, yet the rest of the tools run fine on those samples and their output looks normal (to me at least).
  • `*.tiddit.ploidies.tab` is produced correctly in the work folder; it's just the VCF that is empty.

Things I've yet to try:

  1. Run sarek on an older dataset which has previously given me no issues (this is my next step, I just need to recover some space on the cluster).

I have attached the relevant pipeline_info, .nextflow.log and 3 examples of tiddit work folders that failed.

Command used and terminal output

Note: I have not reproduced the full script here, but all the `pipeline_info` parameters are in the attached zip file.

main() {
    cmdline "$@"
    module load "singularity/${singularity_module_version}"

    # export NXF_VER="23.04.5"
    export NXF_VER="23.10.1"

    nextflow -log "${nxf_log}" run "${pipeline_name}" \
        -revision "${revision}" \
        -profile "${profile}" \
        -config "${config}" \
        -params-file "${params}" \
        --input "${samplesheet}" \
        --email "${nxf_mail}" \
        --outdir "${outdir}" \
        "${nxf_args[@]:-}"
}


### Relevant files

[tiddit_bug.zip](https://github.com/nf-core/sarek/files/14643142/tiddit_bug.zip)


### System information

- Nextflow: `23.10.1`, `23.04.5`
- Hardware: HPC
- Executor: slurm
- Container engine: Singularity
- OS: `Linux m3-login2 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux`
- nf-core/sarek: `3.4.0`, `3.3.2`

SpikyClip avatar Mar 19 '24 04:03 SpikyClip

Hey! When you run tiddit manually (outside of sarek) on the problematic samples, does it finish and terminate?

FriederikeHanssen avatar Mar 19 '24 06:03 FriederikeHanssen

I was about to give that a go when I realised that in my run with tiddit downgraded (3.6.1 -> 3.3.2), half the problem samples actually completed, some very close to the timeout mark (8.h). So I suspect it's some combination of my samples and the tiddit version (though I don't know why that would be the case).

I am resuming it now with a 30.h time limit on the TIDDIT_SV process to see whether that covers all the samples. Another thing I noticed: sarek runs TIDDIT_SV with a process_medium label (6 cores), but `--threads` is not passed to TIDDIT_SV, so it ends up running in single-core mode. It may be more effective to set it up with the correct number of threads to deal with samples that would otherwise blow the time limit. If my next run is successful I will test that and report back.

Here is the half-successful execution report: execution_report_2024-03-19_09-37-32.txt

SpikyClip avatar Mar 20 '24 00:03 SpikyClip

Oh yeah, the thread thing is definitely an oversight and we need to fix this at the module level. You could add it yourself by setting a custom config and appending `--threads ${task.cpus}` to the `ext.args` field.
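A custom config along these lines should do it (a sketch only: the exact selector and `ext.args` contents depend on your sarek version, so check the pipeline's `conf/modules/` files before using):

```groovy
// custom.config -- pass the allocated CPUs through to TIDDIT.
// Selector pattern and closure form are assumptions; verify against
// your sarek version's module config.
process {
    withName: '.*TIDDIT_SV' {
        ext.args = { "--threads ${task.cpus}" }
    }
}
```

Then include it in the run with `-c custom.config` (or via the existing `-config` flag in your wrapper script).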

FriederikeHanssen avatar Mar 20 '24 08:03 FriederikeHanssen

So most of my samples managed to complete with TIDDIT 3.3.2, some taking up to 10.5 hours. However, 3 of my samples were still stalling around the 24-hour mark. I interrupted them to try adding threads to speed them up, but that hasn't been successful yet.

I installed tiddit manually and tried to run it on one of these 3 problem samples in its process directory; it stalled at this point:

[2024-03-22 09:50:40] Collecting signals on contig: chr19_KI270933v1_alt
[2024-03-22 09:50:40] Collecting signals on contig: chr6_GL000253v2_alt
[2024-03-22 09:50:40] Collecting signals on contig: chr6_GL000254v2_alt
/home/vajith/miniconda3/envs/tiddit/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3504: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/home/vajith/miniconda3/envs/tiddit/lib/python3.10/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
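For what it's worth, that pair of warnings is exactly what numpy emits when `np.mean` is called on an empty array, and the result is `nan`. A minimal reproduction (my illustration of the warning's trigger, e.g. a contig with no coverage bins; this is not TIDDIT's actual code):

```python
import math
import warnings

import numpy as np

# Reproduce the "Mean of empty slice" RuntimeWarning from the TIDDIT log:
# taking the mean of an empty array warns and returns nan.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    coverage = np.array([])        # hypothetical: alt contig with no reads
    mean_cov = np.mean(coverage)   # RuntimeWarning: Mean of empty slice.

print(math.isnan(mean_cov))  # True
print(any("Mean of empty slice" in str(w.message) for w in caught))  # True
```

The warning itself is harmless (numpy just returns `nan`), so the stall is more likely in whatever TIDDIT does with that `nan` afterwards.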

I suspect it is a bug with TIDDIT but I'm not sure what the cause is. I have attached a zip with some logs here: partial_success_tiddit.zip

I am at a dead end here so I'm just going to ignore these errors, hopefully someone will have some clue about what is going on here if its affecting other people. Thanks for all your help regardless!

SpikyClip avatar Mar 22 '24 02:03 SpikyClip

I added the --threads in nf-core/modules PR #5371.

famosab avatar Mar 22 '24 11:03 famosab