
No error on GPU OOM Colabfold

Open ppenev-eligo opened this issue 11 months ago • 2 comments

Description of the bug

When the pipeline is submitted through Seqera Platform, the task does not return a non-zero exit status if Colabfold runs out of GPU memory. The pipeline instead reports success (exit status 0), which prevents retries on machines with larger GPUs.

Command used and terminal output

Generated through Seqera:

nextflow run 'https://github.com/nf-core/proteinfold' -name test_run -params-file 'https://api.cloud.seqera.io/ephemeral/[id].json' -with-tower -r [tower-hash]

Relevant files

task-6.command.out.txt

task-6.command.err.txt

System information

  • Nextflow v24.10.3
  • Google Cloud
  • Executor = 'google-batch'
  • container Docker [COLABFOLD_BATCH:nf-core/proteinfold_colabfold:dev]
  • OS: Linux (Deep Learning VM with CUDA preinstalled); however, the CUDA in the Docker image is 12.6, which resolves some JAX issues.
  • nf-core/proteinfold v1.1.1

ppenev-eligo avatar Feb 06 '25 13:02 ppenev-eligo

Hi @ppenev-eligo,

I'm looking to reproduce this, but I need a few additional details to do so:

  • Which dataset did you use? One of the test profiles from proteinfold? Knowing the data would help to recreate the error.
  • Which machine type on google-batch did you use? This would help recreate the insufficient-GPU-memory condition.

This looks like a case of Nextflow not reporting the proper error code, rather than a Seqera Platform issue, but I would like to reproduce it to be sure.
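
In the meantime, one quick check (the bucket and task-hash paths below are placeholders) would be to compare the exit status Nextflow recorded in the task work directory with what Colabfold actually printed:

# Placeholders: substitute your work bucket and the task's two-level work directory hash.
# Nextflow writes the task exit status to .exitcode; if it contains 0 while
# .command.err shows a GPU out-of-memory error, the failure is being swallowed
# before Nextflow ever sees it.
gsutil cat gs://WORK_BUCKET/xx/yyyyyy/.exitcode
gsutil cat gs://WORK_BUCKET/xx/yyyyyy/.command.err | grep -iE 'out of memory|resource_exhausted'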

FloWuenne avatar Mar 24 '25 15:03 FloWuenne

Hi @FloWuenne, thanks for looking into this!

  • I can try rerunning with different sequences that I can share, but I suspect anything large enough will cause the same issue; this specific run had a total sequence length of ~6000 amino acids across 24 protein chains.

  • I used a custom instance template with the following properties:

kind: compute#instanceTemplate
name: gpu-24gb-l4
properties:
  canIpForward: false
  confidentialInstanceConfig:
    enableConfidentialCompute: false
  description: ''
  disks:
  - autoDelete: true
    boot: true
    deviceName: sp-test
    index: 0
    initializeParams:
      diskSizeGb: '200'
      diskType: pd-balanced
      sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu124-v20241224-debian-11
    kind: compute#attachedDisk
    mode: READ_WRITE
    type: PERSISTENT
  guestAccelerators:
  - acceleratorCount: 1
    acceleratorType: nvidia-l4
  keyRevocationActionType: NONE
  machineType: g2-standard-16

The idea was that the prediction would fail due to insufficient GPU memory, and on retry the pipeline would use a different template with a larger GPU, which I have specified separately as additional Nextflow configuration (roughly as sketched below). It seems the failure gets lost (indeed most likely a Nextflow issue) and the pipeline happily reports that the "task has failed successfully", so no further retries are issued.
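
A minimal sketch of that extra configuration (the process selector and the second template name are illustrative rather than the exact values from my setup; as I understand it, the google-batch executor accepts an instance template through the template:// prefix of machineType):

process {
    withName: 'COLABFOLD_BATCH' {
        // Retry once; on the second attempt switch to a template with a larger GPU.
        errorStrategy = 'retry'
        maxRetries    = 1
        machineType   = { task.attempt == 1 ? 'template://gpu-24gb-l4' : 'template://gpu-larger' }
    }
}

Because the Colabfold task exits with status 0 even after the OOM, errorStrategy = 'retry' never triggers and the larger template is never used.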

ppenev-eligo avatar Mar 25 '25 15:03 ppenev-eligo