No error on GPU OOM in ColabFold
Description of the bug
When the pipeline is submitted through Seqera, the run does not return exit status 1 if ColabFold runs out of GPU memory. Instead it finishes with a success (0) status, which prevents retries with different GPUs.
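For context, the retry behaviour this relies on only triggers on a non-zero exit status; simplified, the relevant config is along these lines (illustrative sketch, not my exact setup):

```groovy
// Minimal sketch: 'retry' only fires when the task returns a non-zero exit
// status, so a silent exit 0 after a GPU OOM never triggers a retry
process {
    withName: 'COLABFOLD_BATCH' {
        errorStrategy = 'retry'
        maxRetries    = 2
    }
}
```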
Command used and terminal output
Generated through Seqera:
```
nextflow run 'https://github.com/nf-core/proteinfold' -name test_run -params-file 'https://api.cloud.seqera.io/ephemeral/[id].json' -with-tower -r [tower-hash]
```
Relevant files
System information
- Nextflow v24.10.3
- Google Cloud
- Executor = 'google-batch'
- container Docker [COLABFOLD_BATCH:nf-core/proteinfold_colabfold:dev]
- OS: Linux (Deep Learning VM with CUDA preinstalled); however, the CUDA version in the Docker image is 12.6, which solves some JAX issues.
- nf-core/proteinfold v1.1.1
Hi @ppenev-eligo,
I'm looking to reproduce this but would need a few additional details to be able to do so:
- Which dataset did you use? Any of the test profiles from proteinfold? Knowing the data you used would help recreate the error.
- Which machine type on google-batch did you use? This would help recreate the insufficient GPU memory condition.
This looks like a case of Nextflow not reporting the proper error code rather than anything specific to Seqera Platform, but I would like to be able to reproduce it to be sure.
Hi @FloWuenne, thanks for looking into this!
- I can try rerunning with different sequences that I can share, but I suspect anything large enough will cause the same issue; this specific run had a total sequence length of ~6000 amino acids across 24 protein chains.
- I used a custom machine instance template with the following properties:
```yaml
kind: compute#instanceTemplate
name: gpu-24gb-l4
properties:
  canIpForward: false
  confidentialInstanceConfig:
    enableConfidentialCompute: false
  description: ''
  disks:
  - autoDelete: true
    boot: true
    deviceName: sp-test
    index: 0
    initializeParams:
      diskSizeGb: '200'
      diskType: pd-balanced
      sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu124-v20241224-debian-11
    kind: compute#attachedDisk
    mode: READ_WRITE
    type: PERSISTENT
  guestAccelerators:
  - acceleratorCount: 1
    acceleratorType: nvidia-l4
  keyRevocationActionType: NONE
  machineType: g2-standard-16
```
The idea was that the prediction would fail due to insufficient GPU memory and the pipeline would then retry on a different template with a larger GPU, which I have specified separately as additional Nextflow configuration. It seems the failure gets lost (indeed most likely a Nextflow issue) and the pipeline happily reports that the task has "failed successfully", so no further retries are issued.
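For reference, the escalation config I mean is roughly of this shape (a simplified sketch, not my exact configuration; the machine types are illustrative stand-ins for my instance templates, and COLABFOLD_BATCH is the pipeline's process name):

```groovy
// Simplified sketch of the intended retry escalation (illustrative values,
// not the exact configuration used in the failing run)
process {
    withName: 'COLABFOLD_BATCH' {
        errorStrategy = 'retry'
        maxRetries    = 2
        // G2/A2 machine types come with their GPU attached, so switching the
        // machine type on retry moves from a 24 GB L4 to a 40 GB A100
        machineType   = { task.attempt == 1 ? 'g2-standard-16' : 'a2-highgpu-1g' }
    }
}
```

Because the OOM task exits with 0, Nextflow never considers it failed, so the second attempt on the larger machine is never scheduled.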