Insufficient shared memory during Boltz execution via Seqera
Description of the bug
Execution of the pipeline in boltz mode through Seqera fails on inputs with more than roughly 50 amino acids, on a system with 24 GB of VRAM. I was able to reproduce the issue in a local execution and to fix it by modifying the runOptions parameter of the docker profile to include "--ipc=host". This seems to work, but may be unstable, because I still encountered sporadic CUDA pinned-memory errors. Adding the same profile modification to the Seqera config field did not help, but I am not sure whether Docker is used there at all. Tiny inputs complete successfully (a multimeric entry of two chains with 25 AA each), but anything larger (two chains of 300 AA), which is not an issue for 24 GB of VRAM when running Boltz natively, fails.
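For context on why the flag helps: Docker caps /dev/shm at 64 MB by default, which starves PyTorch's shared-memory buffers, whereas --ipc=host gives the container the host's full IPC namespace. The ceiling is easy to confirm with the image from this issue (a quick check, assuming df is available inside the container):

# Default run: /dev/shm is limited to Docker's 64 MB default.
docker run --rm quay.io/nf-core/proteinfold_boltz:dev df -h /dev/shm

# With the host IPC namespace: /dev/shm matches the host's shared memory.
docker run --rm --ipc=host quay.io/nf-core/proteinfold_boltz:dev df -h /dev/shm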
Command used and terminal output
Command issued by Seqera:

nextflow run 'https://github.com/nf-core/proteinfold' \
    -name astonishing_ekeblad \
    -params-file 'https://api.cloud.seqera.io/ephemeral/pm6L6_hGHMlFrU0vvnjeuA.json' \
    -with-tower \
    -r dev \
    -profile docker
Here is the command used for local execution:
nextflow run nf-core/proteinfold \
--input samplesheet.csv \
--outdir test_local \
--mode boltz \
--boltz_ccd_path ~/.boltz/ccd.pkl \
--boltz_model_path ~/.boltz/boltz1_conf.ckpt \
--use_gpu true \
-profile docker \
-r dev
And the custom configuration that fixes the issue locally (added to nextflow.config in the execution directory):
profiles {
    docker {
        docker.runOptions = params.use_gpu ? '--gpus all --ipc=host' : '-u $(id -u):$(id -g)'
    }
}
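If sharing the host IPC namespace proves too aggressive (it also removes IPC isolation between containers), an alternative sketch is to enlarge the shared-memory segment explicitly with Docker's --shm-size flag; the 16g value below is an assumption to be tuned to the instance:

profiles {
    docker {
        // Hypothetical alternative: raise /dev/shm instead of using the host IPC namespace.
        docker.runOptions = params.use_gpu ? '--gpus all --shm-size=16g' : '-u $(id -u):$(id -g)'
    }
}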
Relevant files
System information
- Nextflow version: 25.04.2 build 5947
- Hardware: Google Cloud
- Executor: google-batch
- Container: quay.io/nf-core/proteinfold_boltz:dev
- OS: Linux (Deep Learning VM with CUDA preinstalled)
- Version of nf-core/proteinfold: 1.2.0dev
Template of the custom machine on Google Cloud:
kind: compute#instanceTemplate
name: gpu-24gb-l4
properties:
  canIpForward: false
  confidentialInstanceConfig:
    enableConfidentialCompute: false
  description: ''
  disks:
  - autoDelete: true
    boot: true
    deviceName: sp-test
    index: 0
    initializeParams:
      diskSizeGb: '200'
      diskType: pd-balanced
      sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu124-v20241224-debian-11
    kind: compute#attachedDisk
    mode: READ_WRITE
    type: PERSISTENT
  guestAccelerators:
  - acceleratorCount: 1
    acceleratorType: nvidia-l4
  keyRevocationActionType: NONE
  machineType: g2-standard-16
I have identified a solution, although I have not been able to understand why the problem persists. I checked the running VM nodes created by Seqera, and indeed, without this fix the video memory usage maxes out at ~3 GB, after which the execution is aborted. This is very likely a Seqera-specific issue and not something related to the pipeline.
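For anyone who wants to reproduce that observation on a node, a plain polling loop with nvidia-smi is enough (nothing here is pipeline-specific; only a working NVIDIA driver is assumed):

# Poll GPU memory every 2 seconds; without the fix, memory.used
# plateaus around 3 GB before the job is aborted.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2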
Regardless, it could be a good idea to include this solution in the RUN_BOLTZ process configuration:

containerOptions = '--ipc=host'
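Until such a change lands in the pipeline, the same effect can be achieved from a user-side custom config via a process selector. A minimal sketch, assuming the process is matchable by the bare name RUN_BOLTZ (the fully qualified process path may be required instead):

process {
    // Scope the host IPC namespace to the Boltz step only;
    // adjust the withName pattern if the full process path is needed.
    withName: 'RUN_BOLTZ' {
        containerOptions = '--ipc=host'
    }
}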
It seems that passing it through docker.runOptions alone was not sufficient; maybe it has something to do with how Wave containers are initialised.
I believe this issue can be closed.