Insufficient shared memory during Boltz execution via Seqera
Description of the bug
Execution of the pipeline in boltz mode through Seqera fails on inputs with more than roughly 50 amino acids, on a system with 24 GB of VRAM. I was able to reproduce the issue in a local execution and to fix it by modifying the runOptions parameter of the docker profile to include "--ipc=host". This seems to work, but may be unstable, because I still encountered sporadic CUDA pinned-memory errors. Adding the same profile modification to the Seqera config field did not help, but I am not sure whether Docker is used there at all. Tiny inputs complete successfully (a multimeric entry of two chains with 25 AA each), but anything larger (two chains of 300 AA), which is not an issue for 24 GB of VRAM when running Boltz natively, fails.
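For context on why the flag helps: Docker caps /dev/shm at 64 MB by default, which starves PyTorch's shared-memory buffers, whereas --ipc=host gives the container the host's full IPC namespace. The ceiling is easy to confirm with the image from this issue (a quick check, assuming df is available inside the container):

# Default run: /dev/shm is limited to Docker's 64 MB default.
docker run --rm quay.io/nf-core/proteinfold_boltz:dev df -h /dev/shm

# With the host IPC namespace: /dev/shm matches the host's shared memory.
docker run --rm --ipc=host quay.io/nf-core/proteinfold_boltz:dev df -h /dev/shm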
Command used and terminal output
Command issued by Seqera:

nextflow run 'https://github.com/nf-core/proteinfold' \
    -name astonishing_ekeblad \
    -params-file 'https://api.cloud.seqera.io/ephemeral/pm6L6_hGHMlFrU0vvnjeuA.json' \
    -with-tower \
    -r dev \
    -profile docker
Here is the command used for local execution:
nextflow run nf-core/proteinfold \
--input samplesheet.csv \
--outdir test_local \
--mode boltz \
--boltz_ccd_path ~/.boltz/ccd.pkl \
--boltz_model_path ~/.boltz/boltz1_conf.ckpt \
--use_gpu true \
-profile docker \
-r dev
And the custom configuration that fixes the issue locally (added to nextflow.config in the execution directory):
profiles {
    docker {
        docker.runOptions = params.use_gpu ? '--gpus all --ipc=host' : '-u $(id -u):$(id -g)'
    }
}
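If sharing the host IPC namespace proves too aggressive (it also removes IPC isolation between containers), an alternative sketch is to enlarge the shared-memory segment explicitly with Docker's --shm-size flag; the 16g value below is an assumption to be tuned to the instance:

profiles {
    docker {
        // Hypothetical alternative: raise /dev/shm instead of using the host IPC namespace.
        docker.runOptions = params.use_gpu ? '--gpus all --shm-size=16g' : '-u $(id -u):$(id -g)'
    }
}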
Relevant files
System information
- Nextflow version: 25.04.2 build 5947
- Hardware: Google Cloud
- Executor: google-batch
- Container: quay.io/nf-core/proteinfold_boltz:dev
- OS: Linux (Deep Learning VM with CUDA preinstalled)
- Version of nf-core/proteinfold: 1.2.0dev
Template of the custom machine on Google Cloud:
kind: compute#instanceTemplate
name: gpu-24gb-l4
properties:
  canIpForward: false
  confidentialInstanceConfig:
    enableConfidentialCompute: false
  description: ''
  disks:
  - autoDelete: true
    boot: true
    deviceName: sp-test
    index: 0
    initializeParams:
      diskSizeGb: '200'
      diskType: pd-balanced
      sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu124-v20241224-debian-11
    kind: compute#attachedDisk
    mode: READ_WRITE
    type: PERSISTENT
  guestAccelerators:
  - acceleratorCount: 1
    acceleratorType: nvidia-l4
  keyRevocationActionType: NONE
  machineType: g2-standard-16
I have identified a solution, although I have not been able to understand why the problem persists. I checked the running VM nodes created by Seqera, and indeed, without this fix the video memory usage maxes out at ~3 GB, after which the execution is aborted. This is very likely a Seqera-specific issue and not something related to the pipeline.
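For anyone who wants to reproduce that observation on a node, a plain polling loop with nvidia-smi is enough (nothing here is pipeline-specific; only a working NVIDIA driver is assumed):

# Poll GPU memory every 2 seconds; without the fix, memory.used
# plateaus around 3 GB before the job is aborted.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2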
Regardless, it could be a good idea to include this solution in the RUN_BOLTZ process configuration:

containerOptions = '--ipc=host'
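Until such a change lands in the pipeline, the same effect can be achieved from a user-side custom config via a process selector. A minimal sketch, assuming the process is matchable by the bare name RUN_BOLTZ (the fully qualified process path may be required instead):

process {
    // Scope the host IPC namespace to the Boltz step only;
    // adjust the withName pattern if the full process path is needed.
    withName: 'RUN_BOLTZ' {
        containerOptions = '--ipc=host'
    }
}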
It seems that passing it through docker.runOptions alone was not sufficient; maybe it has something to do with how Wave containers are initialised.
I believe this issue can be closed.