GPU not recognised
Description of the bug
I am running my pipeline on a system with a single RTX 4090:
nvidia-smi
Tue Jun 17 14:43:43 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:73:00.0 Off | Off |
| 0% 39C P8 28W / 450W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
My run times out because it uses the CPU instead of the GPU (even though I have use_gpu set to true).
Jun-17 05:50:43.544 [TaskFinalizer-6] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PROTEINFOLD:COLABFOLD:COLABFOLD_BATCH (T1024)'
Caused by:
Process exceeded running time limit (8h)
Command executed:
ln -r -s params/alphafold_params_*/* params/
colabfold_batch \
--use-gpu-relax --amber --templates \
--num-recycle 3 \
--data $PWD \
--model-type alphafold2_ptm \
T1024.a3m \
$PWD
for i in `find *_relaxed_rank_001*.pdb`; do cp $i `echo $i | sed "s|_relaxed_rank_| |g" | cut -f1`"_colabfold.pdb"; done
for i in `find *.png -maxdepth 0`; do cp $i ${i%'.png'}_mqc.png; done
cat <<-END_VERSIONS > versions.yml
"NFCORE_PROTEINFOLD:COLABFOLD:COLABFOLD_BATCH":
colabfold_batch: 1.5.2
END_VERSIONS
Command exit status:
-
Command output:
2025-06-16 21:53:08,416 Running colabfold 1.5.5 (1648d2335943f9a483b6a803ebaea3e76162c788)
2025-06-16 21:53:08,749 WARNING: no GPU detected, will be using CPU
2025-06-16 21:53:08,823 Matplotlib created a temporary cache directory at /tmp/matplotlib-nphslddv because the default path (/home/sibbe/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2025-06-16 21:53:09,279 generated new fontManager
2025-06-16 21:53:10,012 Found 9 citations for tools or databases
2025-06-16 21:53:10,013 Query 1/1: T1024 (length 408)
Command error:
INFO: Converting SIF file to temporary sandbox...
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (350) bind mounts
0%| | 0/150 [elapsed: 00:00 remaining: ?]
SUBMIT: 0%| | 0/150 [elapsed: 00:00 remaining: ?]
COMPLETE: 0%| | 0/150 [elapsed: 00:00 remaining: ?]
COMPLETE: 100%|██████████| 150/150 [elapsed: 00:00 remaining: 00:00]
COMPLETE: 100%|██████████| 150/150 [elapsed: 00:01 remaining: 00:00]
Work dir:
/home/sibbe/scratch/work/02/5e952845978d419edfceb6deb0e34d
Container:
/home/sibbe/scratch/work/singularity/quay.io-nf-core-proteinfold_colabfold-1.1.1.img
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
Jun-17 05:50:43.552 [TaskFinalizer-6] INFO nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
Jun-17 05:50:43.568 [Actor Thread 70] DEBUG nextflow.sort.BigSort - Sort completed -- entries: 7; slices: 1; internal sort time: 0.01 s; external sort time: 0.004 s; total time: 0.014 s
Jun-17 05:50:43.576 [Actor Thread 70] DEBUG nextflow.file.FileCollector - >> temp file exists? false
Jun-17 05:50:43.577 [Actor Thread 70] DEBUG nextflow.file.FileCollector - Missed collect-file cache -- cause: java.nio.file.NoSuchFileException: /home/sibbe/scratch/work/collect-file/f21e6a94c90aca6e96ff1aacd86fd989
Jun-17 05:50:43.593 [TaskFinalizer-6] ERROR nextflow.Nextflow - Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting
Jun-17 05:50:43.594 [main] DEBUG nextflow.Session - Session await > all processes finished
Jun-17 05:50:43.595 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: local) - terminating tasks monitor poll loop
Jun-17 05:50:43.595 [main] DEBUG nextflow.Session - Session await > all barriers passed
Jun-17 05:50:43.596 [Actor Thread 70] DEBUG nextflow.file.FileCollector - Saved collect-files list to: /home/sibbe/scratch/work/collect-file/f21e6a94c90aca6e96ff1aacd86fd989
Jun-17 05:50:43.601 [Actor Thread 70] DEBUG nextflow.file.FileCollector - Deleting file collector temp dir: /tmp/nxf-13813534474444119490
Jun-17 05:50:43.611 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'TaskFinalizer' shutdown completed (hard=false)
Jun-17 05:50:43.613 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'PublishDir' shutdown completed (hard=false)
Jun-17 05:50:43.626 [main] INFO nextflow.Nextflow - -[nf-core/proteinfold] Pipeline completed with errors-
Jun-17 05:50:43.643 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=5; failedCount=1; ignoredCount=0; cachedCount=8; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=18d 20h 37m 52s; failedDuration=2d; cachedDuration=12h 21m 6s;loadCpus=0; loadMemory=0; peakRunning=2; peakCpus=128; peakMemory=224 GB; ]
Jun-17 05:50:43.643 [main] DEBUG nextflow.trace.TraceFileObserver - Workflow completed -- saving trace file
Jun-17 05:50:43.646 [main] DEBUG nextflow.trace.ReportObserver - Workflow completed -- rendering execution report
Jun-17 05:50:45.001 [main] DEBUG nextflow.trace.TimelineObserver - Workflow completed -- rendering execution timeline
Jun-17 05:50:45.475 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Jun-17 05:50:45.529 [main] INFO org.pf4j.AbstractPluginManager - Stop plugin '[email protected]'
Jun-17 05:50:45.530 [main] DEBUG nextflow.plugin.BasePlugin - Plugin stopped nf-validation
Jun-17 05:50:45.532 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'FileTransfer' shutdown completed (hard=false)
Jun-17 05:50:45.533 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye
Command used and terminal output
nextflow run nf-core/proteinfold \
--input samples.csv \
--outdir results \
--mode colabfold \
--colabfold_server local \
--full_dbs true \
--colabfold_model_preset "alphafold2_ptm" \
--use_gpu true \
-resume \
-profile singularity \
-c config.config
cat config.config
process {
withName: /NFCORE_PROTEINFOLD:PREPARE_COLABFOLD_DBS:ARIA2_.+:ARIA2/ {
time = '12h'
}
withName: MMSEQS_COLABFOLDSEARCH {
memory = '112 GB'
cpus = 64
}
}
Relevant files
No response
System information
No response
When I check GPU availability from inside the container, the GPU is detected:
sibbe@binfgpu2:~:singularity shell --nv docker://quay.io/nf-core/proteinfold_colabfold:1.1.1
INFO: Converting OCI blobs to SIF format
WARNING: 'nodev' mount option set on /home/sibbe, it could be a source of failure during build process
INFO: Starting build...
Getting image source signatures
Copying blob 88736512a147 done
Copying blob b8138df4e1ba done
Copying blob fd8c8e606d5a done
Copying blob f71ae62ecc99 done
Copying blob 31aa59d3d040 done
Copying blob 49c3bdf35d0b done
Copying blob e539c791da43 done
Copying blob 30441c7be8c8 done
Copying blob 01eee1872ba2 done
Copying blob a6959b6b748e done
Copying blob 4f939c70e6f9 done
Copying blob 59d1161a1fa7 done
Copying blob 0c33efcececc done
Copying blob e80231677c9e done
Copying config 8b9d9e998c done
Writing manifest to image destination
Storing signatures
2025/06/17 15:36:18 info unpack layer: sha256:88736512a147458c580cd28c969698561f236abba2ef04dbf0d7940cb3d7375e
2025/06/17 15:36:20 warn xattr{etc/gshadow} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:36:20 warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/etc/gshadow} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:37:14 info unpack layer: sha256:b8138df4e1ba06fb27eca8c5f9e8cbec505ace92b6a39d95a9cdb3dd41d9e27c
2025/06/17 15:37:28 warn xattr{var/log/apt/term.log} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:37:28 warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/var/log/apt/term.log} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:37:29 info unpack layer: sha256:fd8c8e606d5ac9cfd50e26b29605966edb95130840944fc2b7f446c821a39d9b
2025/06/17 15:37:31 warn xattr{var/log/apt/term.log} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:37:31 warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/var/log/apt/term.log} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:37:31 info unpack layer: sha256:f71ae62ecc99cfc26782bded9ed4f452177c5e1becae5f3c8ca466866b92d8bc
2025/06/17 15:37:31 info unpack layer: sha256:31aa59d3d040dc3e3f742fbe5c8a5b35ec2894f5d86b7b00a7537b2eff5e92e0
2025/06/17 15:37:31 info unpack layer: sha256:49c3bdf35d0b8ce31ffa3b812ed961d8b6944a22c27303cedb06d54fcc38f8be
2025/06/17 15:38:23 warn xattr{var/log/apt/term.log} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:38:23 warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/var/log/apt/term.log} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:38:23 info unpack layer: sha256:e539c791da438bc8d926ed9b3e18a9ba518a24bb2fdf334a41a89b71fcdd97e0
2025/06/17 15:38:23 info unpack layer: sha256:30441c7be8c84c515a7b22e898cc779fbc1c35937d0a074882ab6739a149ebf8
2025/06/17 15:38:58 warn xattr{var/log/apt/term.log} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:38:58 warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/var/log/apt/term.log} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:38:58 info unpack layer: sha256:01eee1872ba2e4408b2d0e3b9391d59a237367eb089d70e88ac7cd9cef402bc3
2025/06/17 15:38:58 info unpack layer: sha256:a6959b6b748e70f00e5fbe6eee90908f1b3848083d26f3052d351cb0fe808f55
2025/06/17 15:38:59 warn xattr{etc/gshadow} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:38:59 warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/etc/gshadow} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:43:13 info unpack layer: sha256:4f939c70e6f9d7f4361c8c2a44a7af54b2e6f041e932b2184df430f3de13d685
2025/06/17 16:43:30 info unpack layer: sha256:59d1161a1fa70c753f21c3f0dc522a919df5130e7aa90765dbe4aa77abedda39
2025/06/17 16:57:45 info unpack layer: sha256:0c33efcececc8d17e23b398474a9dbe18dc09660140a2af987d7b970aee1dfa5
2025/06/17 16:57:45 info unpack layer: sha256:e80231677c9e7920e6b629f9f5e89a63ef3aaacaeccbf31ff50e0a93d586b3de
INFO: Creating SIF file...
INFO: Converting SIF file to temporary sandbox...
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (350) bind mounts
Singularity> nvidia-smi
Tue Jun 17 18:04:02 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:17:00.0 Off | Off |
| 30% 32C P8 26W / 300W | 17082MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:73:00.0 Off | Off |
| 30% 32C P8 28W / 300W | 5MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 352764 C ...da3/envs/gpu_cell_oracle/bin/python 17052MiB |
+-----------------------------------------------------------------------------------------+
Could you check that the singularity command used to run the process uses --nv? You can do it with the command below:
grep singularity /home/sibbe/scratch/work/02/5e952845978d419edfceb6deb0e34d/.command.run
This should show you the actual singularity command.
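If --nv is being passed correctly, the launch line that grep prints should contain it, roughly like the sketch below (the exact bind mounts and options on your system will differ; the image path is taken from the log above):
singularity exec --nv -B ... /home/sibbe/scratch/work/singularity/quay.io-nf-core-proteinfold_colabfold-1.1.1.img /bin/bash ...
If --nv is missing from that line, the GPU will not be exposed to the container regardless of the use_gpu setting.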
Another possible reason why the GPUs are not detected is that your HPC system may have either of the environment variables CUDA_VISIBLE_DEVICES or NVIDIA_VISIBLE_DEVICES set. To make them visible when running the process, you need to create a custom Nextflow config, e.g. custom.config, containing:
singularity.envWhitelist = "CUDA_VISIBLE_DEVICES,NVIDIA_VISIBLE_DEVICES"
Then add -c custom.config to the nextflow command you use to launch the pipeline.
I just read your comment again: since you already launch with -c config.config, you can simply add the singularity.envWhitelist line to your existing config.config.
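For reference, config.config with that line added would look something like this (everything except the envWhitelist line is copied from your existing file):
singularity.envWhitelist = "CUDA_VISIBLE_DEVICES,NVIDIA_VISIBLE_DEVICES"
process {
    withName: /NFCORE_PROTEINFOLD:PREPARE_COLABFOLD_DBS:ARIA2_.+:ARIA2/ {
        time = '12h'
    }
    withName: MMSEQS_COLABFOLDSEARCH {
        memory = '112 GB'
        cpus = 64
    }
}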
I did the same. My GPU was recognised when I ran the pipeline with mode "alphafold2" but not with "colabfold".
My nextflow.config looks like...
username=System.getenv('USER')
cleanup = 'true'
workDir="/data/users/${username}/Nextflow/Work"
launchDir = projectDir
env{
SINGULARITY_TMPDIR = '/tmp'
MPLCONFIGDIR = '/tmp'
NUMBA_CACHE_DIR = '/tmp'
MAFFT_TMPDIR = '/tmp'
}
singularity{
enabled = true
autoMounts = false
pullTimeout = "1h"
cacheDir = "/data/users/${username}/Nextflow/SingularityCache"
runOptions = "--nv -B /data/users/${username}:/data/users/${username} -B /data/users/databases:/data/users/databases -B \${TMP_LOCAL}:/tmp -B /cm/local/apps -B /home/${username}:/home/${username}"
envWhitelist = ['SINGULARITY_TMPDIR','MPLCONFIGDIR','NUMBA_CACHE_DIR', 'MAFFT_TMPDIR','LD_LIBRARY_PATH', 'CUDA_VISIBLE_DEVICES','NVIDIA_VISIBLE_DEVICES']
}
process {
executor = 'slurm'
queue = 'gpu'
clusterOptions = { " --gres=gpu:1" }
maxForks = 50
submitRateLimit = '5sec'
beforeScript = '''
module load singularity
'''
}
My YAML params file looks like...
input: '/data/users/robin.garcia/AlphaFold2/laccase_samplesheet.csv'
outdir: '/data/users/robin.garcia/AlphaFold2/my_out_lacc_colabfold/'
mode: 'colabfold'
colabfold_server: 'local'
colabfold_model_preset: 'alphafold2_ptm'
num_recycles_colabfold: 3
use_gpu: false
email: '[email protected]'
# Custom database paths
colabfold_db: '/data/users/robin.garcia/AlphaFold2/colabfold_DB'
colabfold_alphafold2_params_path: '/data/users/robin.garcia/AlphaFold2/colabfold_DB/params/alphafold_params_2021-07-14'
When I run it with the CPU, it works. But then it takes a very long time.
Sorry for the late response...
In your YAML you have use_gpu: false; it should be true.
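i.e. the line in the params YAML should read:
use_gpu: true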
Hey! Since I see this issue is still open:
I had the same error with a Docker profile, but I found out that it is a problem with the nvidia-container-toolkit default configuration.
This is the post I found.
You have to set the no-cgroups variable to false in /etc/nvidia-container-runtime/config.toml (the path might depend on your system). After discussing the issue with Claude, it seems that this variable is set to true by default on older systems, while newer systems use a different protocol.
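For reference, on my system the relevant part of config.toml looks roughly like this (the key sits under the [nvidia-container-cli] section; your file may be laid out slightly differently):
[nvidia-container-cli]
no-cgroups = false
You may need to restart the Docker daemon afterwards for the change to take effect.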
Hope this works out for you too, or for anybody having this same issue!