
GPU not recognised

Open Luke-ebbis opened this issue 7 months ago • 6 comments

Description of the bug

I am running my pipeline on a system with a single RTX 4090:

nvidia-smi 
Tue Jun 17 14:43:43 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:73:00.0 Off |                  Off |
|  0%   39C    P8             28W /  450W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

My run times out because it uses the CPU instead of the GPU (even though I have use_gpu set to true).

Jun-17 05:50:43.544 [TaskFinalizer-6] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PROTEINFOLD:COLABFOLD:COLABFOLD_BATCH (T1024)'

Caused by:
  Process exceeded running time limit (8h)


Command executed:

  ln -r -s params/alphafold_params_*/* params/
  colabfold_batch \
      --use-gpu-relax --amber --templates \
      --num-recycle 3 \
      --data $PWD \
      --model-type alphafold2_ptm \
      T1024.a3m \
      $PWD
  for i in `find *_relaxed_rank_001*.pdb`; do cp $i `echo $i | sed "s|_relaxed_rank_|	|g" | cut -f1`"_colabfold.pdb"; done
  for i in `find *.png -maxdepth 0`; do cp $i ${i%'.png'}_mqc.png; done
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PROTEINFOLD:COLABFOLD:COLABFOLD_BATCH":
      colabfold_batch: 1.5.2
  END_VERSIONS

Command exit status:
  -

Command output:
  2025-06-16 21:53:08,416 Running colabfold 1.5.5 (1648d2335943f9a483b6a803ebaea3e76162c788)
  2025-06-16 21:53:08,749 WARNING: no GPU detected, will be using CPU
  2025-06-16 21:53:08,823 Matplotlib created a temporary cache directory at /tmp/matplotlib-nphslddv because the default path (/home/sibbe/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
  2025-06-16 21:53:09,279 generated new fontManager
  2025-06-16 21:53:10,012 Found 9 citations for tools or databases
  2025-06-16 21:53:10,013 Query 1/1: T1024 (length 408)

Command error:
  INFO:    Converting SIF file to temporary sandbox...
  WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (350) bind mounts
  
    0%|          | 0/150 [elapsed: 00:00 remaining: ?]
  SUBMIT:   0%|          | 0/150 [elapsed: 00:00 remaining: ?]
  COMPLETE:   0%|          | 0/150 [elapsed: 00:00 remaining: ?]
  COMPLETE: 100%|██████████| 150/150 [elapsed: 00:00 remaining: 00:00]
  COMPLETE: 100%|██████████| 150/150 [elapsed: 00:01 remaining: 00:00]

Work dir:
  /home/sibbe/scratch/work/02/5e952845978d419edfceb6deb0e34d

Container:
  /home/sibbe/scratch/work/singularity/quay.io-nf-core-proteinfold_colabfold-1.1.1.img

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
Jun-17 05:50:43.552 [TaskFinalizer-6] INFO  nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
Jun-17 05:50:43.568 [Actor Thread 70] DEBUG nextflow.sort.BigSort - Sort completed -- entries: 7; slices: 1; internal sort time: 0.01 s; external sort time: 0.004 s; total time: 0.014 s
Jun-17 05:50:43.576 [Actor Thread 70] DEBUG nextflow.file.FileCollector - >> temp file exists? false
Jun-17 05:50:43.577 [Actor Thread 70] DEBUG nextflow.file.FileCollector - Missed collect-file cache -- cause: java.nio.file.NoSuchFileException: /home/sibbe/scratch/work/collect-file/f21e6a94c90aca6e96ff1aacd86fd989
Jun-17 05:50:43.593 [TaskFinalizer-6] ERROR nextflow.Nextflow - Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting
Jun-17 05:50:43.594 [main] DEBUG nextflow.Session - Session await > all processes finished
Jun-17 05:50:43.595 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: local) - terminating tasks monitor poll loop
Jun-17 05:50:43.595 [main] DEBUG nextflow.Session - Session await > all barriers passed
Jun-17 05:50:43.596 [Actor Thread 70] DEBUG nextflow.file.FileCollector - Saved collect-files list to: /home/sibbe/scratch/work/collect-file/f21e6a94c90aca6e96ff1aacd86fd989
Jun-17 05:50:43.601 [Actor Thread 70] DEBUG nextflow.file.FileCollector - Deleting file collector temp dir: /tmp/nxf-13813534474444119490
Jun-17 05:50:43.611 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'TaskFinalizer' shutdown completed (hard=false)
Jun-17 05:50:43.613 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'PublishDir' shutdown completed (hard=false)
Jun-17 05:50:43.626 [main] INFO  nextflow.Nextflow - -[nf-core/proteinfold] Pipeline completed with errors-
Jun-17 05:50:43.643 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=5; failedCount=1; ignoredCount=0; cachedCount=8; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=18d 20h 37m 52s; failedDuration=2d; cachedDuration=12h 21m 6s;loadCpus=0; loadMemory=0; peakRunning=2; peakCpus=128; peakMemory=224 GB; ]
Jun-17 05:50:43.643 [main] DEBUG nextflow.trace.TraceFileObserver - Workflow completed -- saving trace file
Jun-17 05:50:43.646 [main] DEBUG nextflow.trace.ReportObserver - Workflow completed -- rendering execution report
Jun-17 05:50:45.001 [main] DEBUG nextflow.trace.TimelineObserver - Workflow completed -- rendering execution timeline
Jun-17 05:50:45.475 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Jun-17 05:50:45.529 [main] INFO  org.pf4j.AbstractPluginManager - Stop plugin '[email protected]'
Jun-17 05:50:45.530 [main] DEBUG nextflow.plugin.BasePlugin - Plugin stopped nf-validation
Jun-17 05:50:45.532 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'FileTransfer' shutdown completed (hard=false)
Jun-17 05:50:45.533 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

Command used and terminal output

nextflow run nf-core/proteinfold \
      --input samples.csv \
      --outdir results \
      --mode colabfold \
      --colabfold_server local \
      --full_dbs true \
      --colabfold_model_preset "alphafold2_ptm" \
      --use_gpu true \
      -resume \
      -profile singularity \
      -c config.config

cat config.config

process { 
  withName: /NFCORE_PROTEINFOLD:PREPARE_COLABFOLD_DBS:ARIA2_.+:ARIA2/ {
    time = '12h'
  }

  withName: MMSEQS_COLABFOLDSEARCH {
    memory = '112 GB'
    cpus = 64
  }

}

Relevant files

No response

System information

No response

Luke-ebbis avatar Jun 17 '25 12:06 Luke-ebbis

When I check the availability of the GPU from inside the container, it is detected:

sibbe@binfgpu2:~:singularity shell  --nv docker://quay.io/nf-core/proteinfold_colabfold:1.1.1
INFO:    Converting OCI blobs to SIF format
WARNING: 'nodev' mount option set on /home/sibbe, it could be a source of failure during build process
INFO:    Starting build...
Getting image source signatures
Copying blob 88736512a147 done  
Copying blob b8138df4e1ba done  
Copying blob fd8c8e606d5a done  
Copying blob f71ae62ecc99 done  
Copying blob 31aa59d3d040 done  
Copying blob 49c3bdf35d0b done  
Copying blob e539c791da43 done  
Copying blob 30441c7be8c8 done  
Copying blob 01eee1872ba2 done  
Copying blob a6959b6b748e done  
Copying blob 4f939c70e6f9 done  
Copying blob 59d1161a1fa7 done  
Copying blob 0c33efcececc done  
Copying blob e80231677c9e done  
Copying config 8b9d9e998c done  
Writing manifest to image destination
Storing signatures
2025/06/17 15:36:18  info unpack layer: sha256:88736512a147458c580cd28c969698561f236abba2ef04dbf0d7940cb3d7375e
2025/06/17 15:36:20  warn xattr{etc/gshadow} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:36:20  warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/etc/gshadow} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:37:14  info unpack layer: sha256:b8138df4e1ba06fb27eca8c5f9e8cbec505ace92b6a39d95a9cdb3dd41d9e27c
2025/06/17 15:37:28  warn xattr{var/log/apt/term.log} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:37:28  warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/var/log/apt/term.log} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:37:29  info unpack layer: sha256:fd8c8e606d5ac9cfd50e26b29605966edb95130840944fc2b7f446c821a39d9b
2025/06/17 15:37:31  warn xattr{var/log/apt/term.log} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:37:31  warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/var/log/apt/term.log} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:37:31  info unpack layer: sha256:f71ae62ecc99cfc26782bded9ed4f452177c5e1becae5f3c8ca466866b92d8bc
2025/06/17 15:37:31  info unpack layer: sha256:31aa59d3d040dc3e3f742fbe5c8a5b35ec2894f5d86b7b00a7537b2eff5e92e0
2025/06/17 15:37:31  info unpack layer: sha256:49c3bdf35d0b8ce31ffa3b812ed961d8b6944a22c27303cedb06d54fcc38f8be
2025/06/17 15:38:23  warn xattr{var/log/apt/term.log} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:38:23  warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/var/log/apt/term.log} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:38:23  info unpack layer: sha256:e539c791da438bc8d926ed9b3e18a9ba518a24bb2fdf334a41a89b71fcdd97e0
2025/06/17 15:38:23  info unpack layer: sha256:30441c7be8c84c515a7b22e898cc779fbc1c35937d0a074882ab6739a149ebf8
2025/06/17 15:38:58  warn xattr{var/log/apt/term.log} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:38:58  warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/var/log/apt/term.log} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:38:58  info unpack layer: sha256:01eee1872ba2e4408b2d0e3b9391d59a237367eb089d70e88ac7cd9cef402bc3
2025/06/17 15:38:58  info unpack layer: sha256:a6959b6b748e70f00e5fbe6eee90908f1b3848083d26f3052d351cb0fe808f55
2025/06/17 15:38:59  warn xattr{etc/gshadow} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/06/17 15:38:59  warn xattr{/home/sibbe/NOBINFBACKUP/singularity/tmp/build-temp-866151607/rootfs/etc/gshadow} destination filesystem does not support xattrs, further warnings will be suppressed
2025/06/17 15:43:13  info unpack layer: sha256:4f939c70e6f9d7f4361c8c2a44a7af54b2e6f041e932b2184df430f3de13d685
2025/06/17 16:43:30  info unpack layer: sha256:59d1161a1fa70c753f21c3f0dc522a919df5130e7aa90765dbe4aa77abedda39
2025/06/17 16:57:45  info unpack layer: sha256:0c33efcececc8d17e23b398474a9dbe18dc09660140a2af987d7b970aee1dfa5
2025/06/17 16:57:45  info unpack layer: sha256:e80231677c9e7920e6b629f9f5e89a63ef3aaacaeccbf31ff50e0a93d586b3de
INFO:    Creating SIF file...
INFO:    Converting SIF file to temporary sandbox...
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (350) bind mounts
Singularity> nvidia-smi 
Tue Jun 17 18:04:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:17:00.0 Off |                  Off |
| 30%   32C    P8             26W /  300W |   17082MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00000000:73:00.0 Off |                  Off |
| 30%   32C    P8             28W /  300W |       5MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    352764      C   ...da3/envs/gpu_cell_oracle/bin/python      17052MiB |
+-----------------------------------------------------------------------------------------+

Luke-ebbis avatar Jun 17 '25 16:06 Luke-ebbis

Could you check that the singularity command used to run the process includes --nv? You can do that with the command below:

grep singularity /home/sibbe/scratch/work/02/5e952845978d419edfceb6deb0e34d/.command.run

This should show you the actual singularity command. Another possible reason why the GPU is not detected is that your HPC system may have either of the environment variables CUDA_VISIBLE_DEVICES or NVIDIA_VISIBLE_DEVICES set. To make them visible when running the process, you need to create a custom Nextflow config, e.g. custom.config, that contains:

singularity.envWhitelist = "CUDA_VISIBLE_DEVICES,NVIDIA_VISIBLE_DEVICES"

Then add -c custom.config to the nextflow command that you use to launch the pipeline.
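For illustration, the launch command from the report above with the extra config file appended (Nextflow accepts multiple -c options and merges them; only the last line is new):

nextflow run nf-core/proteinfold \
      --input samples.csv \
      --outdir results \
      --mode colabfold \
      --colabfold_server local \
      --full_dbs true \
      --colabfold_model_preset "alphafold2_ptm" \
      --use_gpu true \
      -resume \
      -profile singularity \
      -c config.config \
      -c custom.config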

JoseEspinosa avatar Jul 02 '25 09:07 JoseEspinosa

Just read your comment again: you can add singularity.envWhitelist directly to your existing config.config instead of using a separate file.
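A sketch of the merged config.config, based on the one in the original report (the ARIA2 time override is omitted here for brevity):

// pass the GPU visibility variables through to the container
singularity.envWhitelist = "CUDA_VISIBLE_DEVICES,NVIDIA_VISIBLE_DEVICES"

process {
    withName: MMSEQS_COLABFOLDSEARCH {
        memory = '112 GB'
        cpus = 64
    }
}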

JoseEspinosa avatar Jul 02 '25 09:07 JoseEspinosa

I did the same. My GPU was recognised when I ran the pipeline with mode "alphafold2" but not with "colabfold".

My nextflow.config looks like...

username = System.getenv('USER')
cleanup = 'true'
workDir = "/data/users/${username}/Nextflow/Work"
launchDir = projectDir

env {
    SINGULARITY_TMPDIR = '/tmp'
    MPLCONFIGDIR = '/tmp'
    NUMBA_CACHE_DIR = '/tmp'
    MAFFT_TMPDIR = '/tmp'
}

singularity {
    enabled = true
    autoMounts = false
    pullTimeout = "1h"
    cacheDir = "/data/users/${username}/Nextflow/SingularityCache"
    runOptions = "--nv -B /data/users/${username}:/data/users/${username} -B /data/users/databases:/data/users/databases -B \${TMP_LOCAL}:/tmp -B /cm/local/apps -B /home/${username}:/home/${username}"
    envWhitelist = ['SINGULARITY_TMPDIR', 'MPLCONFIGDIR', 'NUMBA_CACHE_DIR', 'MAFFT_TMPDIR', 'LD_LIBRARY_PATH', 'CUDA_VISIBLE_DEVICES', 'NVIDIA_VISIBLE_DEVICES']
}

process {
    executor = 'slurm'
    queue = 'gpu'
    clusterOptions = { " --gres=gpu:1" }
    maxForks = 50
    submitRateLimit = '5sec'
    beforeScript = '''
    module load singularity
    '''
}

My YAML file looks like...

input: '/data/users/robin.garcia/AlphaFold2/laccase_samplesheet.csv'
outdir: '/data/users/robin.garcia/AlphaFold2/my_out_lacc_colabfold/'
mode: 'colabfold'
colabfold_server: 'local'
colabfold_model_preset: 'alphafold2_ptm'
num_recycles_colabfold: 3
use_gpu: false
email: '[email protected]'

# Custom database paths
colabfold_db: '/data/users/robin.garcia/AlphaFold2/colabfold_DB'
colabfold_alphafold2_params_path: '/data/users/robin.garcia/AlphaFold2/colabfold_DB/params/alphafold_params_2021-07-14'

When I run the pipeline on the CPU it works, but then it takes a very long time.

robingarcia avatar Jul 07 '25 05:07 robingarcia

Sorry for the late response... In your YAML you have use_gpu: false and it should be true.
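That is, the corrected line in the parameters YAML above:

use_gpu: true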

JoseEspinosa avatar Jul 29 '25 12:07 JoseEspinosa

Hey! Since I see this issue is still open: I had the same error with a docker profile, but I found out that it is a problem with the default nvidia-container-toolkit configuration. This is the post I found. You have to set the no-cgroups variable to false in /etc/nvidia-container-runtime/config.toml (the path might depend on your system). After discussing the issue with Claude, it seems that this variable is set to true by default on systems that are starting to get older, while newer systems use another protocol.
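For reference, a sketch of the relevant section of that file (exact contents vary with the toolkit version):

# /etc/nvidia-container-runtime/config.toml
[nvidia-container-cli]
# with no-cgroups = true the runtime skips device cgroup setup,
# which can leave the GPU invisible inside the container
no-cgroups = false

After editing the file, restarting the docker daemon (e.g. sudo systemctl restart docker) should make the change take effect.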

Hope this works out for you too, or for anybody having this same issue!

mosotelo avatar Nov 09 '25 20:11 mosotelo