GPU solver doesn't work on cluster with A100 GPU

Open qianggao-lab opened this issue 8 months ago • 1 comment

Hi, I am trying to solve a large-scale SDP with SCS, but it converges too slowly on the CPU, so I want to use the GPU version to get some speedup. I first tested the GPU version on a small problem on my laptop (Windows 11, RTX 4080 GPU), where it works perfectly:

------------------------------------------------------------------
	       SCS v3.2.5 - Splitting Conic Solver
	(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem:  variables n: 4875, constraints m: 594271
cones: 	  z: primal zero / dual free vars: 31
	  l: linear vars: 30
	  q: soc vars: 0, qsize: 1
	  s: psd vars: 594210, ssize: 182
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
	  alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
	  max_iters: 100000, normalize: 1, rho_x: 1.00e-06
	  acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-indirect GPU
	  nnz(A): 138701, nnz(P): 0
------------------------------------------------------------------
 iter | pri res | dua res |   gap   |   obj   |  scale  | time (s)
------------------------------------------------------------------
     0| 7.11e+00  6.71e+01  4.04e+02 -3.49e+02  1.00e-01  2.06e+00 
   250| 7.03e-04  5.26e-04  1.46e-03  4.16e-02  3.14e-02  1.11e+02 
   475| 5.59e-05  2.10e-04  7.97e-05  4.22e-02  3.14e-02  1.90e+02 
------------------------------------------------------------------
status:  solved
timings: total: 1.90e+02s = setup: 2.54e-01s + solve: 1.90e+02s
	 lin-sys: 1.26e+02s, cones: 5.84e+01s, accel: 2.88e-01s
------------------------------------------------------------------
objective = 0.042202
------------------------------------------------------------------
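As a rough reading of the timings lines in this log, the share of the total time spent in each reported component can be computed directly. The sketch below only does that arithmetic; the function name `timing_shares` is hypothetical and the numbers are copied from the log.

```shell
#!/usr/bin/env bash
# Sketch: percentage of total time per component, from the "timings" lines.
# timing_shares is a hypothetical helper; arguments are in seconds.
timing_shares() {
  awk -v t="$1" -v l="$2" -v c="$3" -v a="$4" 'BEGIN {
    printf "lin-sys: %.1f%%\n", 100 * l / t
    printf "cones: %.1f%%\n",   100 * c / t
    printf "accel: %.1f%%\n",   100 * a / t
  }'
}

# total, lin-sys, cones, accel as printed by SCS above
timing_shares 1.90e+02 1.26e+02 5.84e+01 2.88e-01
# prints:
# lin-sys: 66.3%
# cones: 30.7%
# accel: 0.2%
```

On this small problem the linear-system solve accounts for roughly two thirds of the run time and cone projection for most of the rest.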

The run above calls SCS via YALMIP in MATLAB R2023b. The installed CUDA version is 12.9. Inside MATLAB, gpuDevice shows:

>> gpuDevice

ans = 

  CUDADevice with properties:

                      Name: 'NVIDIA GeForce RTX 4080 Laptop GPU'
                     Index: 1
         ComputeCapability: '8.9'
            SupportsDouble: 1
     GraphicsDriverVersion: '576.02'
               DriverModel: 'WDDM'
            ToolkitVersion: 11.8000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152 (49.15 KB)
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 12878086144 (12.88 GB)
           AvailableMemory: 11573702656 (11.57 GB)
               CachePolicy: 'balanced'
       MultiprocessorCount: 58
              ClockRateKHz: 1665000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

However, when I tried the same procedure on a cluster with an A100 GPU, the solver didn't even run (it just printed "-------------"). The only change I made was replacing the path to the CUDA folder in compile_gpu.m (from scs-matlab) with the corresponding path on the cluster. System information:

$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.9 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.9 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.9"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"

On the cluster I am using MATLAB R2022b, and gpuDevice shows:

>> gpuDevice

ans = 

  CUDADevice with properties:

                      Name: 'NVIDIA A100-SXM4-40GB MIG 3g.20gb'
                     Index: 1
         ComputeCapability: '8.0'
            SupportsDouble: 1
             DriverVersion: 12.5000
            ToolkitVersion: 11.2000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152 (49.15 KB)
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 21072183296 (21.07 GB)
           AvailableMemory: 20629553152 (20.63 GB)
       MultiprocessorCount: 42
              ClockRateKHz: 1410000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

I have the following modules loaded on the cluster:

$ module list

Currently Loaded Modules:
  1) gmp/6.3.0-fasrc01   2) mpfr/4.2.1-fasrc01   3) mpc/1.3.1-fasrc02   4) cuda/12.4.1-fasrc01   5) gcc/14.2.0-fasrc01
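
The two gpuDevice listings report a MATLAB ToolkitVersion (11.8 on the laptop, 11.2 on the cluster) that differs from the CUDA version used to compile, and the cluster device is a MIG slice. One way to gather the facts relevant to a possible build/runtime mismatch is sketched below. It is a minimal diagnostic under stated assumptions, not a known fix: the function name `check_gpu_env` is hypothetical, and the MEX file path must be supplied by the user (`.mexa64` is the Linux MEX extension; the exact file name produced by compile_gpu.m is an assumption here).

```shell
#!/usr/bin/env bash
# Sketch: collect environment facts for a CUDA build/runtime mismatch.
# check_gpu_env is a hypothetical helper; the MEX path is a placeholder.
check_gpu_env() {
  local mex_file="$1"

  # GPUs (and MIG devices) that the driver exposes to this job.
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi: failed to query the driver"
  else
    echo "nvidia-smi: not found on PATH"
  fi

  # Toolkit version that the loaded cuda module puts on PATH.
  if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | tail -n 2
  else
    echo "nvcc: not found on PATH"
  fi

  # Shared libraries the compiled MEX file links against, filtered for
  # CUDA-related names and for anything the loader cannot resolve.
  if [ -f "$mex_file" ]; then
    ldd "$mex_file" | grep -Ei 'cuda|cublas|cusparse|not found' \
      || echo "ldd: no CUDA-related libraries listed"
  else
    echo "MEX file not found: $mex_file"
  fi
}
```

For example, `check_gpu_env /path/to/scs_gpu.mexa64` (hypothetical path) run in the same shell session that launches MATLAB would show whether the libraries resolve in that environment.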

I already tried running on multi-core CPUs, which takes far too long to converge, so GPU acceleration may be the only way for me to go.

May 11 '25 05:05 qianggao-lab

We have a new, faster (direct) GPU linear system solver, though unfortunately I don't think it is plumbed through to the MATLAB interface yet.

Jul 31 '25 15:07 bodono