scs
./out/demo_socp_gpu fails to solve its problem
Specifications
- OS: Arch Linux
- SCS Version: master at 5be0e1684d12c4cfd4d22c5fba236a84a092ab5b
- Compiler: gcc
Description
scs fails to solve the problem generated by ./out/demo_socp_gpu 1000 0.5 0.5 1
How to reproduce
linking against julia openblas:
JULIA_HOME="/opt/julias/julia-1.6"
JULIA_LD_PATH="$JULIA_HOME/lib/julia"
BLASLDFLAGS="-L$JULIA_LD_PATH -lopenblas64_"
SCSFLAGS="USE_OPENMP=1 BLAS64=1 BLASSUFFIX=_64_"
make -j4 CFLAGS="-march=native" DLONG=0 ${SCSFLAGS} BLASLDFLAGS="${BLASLDFLAGS}" gpu
then running it via
LD_LIBRARY_PATH=$JULIA_LD_PATH:$LD_LIBRARY_PATH ./out/demo_socp_gpu 1000 0.5 0.5 1
Additional information
similarly compiled direct and indirect solvers (cpu) work just fine
Output
seed : 1
A is 4000 by 1000, with 32 nonzeros per column.
A has 32000 nonzeros (0.800000% dense).
Nonzeros of A take 0.000238 GB of storage.
Row idxs of A take 0.000119 GB of storage.
Col ptrs of A take 0.000004 GB of storage.
ScsCone information:
Zero cone rows: 2000
LP cone rows: 2000
Number of second-order cones: 0, covering 0 rows, with sizes
[]
Number of rows covered is 4000 out of 4000.
true pri opt = 2022.070521
true dua opt = 2022.070521
------------------------------------------------------------------
SCS v3.0.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 1000, constraints m: 4000
cones: z: primal zero / dual free vars: 2000
l: linear vars: 2000
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 100000, normalize: 1, warm_start: 0
acceleration_lookback: 10, acceleration_interval: 10
lin-sys: sparse-indirect GPU
nnz(A): 32000, nnz(P): 0
------------------------------------------------------------------
iter | pri res | dua res | gap | obj | scale | time (s)
------------------------------------------------------------------
0| 6.90e+00 9.46e+01 3.33e+04 -1.66e+04 1.00e-01 1.03e-03
250| 1.76e+04 4.31e+01 1.23e+04 -6.15e+03 1.00e-01 1.65e-01
500| 2.74e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 3.29e-01
750| 1.57e+04 4.26e+01 1.23e+04 -6.16e+03 1.00e-01 4.94e-01
1000| 1.64e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 6.85e-01
1250| 4.30e+21 2.67e+22 6.54e+22 -3.27e+22 1.00e-01 8.48e-01
1500| 1.90e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 9.48e-01
1750| 2.14e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 1.04e+00
2000| 2.48e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 1.13e+00
2250| 6.45e+20 2.19e+22 4.21e+22 2.11e+22 1.00e-01 1.22e+00
2500| 2.07e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 1.30e+00
2750| 2.53e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 1.39e+00
3000| 2.02e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 1.48e+00
3250| 5.72e+20 3.01e+22 3.73e+22 1.87e+22 1.00e-01 1.57e+00
3500| 2.09e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 1.66e+00
3750| 2.43e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 1.75e+00
4000| 2.31e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 1.84e+00
[ ... ]
99500| 2.48e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 3.65e+01
99750| 2.48e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 3.67e+01
100000| 2.48e+04 4.29e+01 1.23e+04 -6.16e+03 1.00e-01 3.68e+01
------------------------------------------------------------------
status: solved (inaccurate - reached max_iters)
timings: total: 3.68e+01s = setup: 5.47e-02s + solve: 3.68e+01s
lin-sys: 3.16e+01s, cones: 7.88e-01s, accel: 4.77e-01s
------------------------------------------------------------------
objective = -6159.028853 (inaccurate)
------------------------------------------------------------------
true pri opt = 2022.070521
true dua opt = 2022.070521
scs pri obj= 0.000000
scs dua obj = -12318.057707
Thanks for posting. I am unable to reproduce this, when I run the command I get:
2021-10-16 14:47:37 (base) 0 bodonoghue@bodonoghue-[]-~/git/scs:
└──[ins] => out/demo_socp_gpu_indirect 1000 0.5 0.5 1
seed : 1
A is 4000 by 1000, with 32 nonzeros per column.
A has 32000 nonzeros (0.800000% dense).
Nonzeros of A take 0.000238 GB of storage.
Row idxs of A take 0.000119 GB of storage.
Col ptrs of A take 0.000004 GB of storage.
ScsCone information:
Zero cone rows: 2000
LP cone rows: 2000
Number of second-order cones: 0, covering 0 rows, with sizes
[]
Number of rows covered is 4000 out of 4000.
true pri opt = 2022.070521
true dua opt = 2022.070521
------------------------------------------------------------------
SCS v3.0.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 1000, constraints m: 4000
cones: z: primal zero / dual free vars: 2000
l: linear vars: 2000
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 100000, normalize: 1, warm_start: 0
acceleration_lookback: 10, acceleration_interval: 10
lin-sys: sparse-indirect GPU
nnz(A): 32000, nnz(P): 0
------------------------------------------------------------------
iter | pri res | dua res | gap | obj | scale | time (s)
------------------------------------------------------------------
0| 6.90e+00 7.44e+00 2.65e+02 3.90e+03 1.00e-01 2.11e-02
25| 3.80e-06 3.17e-04 3.36e-03 2.02e+03 1.00e-01 1.08e-01
------------------------------------------------------------------
status: solved
timings: total: 6.66e-01s = setup: 5.58e-01s + solve: 1.08e-01s
lin-sys: 8.57e-02s, cones: 2.84e-04s, accel: 6.22e-05s
------------------------------------------------------------------
objective = 2022.072100
------------------------------------------------------------------
true pri opt = 2022.070521
true dua opt = 2022.070521
scs pri obj= 2022.070419
scs dua obj = 2022.073782
It might be the case that you are missing the gpu fixes I submitted here: https://github.com/cvxgrp/scs/commit/13e675d8c1f17e8f1e184281b25b8196c4ac74da.
I did not cut a new release / tag with those fixes. Is that the issue?
By the way, you can better test the gpu using:
make purge
make test_gpu
out/run_tests_gpu_indirect
I'm on master as of 5be0e1684d12c4cfd4d22c5fba236a84a092ab5b. I have CUDA_PATH=/opt/cuda in my env, pointing to cuda-11.4.2.
I compiled scs with
make purge
make test_gpu
as advised and then tested it with ./out/run_tests_gpu_indirect. Here is what I get:
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -DINDIRECT=1 -c src/scs.c -o src/scs_indir.o
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -c -o src/util.o src/util.c
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -c -o src/cones.o src/cones.c
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -c -o src/aa.o src/aa.c
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -c -o src/rw.o src/rw.c
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -c -o src/linalg.o src/linalg.c
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -c -o src/ctrlc.o src/ctrlc.c
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -c -o src/scs_version.o src/scs_version.c
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -c -o src/normalize.o src/normalize.c
cc -c -o linsys/gpu/indirect/private.o linsys/gpu/indirect/private.c -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -I/opt/cuda/include -Ilinsys/gpu -Wno-c++11-long-long -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -c -o linsys/scs_matrix.o linsys/scs_matrix.c
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -c -o linsys/csparse.o linsys/csparse.c
mkdir -p out
ar rv out/libscsgpuindir.a src/scs_indir.o src/util.o src/cones.o src/aa.o src/rw.o src/linalg.o src/ctrlc.o src/scs_version.o src/normalize.o linsys/gpu/indirect/private.o linsys/scs_matrix.o linsys/csparse.o linsys/gpu/gpu.o
ar: creating out/libscsgpuindir.a
a - src/scs_indir.o
a - src/util.o
a - src/cones.o
a - src/aa.o
a - src/rw.o
a - src/linalg.o
a - src/ctrlc.o
a - src/scs_version.o
a - src/normalize.o
a - linsys/gpu/indirect/private.o
a - linsys/scs_matrix.o
a - linsys/csparse.o
a - linsys/gpu/gpu.o
ranlib out/libscsgpuindir.a
cc -g -Wall -Wwrite-strings -pedantic -funroll-loops -Wstrict-prototypes -I. -Iinclude -Ilinsys -O3 -fPIC -DCTRLC=1 -DCOPYAMATRIX=1 -DGPU_TRANSPOSE_MAT=1 -DUSE_LAPACK -o out/run_tests_gpu_indirect test/run_tests.c out/libscsgpuindir.a -lm -lrt -lblas -llapack -L/opt/cuda/lib -L/opt/cuda/lib64 -lcudart -lcublas -lcusparse -Itest
test_fails
Testing that SCS handles bad inputs correctly:eps_abs tolerance must be positive
ERROR: Validation returned failure
Failure:could not initialize work
degenerate
------------------------------------------------------------------
SCS v3.0.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 2, constraints m: 4
cones: l: linear vars: 4
settings: eps_abs: 1.0e-06, eps_rel: 1.0e-06, eps_infeas: 1.0e-09
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 100000, normalize: 1, warm_start: 0
acceleration_lookback: 10, acceleration_interval: 10
lin-sys: sparse-indirect GPU
nnz(A): 4, nnz(P): 2
------------------------------------------------------------------
iter | pri res | dua res | gap | obj | scale | time (s)
------------------------------------------------------------------
0| 2.10e+01 2.00e+00 7.90e+00 -3.95e+00 1.00e-01 1.47e-04
250| 5.69e+11 2.00e+00 0.00e+00 0.00e+00 1.00e+06 2.53e-02
500| 5.69e+11 2.00e+00 0.00e+00 0.00e+00 1.00e+06 5.54e-02
750| 5.69e+11 2.00e+00 0.00e+00 0.00e+00 1.00e+06 7.65e-02
1000| 5.69e+11 2.00e+00 0.00e+00 0.00e+00 1.00e+06 9.70e-02
1250| 5.69e+11 2.00e+00 0.00e+00 0.00e+00 1.00e+06 1.18e-01
1500| 5.69e+11 2.00e+00 0.00e+00 0.00e+00 1.00e+06 1.39e-01
1750| 5.69e+11 2.00e+00 0.00e+00 0.00e+00 1.00e+06 1.60e-01
2000| 5.69e+11 2.00e+00 0.00e+00 0.00e+00 1.00e+06 1.81e-01
2250| 5.69e+11 2.00e+00 0.00e+00 0.00e+00 1.00e+06 2.02e-01
[...]
99750| 5.69e+11 2.00e+00 0.00e+00 0.00e+00 1.00e+06 7.39e+00
100000| 5.69e+11 2.00e+00 0.00e+00 0.00e+00 1.00e+06 7.41e+00
------------------------------------------------------------------
status: solved (inaccurate - reached max_iters)
timings: total: 7.45e+00s = setup: 4.52e-02s + solve: 7.41e+00s
lin-sys: 7.25e+00s, cones: 2.01e-02s, accel: 8.37e-02s
------------------------------------------------------------------
objective = 0.000000 (inaccurate)
------------------------------------------------------------------
INVALID STATUS
Tests run: 2
no fancy options, no julia-shipped blas ;)
~/local/src/scs master ldd ./out/run_tests_gpu_indirect
linux-vdso.so.1 (0x00007ffcff3ba000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007f12b0400000)
librt.so.1 => /usr/lib/librt.so.1 (0x00007f12b03f5000)
libopenblas.so.3 => /usr/lib/libopenblas.so.3 (0x00007f12aefd5000)
liblapack.so.3 => /usr/lib/liblapack.so.3 (0x00007f12ae90b000)
libcudart.so.11.0 => /opt/cuda/lib64/libcudart.so.11.0 (0x00007f12ae669000)
libcublas.so.11 => /opt/cuda/lib64/libcublas.so.11 (0x00007f12a52b5000)
libcusparse.so.11 => /opt/cuda/lib64/libcusparse.so.11 (0x00007f1296ec8000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f1296cfc000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f12b0597000)
libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f1296cdb000)
libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007f1296c97000)
libgfortran.so.5 => /usr/lib/libgfortran.so.5 (0x00007f12969db000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f12969c0000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f12969b7000)
libcublasLt.so.11 => /opt/cuda/lib64/libcublasLt.so.11 (0x00007f1282fbb000)
libquadmath.so.0 => /usr/lib/../lib/libquadmath.so.0 (0x00007f1282f70000)
That's strange, I cannot reproduce this on the only gpu machine I have access to. Can you try disabling the AA? You can do it by changing ACCELERATION_LOOKBACK to 0 in include/glbopts.h, which will disable it for the tests that do not specify it manually, and it should be clear if that's the issue.
Here's what my ldd looks like, I don't see any major differences to yours:
└──[ins] => ldd out/run_tests_gpu_indirect
linux-vdso.so.1 (0x00007ffc11d05000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f7c3fcf9000)
libblas.so.3 => /usr/lib/x86_64-linux-gnu/libblas.so.3 (0x00007f7c3fc97000)
liblapack.so.3 => /usr/lib/x86_64-linux-gnu/liblapack.so.3 (0x00007f7c3f5fa000)
libcudart.so.11.0 => /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudart.so.11.0 (0x00007f7c3f375000)
libcublas.so.11 => /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcublas.so.11 (0x00007f7c37e9a000)
libcusparse.so.11 => /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcusparse.so.11 (0x00007f7c29e1c000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7c29c55000)
/lib64/ld-linux-x86-64.so.2 (0x00007f7c3fe94000)
libopenblas.so.0 => /usr/lib/x86_64-linux-gnu/libopenblas.so.0 (0x00007f7c2781e000)
libgfortran.so.5 => /usr/lib/x86_64-linux-gnu/libgfortran.so.5 (0x00007f7c27574000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f7c2756e000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f7c2754d000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f7c27542000)
libcublasLt.so.11 => /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcublasLt.so.11 (0x00007f7c19776000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f7c1956a000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f7c19550000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f7c19507000)
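Comparing two ldd listings by eye is error-prone, so a small script can help. The following is a minimal sketch in Python, not part of SCS; `parse_ldd` and `compare` are hypothetical helper names, and the sample lines are a subset copied from the two listings above. It reports libraries present on only one side and libraries whose soname version differs. Any difference it finds is only a lead to investigate, not a diagnosis.

```python
import re

# A few lines copied from the reporter's and the maintainer's ldd output above.
REPORTER = """
libopenblas.so.3 => /usr/lib/libopenblas.so.3 (0x00007f12aefd5000)
liblapack.so.3 => /usr/lib/liblapack.so.3 (0x00007f12ae90b000)
libcudart.so.11.0 => /opt/cuda/lib64/libcudart.so.11.0 (0x00007f12ae669000)
libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007f1296c97000)
"""

MAINTAINER = """
libblas.so.3 => /usr/lib/x86_64-linux-gnu/libblas.so.3 (0x00007f7c3fc97000)
liblapack.so.3 => /usr/lib/x86_64-linux-gnu/liblapack.so.3 (0x00007f7c3f5fa000)
libcudart.so.11.0 => /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudart.so.11.0 (0x00007f7c3f375000)
libopenblas.so.0 => /usr/lib/x86_64-linux-gnu/libopenblas.so.0 (0x00007f7c2781e000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f7c1956a000)
"""


def parse_ldd(text):
    """Map a library base name (e.g. 'libcudart') to (soname, resolved path)."""
    libs = {}
    for line in text.splitlines():
        m = re.match(r"(\S+) => (\S+) \(0x[0-9a-f]+\)", line.strip())
        if not m:
            continue  # e.g. linux-vdso or a bare loader line without '=>'
        soname, path = m.groups()
        if soname.startswith("/"):
            continue  # the dynamic loader is listed by absolute path; skip it
        libs[soname.split(".so")[0]] = (soname, path)
    return libs


def compare(a, b):
    """Return (only in a, only in b, shared libraries with different sonames)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    soname_diff = sorted(k for k in set(a) & set(b) if a[k][0] != b[k][0])
    return only_a, only_b, soname_diff


only_reporter, only_maintainer, version_diff = compare(
    parse_ldd(REPORTER), parse_ldd(MAINTAINER)
)
print("only in reporter's build:  ", only_reporter)
print("only in maintainer's build:", only_maintainer)
print("different soname versions: ", version_diff)
```

To use it on real output, paste the complete listings into the two strings, or read them from files saved with `ldd out/run_tests_gpu_indirect > ldd.txt`.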
Can you try running
valgrind --leak-check=full out/run_tests_gpu_indirect
it likely won't help (and is very noisy for gpus) but just in case.
I disabled AA but it changed just the numerical values in the log, not the behaviour; here's valgrind log: https://gist.github.com/kalmarek/adb225c93de2bb8d9a7032caec42eea9
I think the problem is somewhere in problem generation (before scs), since the header looks like this:
test_fails
Testing that SCS handles bad inputs correctly:eps_abs tolerance must be positive
ERROR: Validation returned failure
Failure:could not initialize work
degenerate
------------------------------------------------------------------
SCS v3.0.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 2, constraints m: 4
cones: l: linear vars: 4
settings: eps_abs: 1.0e-06, eps_rel: 1.0e-06, eps_infeas: 1.0e-09
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 100000, normalize: 1, warm_start: 0
lin-sys: sparse-indirect GPU
nnz(A): 4, nnz(P): 2
i.e. first non positive eps_abs and then a problem with 2 variables and 4 constraints?
That's just the output of the first test, which is testing data validation and is working correctly. You will see the same if you run the non-gpu tests with out/run_tests_direct. The first real problem is a tiny LP with 2 vars and 4 constraints.
I have the same problem as @kalmarek.
That's just the output of the first test which is testing data validation and is working correctly. You will see the same if you run the non gpu tests with out/run_tests_direct. The first real problem is a tiny lp with 2 vars and 4 constraints.
yeah, maybe I should try to compare with run_tests_direct first ;)
@bodono: so I set VERBOSITY=2 and it seems that cg is never run successfully. Those cuda errors
linsys/gpu/indirect/private.c:506:scs_solve_lin_sys
ERROR_CUDA (#): invalid argument
seem to go away if I replace the macro-expanded CUBLAS(name) with the appropriate function name, but the end result is the same. I literally have no idea what I am doing ;), but if you could suggest how to diagnose it next I'd be glad!
**********************************************************
Running test: test_validation
Testing that SCS handles bad inputs correctly:
eps_abs tolerance must be positive
ERROR: Validation returned failure
size of scs_int = 4, size of scs_float = 8
Failure:could not initialize work
**********************************************************
**********************************************************
Running test: degenerate
------------------------------------------------------------------
SCS v3.0.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 2, constraints m: 4
cones: l: linear vars: 4
settings: eps_abs: 1.0e-06, eps_rel: 1.0e-06, eps_infeas: 1.0e-09
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 50, normalize: 1, warm_start: 0
acceleration_lookback: 10, acceleration_interval: 10
lin-sys: sparse-indirect GPU
nnz(A): 4, nnz(P): 2
getting pre-conditioner
finished getting pre-conditioner
size of scs_int = 4, size of scs_float = 8
linsys/gpu/indirect/private.c:506:scs_solve_lin_sys
ERROR_CUDA (#): invalid argument
tol 1.000e-12
cg_its 0
------------------------------------------------------------------
iter | pri res | dua res | gap | obj | scale | time (s)
------------------------------------------------------------------
0| 2.10e+01 2.00e+00 7.90e+00 -3.95e+00 1.00e-01 3.27e-04
Norm u = 2.306122, Norm u_t = 1.492570, Norm v = 1.939709, Norm x = 0.000000, Norm y = 4.450789, Norm s = 22.360680, Norm |Ax + s| = 2.24e+01, tau = 1.000000, kappa = 0.000000, |u - u_t| = 1.11e+00, res_infeas = nan, res_unbdd_a = nan, res_unbdd_p = nan, ctx_tau = 0.00e+00, bty_tau = 7.90e+00
linsys/gpu/indirect/private.c:506:scs_solve_lin_sys
ERROR_CUDA (#): invalid argument
tol 1.000e-12
cg_its 0
1| 3.68e+01 2.00e+00 0.00e+00 0.00e+00 1.00e-01 6.66e-04
Norm u = 17.210439, Norm u_t = 18.766100, Norm v = 29.666025, Norm x = 0.000000, Norm y = 0.000000, Norm s = 877.991704, Norm |Ax + s| = 8.78e+02, tau = 17.210439, kappa = 0.000000, |u - u_t| = 1.81e+01, res_infeas = nan, res_unbdd_a = nan, res_unbdd_p = nan, ctx_tau = 0.00e+00, bty_tau = 0.00e+00
linsys/gpu/indirect/private.c:506:scs_solve_lin_sys
ERROR_CUDA (#): invalid argument
tol 1.000e-12
cg_its 0
2| 9.46e+01 2.00e+00 0.00e+00 0.00e+00 1.00e-01 1.37e-03
Norm u = 10.600861, Norm u_t = 22.294830, Norm v = 35.509350, Norm x = 0.000000, Norm y = 0.000000, Norm s = 1226.504583, Norm |Ax + s| = 1.23e+03, tau = 10.600861, kappa = 0.000000, |u - u_t| = 2.20e+01, res_infeas = nan, res_unbdd_a = nan, res_unbdd_p = nan, ctx_tau = 0.00e+00, bty_tau = 0.00e+00
linsys/gpu/indirect/private.c:506:scs_solve_lin_sys
ERROR_CUDA (#): invalid argument
tol 1.000e-12
cg_its 0
3| 2.28e+02 2.00e+00 0.00e+00 0.00e+00 1.00e-01 2.07e-03
Norm u = 5.455154, Norm u_t = 25.405974, Norm v = 40.611483, Norm x = 0.000000, Norm y = 0.000000, Norm s = 1472.679019, Norm |Ax + s| = 1.47e+03, tau = 5.455154, kappa = 0.000000, |u - u_t| = 2.53e+01, res_infeas = nan, res_unbdd_a = nan, res_unbdd_p = nan, ctx_tau = 0.00e+00, bty_tau = 0.00e+00
linsys/gpu/indirect/private.c:506:scs_solve_lin_sys
ERROR_CUDA (#): invalid argument
tol 1.000e-12
cg_its 0
4| 5.39e+02 2.00e+00 0.00e+00 0.00e+00 1.00e-01 2.34e-03
Norm u = 2.454521, Norm u_t = 26.247918, Norm v = 41.989207, Norm x = 0.000000, Norm y = 0.000000, Norm s = 1544.434977, Norm |Ax + s| = 1.54e+03, tau = 2.454521, kappa = 0.000000, |u - u_t| = 2.62e+01, res_infeas = nan, res_unbdd_a = nan, res_unbdd_p = nan, ctx_tau = 0.00e+00, bty_tau = 0.00e+00
linsys/gpu/indirect/private.c:506:scs_solve_lin_sys
ERROR_CUDA (#): invalid argument
tol 1.000e-12
cg_its 0
5| 1.26e+03 2.00e+00 0.00e+00 0.00e+00 1.00e-01 2.62e-03
[...]
48| 1.05e+18 2.00e+00 0.00e+00 0.00e+00 1.00e-01 1.60e-02
Norm u = 0.000000, Norm u_t = 26.457513, Norm v = 42.332021, Norm x = 0.000000, Norm y = 0.000000, Norm s = 1569.004030, Norm |Ax + s| = 1.57e+03, tau = 0.000000, kappa = 0.000000, |u - u_t| = 2.65e+01, res_infeas = nan, res_unbdd_a = nan, res_unbdd_p = nan, ctx_tau = 0.00e+00, bty_tau = 0.00e+00
linsys/gpu/indirect/private.c:506:scs_solve_lin_sys
ERROR_CUDA (#): invalid argument
tol 1.000e-12
cg_its 0
49| 5.29e+17 2.00e+00 0.00e+00 0.00e+00 1.00e-01 1.63e-02
Norm u = 0.000000, Norm u_t = 26.457513, Norm v = 42.332021, Norm x = 0.000000, Norm y = 0.000000, Norm s = 1569.004030, Norm |Ax + s| = 1.57e+03, tau = 0.000000, kappa = 0.000000, |u - u_t| = 2.65e+01, res_infeas = nan, res_unbdd_a = nan, res_unbdd_p = nan, ctx_tau = 0.00e+00, bty_tau = 0.00e+00
50| 5.29e+17 2.00e+00 0.00e+00 0.00e+00 1.00e-01 1.63e-02
Norm u = 0.000000, Norm u_t = 26.457513, Norm v = 42.332021, Norm x = 0.000000, Norm y = 0.000000, Norm s = 1569.004030, Norm |Ax + s| = 1.57e+03, tau = 0.000000, kappa = 0.000000, |u - u_t| = 2.65e+01, res_infeas = nan, res_unbdd_a = nan, res_unbdd_p = nan, ctx_tau = 0.00e+00, bty_tau = 0.00e+00
------------------------------------------------------------------
status: solved (inaccurate - reached max_iters)
timings: total: 5.82e-02s = setup: 4.19e-02s + solve: 1.63e-02s
lin-sys: 1.51e-02s, cones: 1.97e-05s, accel: 3.52e-06s
------------------------------------------------------------------
objective = 0.000000 (inaccurate)
------------------------------------------------------------------
**********************************************************
INVALID STATUS
Tests run: 2
Ok, can you try with VERBOSITY=4? That should print out some info on whether pcg is running correctly. The fact that you're seeing cg_its 0 is worrying.
The macro itself has an error check when VERBOSITY>0 (see here), which is why the error goes away when you replace it (although it does suggest that only that line is broken, which is strange).
I just pushed c10b3fe228b42140279add05659afe5883eeccf6. Pull that down and see if it fixes it.
Sorry, false alarm.
Even with VERBOSITY=4 I don't see other output, since cg_gpu_norm(cublas_handle, r, n) < tol is satisfied in https://github.com/cvxgrp/scs/blob/77c86c89bc8d75dce0e8475c364f805fdb62cef0/linsys/gpu/indirect/private.c#L399
If I put the printf statement above I get the old
linsys/gpu/indirect/private.c:16:cg_gpu_norm
ERROR_CUDA (#): invalid argument
I'm not sure how to test that my CUDA/cublas is installed properly?
Can you try setting USE_L2_NORM to 1?
I set it to 1 but I get a similar behavior (though no errors). I also checked that nrm is always 0 in cg_gpu_norm, though &r[1] prints as 1.000000...
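A norm that always comes back as 0 would by itself account for the cg_its 0 lines: the residual check passes before the first iteration, so the initial guess is returned unchanged and the linear system is never solved. Below is a minimal sketch in plain Python, assuming a textbook unpreconditioned conjugate gradient rather than the actual SCS GPU routine; `cg`, `l2_norm` and `broken_norm` are hypothetical names used only for illustration.

```python
import math


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_norm(v):
    return math.sqrt(dot(v, v))


def broken_norm(v):
    # Stand-in for a norm call that fails and leaves its result at 0.
    return 0.0


def cg(A, b, x0, tol, max_iters, norm):
    """Textbook CG for a symmetric positive definite A. Returns (x, iterations)."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    if norm(r) < tol:
        return x, 0  # early exit: the starting point is treated as converged
    p = list(r)
    rs_old = dot(r, r)
    for i in range(1, max_iters + 1):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if norm(r) < tol:
            return x, i
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iters


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x0 = [0.0, 0.0]

x_good, its_good = cg(A, b, x0, 1e-12, 50, l2_norm)
x_bad, its_bad = cg(A, b, x0, 1e-12, 50, broken_norm)
print("working norm:", x_good, "iterations:", its_good)
print("broken norm: ", x_bad, "iterations:", its_bad)
```

With the working norm the solver reaches the exact solution (1/11, 7/11); with the broken one it reports zero iterations and hands back the starting point, so every outer iteration would proceed with an unsolved linear system.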
This is so strange, I don't understand what's happening here at all and I can't reproduce this behavior on my gpu machine. If you really want to get to the bottom of this then I'm happy to get on a call and we can debug together manually on your machine.
Thanks! I asked for access to an NVIDIA GPU at my institution; if I can reproduce it there I'll get back to you!
Dear @bodono I managed to get access to a gpu-enabled node and run some tests there;
- a simple make test_gpu which results in
~/local/scs$ ldd ./out/run_tests_gpu_indirect
linux-vdso.so.1 (0x00007fff935d2000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbb17291000)
liblapack.so.3 => /usr/lib/x86_64-linux-gnu/liblapack.so.3 (0x00007fbb16bed000)
libblas.so.3 => /usr/lib/x86_64-linux-gnu/libblas.so.3 (0x00007fbb16b80000)
libcudart.so.10.1 => /usr/lib/x86_64-linux-gnu/libcudart.so.10.1 (0x00007fbb16904000)
libcublas.so.10 => /usr/lib/x86_64-linux-gnu/libcublas.so.10 (0x00007fbb12b69000)
libcusparse.so.10 => /usr/lib/x86_64-linux-gnu/libcusparse.so.10 (0x00007fbb0b8e0000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbb0b6ee000)
/lib64/ld-linux-x86-64.so.2 (0x00007fbb17459000)
libgfortran.so.5 => /usr/lib/x86_64-linux-gnu/libgfortran.so.5 (0x00007fbb0b426000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fbb0b40b000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbb0b405000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbb0b3e2000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbb0b3d6000)
libcublasLt.so.10 => /usr/lib/x86_64-linux-gnu/libcublasLt.so.10 (0x00007fbb09532000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fbb09350000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007fbb09306000)
runs just fine (11 out of 11 tests passed).
- This works just fine even when I replace the system's CUDA with the one shipped with julia:
~/local/scs$ LD_LIBRARY_PATH="${CUDA_PATH}/lib" ldd out/run_tests_gpu_indirect
linux-vdso.so.1 (0x00007ffd8ec76000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f472fbad000)
liblapack.so.3 => /usr/lib/x86_64-linux-gnu/liblapack.so.3 (0x00007f472f509000)
libblas.so.3 => /usr/lib/x86_64-linux-gnu/libblas.so.3 (0x00007f472f49c000)
libcudart.so.10.1 => /local/data/zz1594/.julia/artifacts/f049c2824a217dc29dbf657e5cdf0f8adafca77a/lib/libcudart.so.10.1 (0x00007f472f220000)
libcublas.so.10 => /local/data/zz1594/.julia/artifacts/f049c2824a217dc29dbf657e5cdf0f8adafca77a/lib/libcublas.so.10 (0x00007f472b47e000)
libcusparse.so.10 => /local/data/zz1594/.julia/artifacts/f049c2824a217dc29dbf657e5cdf0f8adafca77a/lib/libcusparse.so.10 (0x00007f47241f5000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4724003000)
/lib64/ld-linux-x86-64.so.2 (0x00007f472fd75000)
libgfortran.so.5 => /usr/lib/x86_64-linux-gnu/libgfortran.so.5 (0x00007f4723d3b000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f4723d20000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4723d1a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4723cf7000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f4723ceb000)
libcublasLt.so.10 => /local/data/zz1594/.julia/artifacts/f049c2824a217dc29dbf657e5cdf0f8adafca77a/lib/libcublasLt.so.10 (0x00007f4721e47000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f4721c65000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f4721c1b000)
- however if I try to link against julia provided OpenBLAS with
BLASLDFLAGS="-L${JULIA_BLAS_PATH} -lopenblas64_"
make purge
make -j4 $SCSFLAGS BLASSUFFIX="_64_" BLAS64=1 DLONG=0 BLASLDFLAGS="${BLASLDFLAGS}" test_gpu
which results in
LD_LIBRARY_PATH="${JULIA_BLAS_PATH}" ldd out/run_tests_gpu_indirect
linux-vdso.so.1 (0x00007ffd2f1bb000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f0dd6654000)
libopenblas64_.so => /local/data/zz1594/julia-1.7.2/lib/julia/libopenblas64_.so (0x00007f0dd48fc000)
libcudart.so.10.1 => /usr/lib/x86_64-linux-gnu/libcudart.so.10.1 (0x00007f0dd4680000)
libcublas.so.10 => /usr/lib/x86_64-linux-gnu/libcublas.so.10 (0x00007f0dd08e5000)
libcusparse.so.10 => /usr/lib/x86_64-linux-gnu/libcusparse.so.10 (0x00007f0dc965e000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0dc946a000)
/lib64/ld-linux-x86-64.so.2 (0x00007f0dd681c000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f0dc9447000)
libgfortran.so.5 => /local/data/zz1594/julia-1.7.2/lib/julia/libgfortran.so.5 (0x00007f0dc918c000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f0dc9186000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f0dc917c000)
libcublasLt.so.10 => /usr/lib/x86_64-linux-gnu/libcublasLt.so.10 (0x00007f0dc72d8000)
libstdc++.so.6 => /local/data/zz1594/julia-1.7.2/lib/julia/libstdc++.so.6 (0x00007f0dc70c2000)
libgcc_s.so.1 => /local/data/zz1594/julia-1.7.2/lib/julia/libgcc_s.so.1 (0x00007f0dc70a7000)
libquadmath.so.0 => /local/data/zz1594/julia-1.7.2/lib/julia/libquadmath.so.0 (0x00007f0dc705e000)
I get a failure:
*********************************************************
Running test: hs21_tiny_qp
------------------------------------------------------------------
SCS v3.2.1 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 2, constraints m: 4
cones: b: box cone vars: 4
settings: eps_abs: 1.0e-06, eps_rel: 1.0e-06, eps_infeas: 1.0e-09
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 100000, normalize: 1, rho_x: 1.00e-06
acceleration_lookback: 10, acceleration_interval: 10
lin-sys: sparse-indirect GPU
nnz(A): 4, nnz(P): 2
------------------------------------------------------------------
iter | pri res | dua res | gap | obj | scale | time (s)
------------------------------------------------------------------
0| 9.61e-01 1.17e-01 1.96e-01 9.80e-02 1.00e-01 4.95e-04
25| 4.08e-04 4.78e-02 1.14e-01 6.94e-18 1.00e-01 4.21e-03
------------------------------------------------------------------
status: infeasible
timings: total: 4.22e-03s = setup: 4.24e-04s + solve: 3.79e-03s
lin-sys: 3.70e-03s, cones: 3.82e-06s, accel: 1.08e-06s
------------------------------------------------------------------
objective = inf
------------------------------------------------------------------
primal obj error inf
dual obj error inf
hs21_tiny_qp: SCS failed to produce outputflag SCS_SOLVED
Tests run: 6
- similarly built run_tests_[in]direct pass all tests just fine
Hmmm, if the blas you're using is 64 bit it might be tricky to get everything to work with a GPU which (usually) expects 32 bit integers.
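To illustrate the kind of mismatch being suspected here: if one side of an interface writes an index array with 64-bit integers and the other side reads the same memory as 32-bit integers, every second value turns into a spurious zero on a little-endian machine. This is a general sketch using Python's `struct` module, not SCS or cuSPARSE code, and the array `col_ptrs` is made up for the example.

```python
import struct


def reinterpret_int64_as_int32(values):
    """Pack values as little-endian int64 and read the same bytes back as int32."""
    raw = struct.pack("<%dq" % len(values), *values)
    return list(struct.unpack("<%di" % (len(raw) // 4), raw))


# Hypothetical column-pointer array of a small sparse matrix.
col_ptrs = [0, 2, 5, 9]
seen_by_32bit_reader = reinterpret_int64_as_int32(col_ptrs)
print(seen_by_32bit_reader)  # [0, 0, 2, 0, 5, 0, 9, 0]
```

A reader that expects four 32-bit entries would pick up `[0, 0, 2, 0]`, which is no longer a valid, non-decreasing column-pointer array. Whether this is what happens in a given build depends on how the integer widths are configured on both sides.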
hmm, precisely the same problem happens if I compile with
BLASLDFLAGS="-L${JULIA_BLAS_PATH} -lopenblas"
SCSFLAGS="USE_OPENMP=0 BLAS32=1 DLONG=0"
make purge
CUDA_PATH="${CUDA_PATH}" make -j4 $SCSFLAGS BLASLDFLAGS="${BLASLDFLAGS}" test_gpu
here is a gist from build, tests and ldd. https://gist.github.com/kalmarek/0bb320b84871351bff1bb796e516c4a7
OpenBLAS is the LP64 version (integers are ints)
Looks like the tests are passing except for hs21, which is probably just because the numerics are slightly different on the GPU and it's producing a bad flag.
@bodono could you have a look at this problem: https://cloud.impan.pl/s/MX5oBX0lHb5LJl2
It's the same problem that you obtain through this code:
let T = SCS.GpuIndirectSolver
A = [
1.0 1.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0 1.0
0.0 0.0 1.0 1.0 1.0
-1.0 0.0 0.0 0.0 0.0
0.0 -1.0 0.0 0.0 0.0
0.0 0.0 -1.0 0.0 0.0
0.0 0.0 0.0 -1.0 0.0
0.0 0.0 0.0 0.0 -1.0
]
m, n = Int32.(size(A))
args = (
m = m,
n = n,
A = A,
P = zeros(n, n),
b = [5.0, 3.0, 9.0, 0.0, 0.0, 0.0, 0.0, 0.0],
c = -[3.0, 4.0, 4.0, 9.0, 5.0],
z = 0,
l = 8,
bu = Float64[],
bl = Float64[],
q = Int32[],
s = Int32[],
ep = 0,
ed = 0,
p = Float64[],
)
solution = SCS.scs_solve(T, args..., max_iters=200, write_data_filename="simple_problem.scs")
@test isapprox(solution.x' * args.c, -99.0; rtol = 1e-4)
end
This is easily solvable by the (In)Direct solvers but fails with our julia bindings to the GPU solver.
Maybe by inspecting it by hand (it's a binary which I have no idea how to digest) we can learn what goes wrong?
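For reference, the expected optimum of -99 can be confirmed without any solver. The sketch below takes the problem data from the Julia snippet, written in the form minimize c'x subject to Ax + s = b, s >= 0, and checks a primal point `x` and a dual point `y` that were worked out by hand for this note (they are not from the thread). If `x` is feasible, `y` is nonnegative with A'y + c = 0, and c'x equals -b'y, then both are optimal by LP duality.

```python
A = [
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
    [-1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, -1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, -1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, -1.0],
]
b = [5.0, 3.0, 9.0, 0.0, 0.0, 0.0, 0.0, 0.0]
c = [-3.0, -4.0, -4.0, -9.0, -5.0]

# Hand-derived candidates (assumptions of this sketch, not solver output).
x = [2.0, 3.0, 0.0, 9.0, 0.0]
y = [3.0, 1.0, 9.0, 0.0, 0.0, 5.0, 0.0, 5.0]


def dot(u, v):
    return sum(a * b_ for a, b_ in zip(u, v))


# Primal slack s = b - A x must lie in the nonnegative cone.
s = [bi - dot(row, x) for row, bi in zip(A, b)]
# Dual feasibility residual A' y + c must vanish.
dual_residual = [dot([A[i][j] for i in range(len(A))], y) + c[j] for j in range(len(c))]

primal_obj = dot(c, x)
dual_obj = -dot(b, y)

print("slack s:          ", s)
print("A'y + c:          ", dual_residual)
print("primal objective: ", primal_obj)
print("dual objective:   ", dual_obj)
```

Both objectives come out as -99.0 and all feasibility checks pass, so any solver backend that is working correctly should report an objective close to -99 on this problem.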
this is what I get here:
writing data to simple_problem.scs
------------------------------------------------------------------
SCS v3.2.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 5, constraints m: 8
cones: l: linear vars: 8
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 200, normalize: 1, rho_x: 1.00e-06
acceleration_lookback: 10, acceleration_interval: 10
lin-sys: sparse-indirect GPU
nnz(A): 12, nnz(P): 0
------------------------------------------------------------------
iter | pri res | dua res | gap | obj | scale | time (s)
------------------------------------------------------------------
0| 1.26e+02 3.95e+00 1.22e+03 -6.94e+02 1.00e-01 7.87e-04
Warning: tol = -1.000000 <= 0, likely compiled without setting INDIRECT flag.
[...]
Warning: tol = -1.000000 <= 0, likely compiled without setting INDIRECT flag.
200| nan nan -nan -nan 1.00e-01 8.29e-01
------------------------------------------------------------------
status: unbounded (inaccurate - reached max_iters)
timings: total: 8.81e-01s = setup: 5.27e-02s + solve: 8.29e-01s
lin-sys: 8.26e-01s, cones: 2.52e-05s, accel: 6.92e-04s
------------------------------------------------------------------
objective = -inf (inaccurate)
------------------------------------------------------------------
Did you compile with the INDIRECT flag?
this is the script I use to compile scs
script = raw"""
cd $WORKSPACE/srcdir/scs*
flags="DLONG=0 BLAS32=1 USE_OPENMP=0 INDIRECT=1"
blasldflags="-L${libdir} -lopenblas"
CUDA_PATH=$prefix/cuda make BLASLDFLAGS="${blasldflags}" ${flags} out/libscsgpuindir.${dlext}
mkdir -p ${libdir}
cp out/libscs*.${dlext} ${libdir}
"""
DINDIRECT=1 results in the same log
The error message "Warning: tol = -1.000000 <= 0, likely compiled without setting INDIRECT flag." should only appear if the INDIRECT flag is not set during compilation.
When the INDIRECT flag is set SCS does the additional computation to generate a good warm-start and a sensible tolerance for the indirect system:
https://github.com/cvxgrp/scs/blob/f2da64d314d86a97ebb8e957f215f27f9e2a7b79/src/scs.c#L366
Otherwise the tolerance is set to -1.0, which is an invalid tolerance: https://github.com/cvxgrp/scs/blob/f2da64d314d86a97ebb8e957f215f27f9e2a7b79/src/scs.c#L361
And that trips a warning from the indirect system solvers (should probably error out): https://github.com/cvxgrp/scs/blob/8ca03771f0cc7c25697b3e21d28788a2f8ce0fc6/linsys/gpu/indirect/private.c#L474
When that flag is not set SCS skips that computation for speed.
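The "should probably error out" idea could look roughly like the sketch below. It is an illustration only, not the code in linsys/gpu/indirect/private.c: demo_check_tol and the DEMO_* return codes are hypothetical names, and the sketch assumes the tolerance arrives as a double.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical return codes for this sketch (not SCS's own). */
enum { DEMO_OK = 0, DEMO_BAD_TOL = -1 };

/* Hypothetical guard: refuse a non-positive tolerance instead of only
 * printing a warning and iterating anyway. */
static int demo_check_tol(double tol) {
  if (!(tol > 0.0)) { /* written this way so that NaN is rejected too */
    fprintf(stderr, "Error: tol = %f is not positive, aborting solve.\n", tol);
    return DEMO_BAD_TOL;
  }
  return DEMO_OK;
}
```

A caller would check the return value before starting the iterative solve, so the sentinel value -1.0 stops the run immediately rather than producing a warning on every iteration.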
Hmmm, actually this is likely something to do with the GPU solver specifically. There is some issue in there that only trips on some GPUs, and I have run into it before. It's probably something to do with type sizes, but I have not been able to figure it out. I would recommend shelving the GPU solver for now; the MKL one is typically faster anyway.
Try the following patch. I got all the tests to pass with this fix.
--- a/linsys/gpu/gpu.c
+++ b/linsys/gpu/gpu.c
@@ -19,13 +19,13 @@ void SCS(accum_by_atrans_gpu)(const ScsGpuMatrix *Ag,
if (*buffer != SCS_NULL) {
cudaFree(*buffer);
}
- cudaMalloc(buffer, *buffer_size);
+ cudaMalloc(buffer, new_buffer_size);
*buffer_size = new_buffer_size;
}
CUSPARSE_GEN(SpMV)
(cusparse_handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &onef, Ag->descr, x,
- &onef, y, SCS_CUDA_FLOAT, SCS_CSRMV_ALG, buffer);
+ &onef, y, SCS_CUDA_FLOAT, SCS_CSRMV_ALG, *buffer);
}
/* this is slow, use trans routine if possible */
@@ -48,13 +48,13 @@ void SCS(accum_by_a_gpu)(const ScsGpuMatrix *Ag, const cusparseDnVecDescr_t x,
if (*buffer != SCS_NULL) {
cudaFree(*buffer);
}
- cudaMalloc(buffer, *buffer_size);
+ cudaMalloc(buffer, new_buffer_size);
*buffer_size = new_buffer_size;
}
CUSPARSE_GEN(SpMV)
(cusparse_handle, CUSPARSE_OPERATION_TRANSPOSE, &onef, Ag->descr, x, &onef, y,
- SCS_CUDA_FLOAT, SCS_CSRMV_ALG, buffer);
+ SCS_CUDA_FLOAT, SCS_CSRMV_ALG, *buffer);
}
/* This assumes that P has been made full (ie not triangular) and uses the
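The patch fixes two separate things: the workspace was allocated with the stale *buffer_size instead of new_buffer_size, and the SpMV call received buffer (a void **) where the workspace pointer *buffer was needed. A host-side sketch of the same grow-on-demand pattern, with malloc/free standing in for cudaMalloc/cudaFree, is shown below. All demo_* names are hypothetical, and the "grow only when the request exceeds the current size" condition is an assumption, since that line is outside the diff context.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Host-side analogue of the workspace handling in the patch.
 * Returns 0 on success and -1 if the allocation fails. */
static int demo_ensure_buffer(void **buffer, size_t *buffer_size,
                              size_t new_buffer_size) {
  if (new_buffer_size > *buffer_size) { /* assumed growth condition */
    free(*buffer);                      /* free(NULL) is a no-op */
    *buffer = malloc(new_buffer_size);  /* the NEW size, not the stale one */
    if (*buffer == NULL) {
      *buffer_size = 0;
      return -1;
    }
    *buffer_size = new_buffer_size;     /* record the size only afterwards */
  }
  return 0;
}

/* Stand-in for the routine that consumes the workspace. It takes the
 * buffer itself (void *), so a caller holding a void ** passes *buffer. */
static void demo_use_workspace(void *workspace, size_t nbytes) {
  memset(workspace, 0, nbytes);
}
```

With the pre-patch ordering, the allocation would use whatever size was recorded before the update, so the workspace could be smaller than the routine needs; passing the address of the pointer instead of the pointer makes the routine write over unrelated memory.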
@syockit Thanks for this! I applied the patch and it worked! Do you want to turn this into a PR?
The only problem I had was an erroneous 'infeasible' certificate on the hs21_tiny_qp and hs21_tiny_qp_rw tests. Do you get that too? I was able to get them to pass by tightening the eps_infeas tolerance in those files, so if you have that problem too we can just do that.
@bodono It's a hassle for me to set up a fork right now, so please apply the commit on your side.
You're right, I got the same infeasible certificate on the tests you mentioned. I missed that yesterday. And tightening eps_infeas did make it feasible.
Sure, no problem @syockit, thanks for sending in the patch!
- the issue mentioned in https://github.com/cvxgrp/scs/issues/180#issuecomment-1301895062 seems to be solved by #251
- I cannot reproduce the original issue anymore (probably solved by #246).
I presume this issue can be closed after #251 is merged.