MPI support
Is it possible to support MPI? SCS uses OpenMP for parallelism, but for large-scale problems we need MPI support to run on HPC clusters. Currently SCS runs on a single node and can solve SDPs with the indirect linear system solver for a 4000x4000 variable matrix (roughly 16,000,000 variables and 32,000,000 constraints). If I want to solve larger problems, around a 16000x16000 variable matrix, MPI becomes necessary.
afaik OpenMP is only used in two pragmas (the exp cone projection and the row-wise accumulation in y += A'*x). At these sizes you'll probably run into numerical issues in the eigendecomposition; MPI is not going to change that.
Do your problems have any additional structure (symmetry, sparsity, algebraic structure)? You'll be much better off exploiting that than trying to brute-force a large problem.
This is the code for the problem.
import numpy as np
import cvxpy as cp

# Load the covariance matrix and build the symplectic form Omega.
cov = np.load("cov.npy")
omega = np.array([[0, 1], [-1, 0]])
M = len(cov) // 2
Omega = np.kron(omega, np.eye(M))
n = 2 * M

# Minimize trace(X) subject to [[X, Omega], [-Omega, X]] >> 0 and cov - X >> 0.
X = cp.Variable((n, n), symmetric=True)
constraints = [cp.bmat([[X, Omega], [-Omega, X]]) >> 0]  # or: constraints = [X - 1j * Omega >> 0]
constraints += [cov - X >> 0]
prob = cp.Problem(cp.Minimize(cp.trace(X)), constraints)
prob.solve(solver='SCS', use_indirect=True, verbose=True)
I can compute a reasonably good analytical solution for this problem, but I don't know how to use it to speed up the optimization. Although SCS supports warm starts, knowing only an initial value for X seems insufficient to use one.
I've never used SCS through Python, but I recall that for a warm start you need to provide both the primal and dual variables.
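For reference, here is a minimal sketch of those warm-start mechanics through the SCS Python interface, assuming the SCS 3.x class-based API (scs.SCS plus solve(warm_start=...)). The toy linear-cone problem is only a stand-in so the call is self-contained; the same keywords apply with a PSD cone (cone={"s": [n]}):
import numpy as np
import scipy.sparse as sp
import scs

# Toy problem in SCS conic form: minimize c'x  s.t.  Ax + s = b, s >= 0.
A = sp.csc_matrix(-np.eye(2))
b = np.array([-1.0, -2.0])
c = np.array([1.0, 1.0])
cone = {"l": 2}  # a 2-dimensional nonnegative cone

solver = scs.SCS({"A": A, "b": b, "c": c}, cone, verbose=False)

# Warm start from a primal/dual/slack guess (here: the known optimum x = (1, 2)).
x0 = np.array([1.0, 2.0])
y0 = np.array([1.0, 1.0])
s0 = np.zeros(2)
sol = solver.solve(warm_start=True, x=x0, y=y0, s=s0)
print(sol["info"]["status"], sol["x"])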
what is in cov.npy? does it have any structure (sparse, symmetric, chordal, etc)?
cov.npy is a positive semidefinite matrix and satisfies the constraint that cov - 1j*Omega is positive semidefinite. I also have a Julia version, but it seems that its parallel efficiency is worse than the Python version's.
using LinearAlgebra
using JuMP
using SCS
using NPZ  # cov was undefined in the original snippet; NPZ.jl can read the .npy file

# Load the covariance matrix and build the symplectic form Omega.
cov = npzread("cov.npy")
omega = [0 1; -1 0]
M = div(size(cov)[1], 2)
Omega = kron(omega, I(M))
n = 2 * M

# Same SDP as the Python version, using SCS's indirect linear solver.
model = Model(SCS.Optimizer)
set_attribute(model, "linear_solver", SCS.IndirectSolver)
@variable(model, X[1:n, 1:n], PSD)
@constraint(model, [X Omega; -Omega X] in PSDCone())
# @constraint(model, LinearAlgebra.Hermitian(X - Omega * 1im) in HermitianPSDCone())
@constraint(model, cov - X in PSDCone())
@objective(model, Min, tr(X))
optimize!(model)
yes, this I understand, but does cov.npy have any additional structure? :D
But the best speedup will be achieved by just not solving such a large optimization problem...
I don't think cov.npy has any additional structure. It is the covariance matrix of a Gaussian state.
You can warm-start just some of the variables, I think; SCS will set the others to zero. But SCS expects the variables of the conic reformulation that CVXPY performs under the hood, so you would need access to those. If you have one of the conic variables as a warm start, it should be possible to derive reasonable (better than zero) warm starts for the others too, using the KKT equations.
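If it helps, CVXPY can expose the conic data it builds for SCS, which is the form a warm start has to be expressed in. A rough sketch, reusing prob from the snippet above (the exact dictionary keys may differ between CVXPY versions):
import cvxpy as cp

# Conic problem data (c, A, b, cone dimensions) that CVXPY would hand to SCS.
# A warm start has to be given in these variables, not in the matrix variable X.
data, chain, inverse_data = prob.get_problem_data(cp.SCS)
c, A, b = data["c"], data["A"], data["b"]
dims = data["dims"]  # zero / nonnegative / PSD cone sizes
print(A.shape, dims)
Mapping an analytic X to the corresponding x/y/s still requires applying the scaled vectorization SCS uses for PSD cones and the KKT conditions, which is the nontrivial part.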
As for MPI, as @kalmarek says, we only have OpenMP pragmas over a couple of minor routines, and your problem will likely be dominated by the eigendecomposition of the semidefinite cone variable. That being said, we do support MKL as the BLAS/LAPACK backend, and that definitely uses multiple threads for the eigendecomposition; it may even support MPI for clusters. That's probably your best bet.
Oh also, the MKL Pardiso direct solver is much faster than the indirect solver; I would recommend using the direct solver if you can.
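For what it's worth, from CVXPY that is just flipping the flag already used above; whether the direct solver is MKL Pardiso or the bundled sparse LDL factorization depends on how the scs package was built:
# Same problem as above, but requesting SCS's direct linear solver.
prob.solve(solver='SCS', use_indirect=False, verbose=True)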
Thanks for your reply. I've tried the MKL Pardiso direct solver; the time it took is similar to the indirect solver. Maybe I need to try other solvers, or just use the analytical solution.
I am very surprised that the MKL Pardiso solver takes about the same time as the indirect solver! It might be worth playing with the MKL settings (more threads or whatever) or the SCS solver settings to see if there is anything to squeeze out.
G.cov is a (992, 992) PSD matrix. The indirect solver is faster, and the CPU usage of both solvers is ~90%.
Can you post the entire output trace (or the very end)? I would be interested in seeing the total times / iterations for each.
Are you controlling the number of threads? I think you can do that with the environment variable OMP_NUM_THREADS. It might be worth tinkering with that a bit.
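For example, a quick way to pin the thread counts from Python before the solver libraries load (setting the variables in the shell before launching Python works just as well; the count 16 is only an example):
import os

# Must be set before the OpenMP / MKL runtimes initialize,
# i.e. before importing numpy / cvxpy / scs in this process.
os.environ["OMP_NUM_THREADS"] = "16"  # threads for OpenMP regions
os.environ["MKL_NUM_THREADS"] = "16"  # threads for MKL BLAS/LAPACK/Pardiso

import cvxpy as cp  # import after the environment is configured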
It seems that the Pardiso preprocessing step uses only one thread and takes a lot of time.
Yes, it's doing a factorization; I didn't realize that was done single-threaded, which seems like a waste! That probably explains the slowdown for MKL then. In principle we could write an MKL indirect solver, which would be faster, but that would take a little bit of work.