
Faster canonicalization of affine expressions

Open akshayka opened this issue 7 years ago • 15 comments

CVXPY canonicalizes affine expression trees by constructing a sparse matrix for each linear operator, and recursively multiplying these sparse matrices together (see cvxpy/cvxcore/src/). This is fine for small problems, but it becomes a bottleneck (both time and memory) for problems with many constraints, and also for problems with large variables/data. For example, matrix stuffing in cvxcore is the bottleneck for cvxpy/tests/test_benchmarks.py:TestBenchmarks.test_diffcp_sdp_example.
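A toy sketch of this strategy (illustrative only, not cvxcore's actual code): each linear operator becomes a sparse matrix acting on the vectorized expression, and composing operators means multiplying their coefficient matrices.

```python
import numpy as np
from scipy import sparse

# Illustrative only: each linear op is a sparse matrix acting on vec(x),
# and composing ops amounts to sparse matrix multiplication.
n = 3
I = sparse.identity(n, format="csc")
scale = 2.0 * I                      # coefficient matrix for the expression 2*x
neg = -I                             # coefficient matrix for negation
composed = (neg @ scale).toarray()   # canonicalized coefficients of -(2*x)
```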

Parallelizing the matrix stuffing (across expression trees) is one way to speed things up (#706); however, this can cause OOMs for large problems (e.g., using 12 processes to canonicalize the SDP in the benchmark mentioned above, with n=300, p=100, OOM'd my 2012 MBP).

Another way might be to do something similar to JuMP, which I believe maintains the coefficients for each variable along the way, using elementwise operations to update them as needed. @angeris brought JuMP's approach to my attention. One way to see if this is promising would be to implement the SDP example in JuMP and compare the model construction time to CVXPY.

akshayka avatar Apr 19 '19 20:04 akshayka

Assigning @angeris, but contributions and ideas from others are still welcome.

akshayka avatar Apr 19 '19 21:04 akshayka

I would be careful about taking an elementwise approach to tracking symbolic affine expressions. The main thing that can go wrong is the speed of adding large expressions together. If you have x and y as two scalar expressions, representing linear combinations of 1000 symbolic terms each, then computing the symbolic expression x + y can be expensive. If you then have large matrices containing such expressions, then performing these symbolic scalar additions for every element of the resulting matrix can take a long time.

I implemented such an approach for a separate project I work on, where a symbolic ScalarExpression is an operator-overloaded dictionary that maps a hashable ScalarVariable symbol to a coefficient. The performance was very good for small problems, but it didn't scale well. There were some tricks to cope with this, but I ended up using a different approach for large problems. I don't think my solution would be appropriate for cvxpy, but I'll post the idea nonetheless.
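For concreteness, here is a minimal Python sketch of that dictionary-based design (hypothetical names, not the actual implementation from that project):

```python
class ScalarExpression:
    """Minimal sketch of an operator-overloaded symbolic affine scalar:
    a dict mapping variable symbols to coefficients, plus a constant."""

    def __init__(self, terms=None, const=0.0):
        self.terms = dict(terms or {})
        self.const = const

    def __add__(self, other):
        # Addition merges the coefficient dictionaries -- this is the
        # O(number of terms) per-element cost warned about above.
        out = dict(self.terms)
        for v, c in other.terms.items():
            out[v] = out.get(v, 0.0) + c
        return ScalarExpression(out, self.const + other.const)

    def __mul__(self, scalar):
        return ScalarExpression(
            {v: scalar * c for v, c in self.terms.items()},
            scalar * self.const,
        )
```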

pre-compiled affine operators

Suppose I needed to constrain a symbolic expression M1 @ y1 + M2 @ y2 >= 0, where y1 and y2 are symbolic expressions over variables x. The normal computations here would be ...

  1. Perform a symbolic matrix vector product M1 @ y1; store the result in z1.
  2. Perform a symbolic matrix vector product M2 @ y2; store the result in z2.
  3. Perform a symbolic vector-vector addition, z1 + z2; store the result in z3.
  4. From the symbolic expression z3, generate A and b so that z3 >= 0 can be represented as A @ x + b >= 0.

A pre-compiled approach would be to define a single function that takes in arguments M1, y1, M2, y2 and returns the data A, b for the desired affine expression, without performing a single symbolic matrix-vector product.
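A hedged sketch of what such a pre-compiled operator could look like, assuming each y_i is stored as a coefficient pair (A_i, b_i) with y_i = A_i @ x + b_i (names hypothetical):

```python
import numpy as np

def precompiled_affine(M1, A1, b1, M2, A2, b2):
    """Hypothetical pre-compiled operator for M1 @ y1 + M2 @ y2, where
    y_i = A_i @ x + b_i. Returns (A, b) with the result equal to A @ x + b,
    computed numerically in one shot, with no intermediate symbolic
    matrix-vector products or symbolic additions."""
    A = M1 @ A1 + M2 @ A2
    b = M1 @ b1 + M2 @ b2
    return A, b
```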

Pre-compiled affine operators have the following upsides

  • They can be orders of magnitude faster than methods which require intervening symbolic computation.
  • They are simple to write (and often must be written when building models in languages like C++).
  • They don't need to be used when prototyping small models. You only need to bother with them when performance is critical.

Pre-compiled affine operations have the following downsides

  • Small changes to the form of a given affine operator may require a separate "pre-compiled" implementation.
  • The output of a pre-compiled affine operator cannot be used for subsequent symbolic computation.

rileyjmurray avatar Apr 20 '19 22:04 rileyjmurray

Unrelated to my previous post:

When cvxcore multiplies the sparse matrices together, does it run the standard dynamic programming algorithm to find the optimal order of multiplication?

https://en.wikipedia.org/wiki/Matrix_chain_multiplication

That algorithm could also be modified to account for the sparsity factor of a given matrix (i.e. a 10,000-by-10,000 matrix with 100 entries is "smaller" than a dense 500-by-500 matrix).
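For reference, the standard O(n^3) dynamic program for the dense case looks roughly like this (a sketch; it does not account for sparsity):

```python
def matrix_chain_cost(dims):
    """dims[i], dims[i+1] give the shape of matrix i in the chain.
    Returns the minimal number of scalar multiplications over all
    parenthesizations (standard matrix-chain DP)."""
    n = len(dims) - 1
    # cost[i][j] = cheapest way to multiply matrices i..j (inclusive)
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

A sparsity-aware variant could replace the `dims[i] * dims[k+1] * dims[j+1]` term with an estimate based on nonzero counts.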

rileyjmurray avatar Apr 20 '19 22:04 rileyjmurray

Some quick notes on why this might be a good idea (and, potentially, a terrible one)

The initial problem came up when attempting to optimize a (relatively small) problem with a large number of reshapes and concatenations. JuMP seemed to heavily outperform CVXPY on a similar problem, and it was somewhat unclear why.

When profiling this, @akshayka found a few things: (a) is_constant and many similar methods were being recomputed, even though most atoms are essentially immutable (fixed by 9b59d045ef8) and (b) that a huge chunk of time was being spent by cvxcore in generating the correct affine transformations. This is particularly surprising, since most of the affine expressions present in the program are mostly concerned with stacking and concatenating matrices and vectors (note that the size of the final program was moderate: 4000 variables with ~4000 very sparse SOC constraints, generated essentially by a tridiagonal matrix). For more information, take a peek at this profile dump of the code.

Another somewhat large chunk of time seems to be spent converting and checking the validity of sparse matrices (e.g., calls to coo.py:_check, from SciPy) and similar things. This suggests, at least to me, that there are a large number of unnecessary conversions between formats, or that too many operations are being performed on these matrices, even though most of them have a very small number of nonzero entries.

So, this generates some ideas, which might be interesting to try and implement (not all of them independent of each other):

  1. Full elementwise construction of build_matrix either during the construction of the parse tree or after.
  2. Elementwise construction of all (and only) the sub-indexing or stacking expressions, etc. More specifically, the types of operations that generate submatrices of the identity or Kronecker products thereof, leaving the rest for cvxcanon.
  3. Working on a DP-type approach as @rileyjmurray suggested in the second post. This part is fairly unclear to me, as (unlike in the dense case) it seems that different sparsity patterns will yield very different matrix-build times. I'm also not sure if this is being done in the dense case? @SteveDiamond ?

Number 1 is essentially JuMP's approach. For example, naïvely translating test_benchmarks.py:TestBenchmarks.test_diffcp_sdp_example to Julia, as in this gist, yields times somewhat similar to the current CVXPY implementation. Of course, the CVXPY tr function could be a bit smarter about computing the correct result, but that's another point.

There are a few questions, though: (a) Python isn't Julia in terms of element-wise speed, of course, so we may not be able to get away with this, unlike JuMP. But the question remains: what is the performance impact for most problems? This is sort-of addressed by suggestion 2, which attempts a "best-of-both-worlds" approach, but then (b) what should the interface look like between cvxcore and cvxpy? Almost all of these matrix-building operations are currently being sent off to cvxcore (as far as I understand and from what I've seen), but now we have some transformations done within cvxpy and others within cvxcore. Unsure of what the implications in terms of structure are, but with the decreased communication cost and likely reduced number of operations, it's worth thinking about.

Suggestion 3 strikes me as somewhat sensible, since it's quite possible that a simple heuristic will do quite well in the sparse case, but it's not clear to me what such a heuristic would look like (the suggestion above may be good, but I really don't have much intuition about what the results might be). I also know that @SteveDiamond is currently working on cvxcore, so I'll leave this for him to think about :)
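As a small illustration of suggestion 2: stacking expressions correspond to coefficient matrices built from identity blocks, which can be written down directly rather than derived by symbolic multiplication (a hypothetical sketch, not cvxpy code):

```python
import numpy as np
from scipy import sparse

# Hypothetical sketch: the coefficient matrix of vstack([x, y]) for
# x in R^m and y in R^n is just a block arrangement of identity matrices
# and zeros, so it can be constructed directly.
m, n = 2, 3
A = sparse.vstack([
    sparse.hstack([sparse.identity(m), sparse.csc_matrix((m, n))]),
    sparse.hstack([sparse.csc_matrix((n, m)), sparse.identity(n)]),
]).tocsc()
```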

Anyways, I likely forgot a few things on this topic, so I may come back and edit this comment later, but this is where we are at the moment.

Additionally, apologies for the partial response, @rileyjmurray : I haven't thought enough about your first suggestion to fully appreciate it, though it's likely that @akshayka and @SteveDiamond have some thoughts?

angeris avatar Apr 21 '19 01:04 angeris

@rileyjmurray, we don't find an optimal order for matrix multiplications. This does sound like a good idea. Using the number of nonzeros in a sparse matrix as a heuristic might work? In any case it would be easy to experiment with this, and with various heuristics. This, plus modifying cvxcore to process constraints in parallel, should yield pretty good speed-ups for problems with many constraints. As @angeris mentioned, cvxcore is currently being rewritten to extract an affine map from parameters to problem data, so we should probably wait until that rewrite is complete before spending too much time on performance optimization.

Options (1) and (2) would be interesting to try out, but I think we'd need substantial evidence that elementwise construction is a good idea before investing the time needed to actually implement it robustly.

akshayka avatar Apr 24 '19 19:04 akshayka

@akshayka I agree that we should wait on modifications to the matrix multiplications until cvxcore settles down. As a start I think implementing the basic matrix-chain-multiplication algorithm is likely to improve performance. The sparse case will be harder, but I actually think it would make an excellent research project. It's simple enough to state that an undergrad (or team of undergrads) could work on it, and I bet there are applications beyond cvxcore-type computations that would benefit from an understanding of this problem. Do you or @SteveDiamond know of any undergrads in Boyd's lab that might want to take on this project? If not I can ask around Caltech (the problem is that I'll be away from Caltech this summer, so I couldn't supervise someone here).

rileyjmurray avatar Apr 24 '19 19:04 rileyjmurray

The sparse case will be harder, but I actually think it would make an excellent research project. It's simple enough to state that an undergrad (or team of undergrads) could work on it, and I bet there are applications beyond cvxcore-type computations that would benefit from an understanding of this problem.

That does sound like a fun project! We'll ask around and see if anyone's interested. If we can't find anyone, we'll let you know.

akshayka avatar Apr 24 '19 19:04 akshayka

@akshayka did you end up finding someone to work on the sparse matrix-chain multiplication problem?

rileyjmurray avatar Nov 03 '19 18:11 rileyjmurray

Nope, unfortunately.

akshayka avatar Nov 03 '19 19:11 akshayka

I'm interested in what I think is the same problem in Convex.jl. I have been exploring rewriting some of the internals (ref https://github.com/jump-dev/Convex.jl/pull/393), and have come to the problem of composing many affine transformations together. I also came across the idea of solving the matrix chain problem, but the issue for me is that the transformations are generally not just a matrix multiplication but also a vector addition (i.e. a sequence of x -> A*x + b). Multiplying those out gives a combinatorial explosion of terms, but going left-to-right or right-to-left one can accumulate into a matrix and a vector result, i.e. something like

for (new_matrix, new_vector) in transformations
    vec = vec + matrix * new_vector
    matrix = matrix * new_matrix
end
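In Python terms, that accumulation might look like this (a sketch; transformations listed outermost-first, names hypothetical):

```python
import numpy as np

def compose_left_to_right(transformations):
    """Fold a sequence of affine maps x -> A @ x + b into a single (A, b).
    transformations is listed outermost-first, so each step computes
    A_acc @ (A_new @ x + b_new) + b_acc."""
    A_acc, b_acc = transformations[0]
    for A_new, b_new in transformations[1:]:
        b_acc = b_acc + A_acc @ b_new
        A_acc = A_acc @ A_new
    return A_acc, b_acc
```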

I was wondering how CVXPY avoids having the vector addition to deal with in order to consider matrix chain multiplication; I tried looking at the source but unfortunately I am almost illiterate in C++ and python.

P.S. My solution for now is to always go left-to-right, because for the objective function, the result will be 1-dimensional, so you are essentially performing a sequence of matvecs instead of matmuls, which is probably optimal. For constraints, of course, the output could have a higher dimension, so this might not be the right choice in that case.

edit: on second thought, presumably the matrix-multiplication part dominates and a usual matrix-chain multiplication algorithm would suffice as a heuristic to choose an ordering of composing affine operations!

ericphanson avatar Aug 02 '20 13:08 ericphanson

Hi @ericphanson cvxpy uses matrix multiplications of the form

[A b][x]
[0 1][1]

to represent affine transformations. Not sure if that was your question.
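In other words, a sketch of the homogeneous-coordinate trick (not cvxpy's actual code):

```python
import numpy as np

def homogeneous(A, b):
    """Embed the affine map x -> A @ x + b as one matrix acting on [x; 1]."""
    m, n = A.shape
    H = np.zeros((m + 1, n + 1))
    H[:m, :n] = A
    H[:m, n] = b
    H[m, n] = 1.0
    return H

# Composition of affine maps is now plain matrix multiplication:
# homogeneous(A2, b2) @ homogeneous(A1, b1) represents
# x -> A2 @ (A1 @ x + b1) + b2, so matrix-chain techniques apply directly.
```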

SteveDiamond avatar Aug 02 '20 16:08 SteveDiamond

Yep it was! Thanks, I might try that for Convex.jl too then.

ericphanson avatar Aug 02 '20 16:08 ericphanson

@sbarratt , since you expressed interest in this problem.

akshayka avatar Oct 15 '20 15:10 akshayka

Compiling problems that are (very close to) standard forms should be fast.

https://github.com/cvxpy/cvxpy/issues/1521

akshayka avatar Nov 04 '21 16:11 akshayka

I am eager to attempt an implementation of the matrix-chain-multiplication problem, as I occasionally experience the performance implications of this issue myself, and I am keen to expand my understanding of scientific computing in general.

There are a few aspects that remain unclear to me:

  1. I noticed that @phschiele has shifted this issue from "Help Wanted" to "In Progress", which I presume is connected to cvxpy 1.3.0's new Scipy canonicalization backend. If that is the case, could you please explain the relationship between this new backend and the issue at hand?
  2. Considering this new backend, on which areas should one concentrate to achieve the most significant performance improvements?
  3. Would you be able to recommend any academic literature that might assist in addressing this issue? My familiarity with sparse matrices is limited, and I lack a strong intuition for the performance characteristics of handling sparse matrices.

I appreciate your assistance!

mvanaltvorst avatar Apr 03 '23 13:04 mvanaltvorst

Hi @mvanaltvorst, the newer backends no longer rely (purely) on matrix multiplication chains, so this approach is probably not the first optimization we would try to speed up the canonicalization further. Closing this issue as we will likely open new ones discussing methods to improve the new backend implementations.

phschiele avatar Apr 27 '24 23:04 phschiele