Convex.jl
Massive RAM for moderate problem
I have an example of a moderately sized problem where problem formulation consumes massive amounts of memory, and I have no idea why. Is there something inherently difficult about this kind of problem, or is this a bug in Convex.jl?
Note that this is not an issue with any solver. The RAM spike occurs at the problem formulation step, the line problem = maximize(...) below.
To run this code, you need to load the variables N, Adj from the .jld2 file (using JLD2.jl) here:
https://github.com/cossio/cvx_example_data
N is a 3-dimensional Array of floats, and Adj is a sparse matrix.
WARNING: The following code took so much memory that I had to reboot my laptop. Run this in a contained environment where you can constrain memory consumption. (For example, using https://github.com/pshved/timeout, start julia with timeout -m 4000000 julia to limit memory usage to 4GB.)
x = Variable(2730)
S, V, T = size(N)
lN = log.(N)
M = vec(sum(N[:,:,2:T]; dims=(2,3)))
H = Adj * x
problem = maximize(-sum(logsumexp(lN[:,v,t] - H) for t = 1:T-1, v = 1:V) - dot(M, H), [x ≥ -1e2, x ≤ 1e2])
solve!(problem, ECOSSolver())
UPDATE: I'm now getting the RAM spike at the solve phase, instead of the formulation phase. Both ECOS and SCS have the same problem.
@ararslan I'm now seeing the memory spike in the solve phase instead of the formulation. I'll close this unless something similar comes up again. I'll open an issue at ECOS or SCS.
Okay, thanks for reporting back. Did you see the spike only with a particular solver? If so, it would be good to know which.
@ararslan I just tested with ECOS and SCS, both consume > 4GB RAM (at which point I cancelled to avoid having to reboot my laptop again)
If it's both of them, that suggests to me that the issue may still be Convex's fault. Have you by chance done any profiling using the Profile stdlib package?
@ararslan The problem is that my laptop dies before I can complete the profiling. Any suggestions?
Oh that makes sense. Hm, I don't know what to suggest offhand... I wonder if maybe the work could be put into a Task then after a certain amount of time, send an interrupt to the task. Not sure if that would work though.
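A minimal sketch of that Task idea (assuming Julia 1.x; `workload` below is a stand-in for the real formulation call). Note that `schedule(t, exc; error=true)` only delivers the exception at a yield point, so a tight non-yielding compute loop may not actually stop:

```julia
# Sketch of the Task-plus-interrupt idea. The exception is only delivered
# when the task yields, so purely compute-bound code may not be interruptible.
function run_with_timeout(f, seconds)
    t = @async f()
    timer = Timer(seconds) do _
        istaskdone(t) || schedule(t, InterruptException(); error=true)
    end
    try
        return fetch(t)
    catch err
        return err              # the task was interrupted (or failed)
    finally
        close(timer)
    end
end

run_with_timeout(() -> 42, 5)   # returns 42 well before the timeout
```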
I just tried limiting to 10 GB of RAM on a server and it also crashes. Both SCS and ECOS.
I don't know how to do the Task thing, and I'm really out of ideas on how to diagnose what's going on.
In case it's really an issue with Convex, I'll reopen this. Maybe someone else will think of something. But feel free to close if you think that's appropriate.
@ararslan I started julia with a memory constraint of 10 GB using:
timeout -m 10000000 julia --track-allocation=user
This tracked memory allocations across all of Convex.jl as the RAM usage increased, until it was killed at 10GB. The .mem files are at https://github.com/cossio/cvx_example_data/blob/master/convex_allocations.tar.gz.
I ran this example with SCS. The .mem files of SCS.jl are at https://github.com/cossio/cvx_example_data/blob/master/SCS_allocation_analysis.tar.gz
Let me know if that helps, or if I can do anything else.
BTW, is there a utility to explore a directory tree like this with many .mem files to find the highest allocs?
Coverage can do this, actually.
julia> using Coverage
julia> mem = analyze_malloc(".");
julia> mem[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
Coverage.MallocInfo(9785470, "./variable.jl.mem", 22)
Coverage.MallocInfo(10224350, "./constant.jl.mem", 45)
Coverage.MallocInfo(14402371, "./constant.jl.mem", 54)
Coverage.MallocInfo(15131666, "./atoms/affine/add_subtract.jl.mem", 124)
Coverage.MallocInfo(22798259, "./atoms/affine/add_subtract.jl.mem", 87)
Coverage.MallocInfo(23872808, "./utilities/show.jl.mem", 20)
Coverage.MallocInfo(31561657, "./constant.jl.mem", 22)
Coverage.MallocInfo(47125096, "./expressions.jl.mem", 65)
Coverage.MallocInfo(120645411, "./conic_form.jl.mem", 112)
Coverage.MallocInfo(145535180, "./expressions.jl.mem", 106)
Coverage.MallocInfo(916530186, "./solution.jl.mem", 12)
Looks like the source of the allocations in solution.jl is:
- function solve!(problem::Problem,
- s::MathProgBase.AbstractMathProgSolver;
- kwargs...)
- # TODO: warn about wiping out old model if warmstart=true?
916530186 problem.model = MathProgBase.ConicModel(s)
0 return solve!(problem; kwargs...)
- end
@ararslan Thanks for the suggestion, Coverage helps.
I inspected the .mem files for MathProgBase (https://github.com/cossio/cvx_example_data/blob/master/mpb_allocation.tar.gz) and they show no allocations. Then I realized the solvers implement their own versions of ConicModel(solver).
In ECOS.jl there is this line that stands out. Seems like this is an issue with the solvers after all.
- function ver()
100391444 ver_ptr = ccall((:ECOS_ver, ECOS.ecos), Ptr{UInt8}, ())
- return unsafe_string(ver_ptr)
- end
It is interesting though; isn't this function just supposed to return the ECOS version?
Hm, okay, that's bizarre. You said you encountered the same problem with SCS? Do you have memory allocations for that?
@ararslan Yes, https://github.com/cossio/cvx_example_data/blob/master/scs_allocation.tar.gz.
julia> mem[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 387)
Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 388)
Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 389)
Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 397)
Coverage.MallocInfo(0, "./SCS.jl.mem", 22)
Coverage.MallocInfo(0, "./SCS.jl.mem", 23)
Coverage.MallocInfo(1248, "./MPBWrapper.jl.mem", 36)
Coverage.MallocInfo(1440, "./SCS.jl.mem", 21)
Coverage.MallocInfo(129136, "./MPBWrapper.jl.mem", 31)
Coverage.MallocInfo(4402990, "./SCS.jl.mem", 20)
Coverage.MallocInfo(79146051, "./c_wrapper.jl.mem", 124)
Again a version call:
- function SCS_version()
79146051 return unsafe_string(ccall((:scs_version, SCS.direct), Cstring, ()))
- end
Does julia --track-allocation count the allocations made by ccalls?
I'm not sure, actually, though it would appear so. Are these version checks actually called somewhere during the solving process?
@ararslan I think these solvers print a version banner before starting the iterations. But I'm not sure if that's what we are seeing.
If I'm not mistaken, the banners are printed by the C code itself rather than by the Julia wrappers, so I don't think banner printing would show up in the allocations Julia attributes to the version-querying Julia functions.
If you're trying to isolate Convex.jl's processing from the solver, you can call conic_problem to generate the input without solving, e.g. https://github.com/JuliaOpt/ConicBenchmarkUtilities.jl/blob/4267bfedeace99fcc4a03256eef8d44151f38b95/src/convex_to_cbf.jl#L4.
Thanks @mlubin! I'm seeing the allocations again, just from the call to conic_problem. So we are back to this being an issue with Convex.jl rather than with the solvers. Here is what I got.
timeout.pl -m 10000000 julia --track-allocation=user
julia> include("cvx_issue.jl")
julia> con = Convex.conic_problem(problem) # huge memory alloc
where cvx_issue.jl is this file:
using Convex, FileIO, JLD2, ECOS, SparseArrays, LinearAlgebra
@load "cvx_example.jld2"
x = Variable(2730)
S, V, T = size(N)
lN = log.(N)
M = vec(sum(N[:,:,2:T]; dims=(2,3)))
H = Adj * x
problem = maximize(-sum(logsumexp(lN[:,v,t] - H) for t = 1:T-1, v = 1:V) - dot(M, H), [x ≥ -1e2, x ≤ 1e2])
and cvx_example.jld2 is here: https://github.com/cossio/cvx_example_data.
Inspecting the resulting .mem files from all my packages (using Coverage) gives this:
julia> analyze_malloc(".")[end-20:end]
21-element Array{Coverage.MallocInfo,1}:
Coverage.MallocInfo(13571105, "./JLD2/KjBIK/src/groups.jl.mem", 92)
Coverage.MallocInfo(15278217, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 124)
Coverage.MallocInfo(15407963, "./Convex/MLj0c/src/constant.jl.mem", 54)
Coverage.MallocInfo(15721748, "./Convex/MLj0c/src/constant.jl.mem", 22)
Coverage.MallocInfo(15917995, "./JLD2/KjBIK/src/JLD2.jl.mem", 334)
Coverage.MallocInfo(19283630, "./JLD2/KjBIK/src/loadsave.jl.mem", 2)
Coverage.MallocInfo(20910662, "./JLD2/KjBIK/src/JLD2.jl.mem", 352)
Coverage.MallocInfo(22778265, "./JLD2/KjBIK/src/data.jl.mem", 70)
Coverage.MallocInfo(24140413, "./Convex/MLj0c/src/constraints/constraints.jl.mem", 143)
Coverage.MallocInfo(26433226, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 87)
Coverage.MallocInfo(43264136, "./Compat/HVYNa/src/Compat.jl.mem", 310)
Coverage.MallocInfo(46994104, "./Convex/MLj0c/src/expressions.jl.mem", 65)
Coverage.MallocInfo(61731328, "./JLD2/KjBIK/src/data.jl.mem", 252)
Coverage.MallocInfo(79002811, "./ECOS/e0Lr6/src/ECOS.jl.mem", 27)
Coverage.MallocInfo(88962353, "./JLD2/KjBIK/src/JLD2.jl.mem", 399)
Coverage.MallocInfo(95562654, "./JLD2/KjBIK/src/loadsave.jl.mem", 57)
Coverage.MallocInfo(107061059, "./JLD2/KjBIK/src/JLD2.jl.mem", 288)
Coverage.MallocInfo(133353998, "./Convex/MLj0c/src/conic_form.jl.mem", 112)
Coverage.MallocInfo(151016207, "./Convex/MLj0c/src/expressions.jl.mem", 106)
Coverage.MallocInfo(301982888, "./Convex/MLj0c/src/expressions.jl.mem", 120)
Coverage.MallocInfo(351788459, "./JLD2/KjBIK/src/data.jl.mem", 7)
I ignore JLD2 since that's just loading the data (I presume). Here are the most interesting lines from expressions.jl:
- ### User-defined Unions
- const Value = Union{Number, AbstractArray}
- const ValueOrNothing = Union{Value, Nothing}
- const AbstractExprOrValue = Union{AbstractExpr, Value}
-
151016207 convert(::Type{AbstractExpr}, x::Value) = Constant(x)
- convert(::Type{AbstractExpr}, x::AbstractExpr) = x
-
- function size(x::AbstractExpr, dim::Integer)
- if dim < 1
- error("dimension out of range")
- elseif dim > 2
- return 1
- else
- return size(x)[dim]
- end
- end
-
- ndims(x::AbstractExpr) = 2
301982888 get_vectorized_size(x::AbstractExpr) = reduce(*, size(x))
- lastindex(x::AbstractExpr) = get_vectorized_size(x)
and from conic_form.jl
- function has_conic_form(conic_forms::UniqueConicForms, exp::AbstractExpr)
133353998 return haskey(conic_forms.exp_map, (exp.head, exp.id_hash))
- end
I'm not familiar enough with this code to figure out what the issue might be. @ararslan any ideas?
The top allocations from Convex.jl alone:
julia> analyze_malloc("./Convex/")[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
Coverage.MallocInfo(9620766, "./Convex/MLj0c/src/variable.jl.mem", 22)
Coverage.MallocInfo(10386046, "./Convex/MLj0c/src/utilities/show.jl.mem", 20)
Coverage.MallocInfo(15278217, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 124)
Coverage.MallocInfo(15407963, "./Convex/MLj0c/src/constant.jl.mem", 54)
Coverage.MallocInfo(15721748, "./Convex/MLj0c/src/constant.jl.mem", 22)
Coverage.MallocInfo(24140413, "./Convex/MLj0c/src/constraints/constraints.jl.mem", 143)
Coverage.MallocInfo(26433226, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 87)
Coverage.MallocInfo(46994104, "./Convex/MLj0c/src/expressions.jl.mem", 65)
Coverage.MallocInfo(133353998, "./Convex/MLj0c/src/conic_form.jl.mem", 112)
Coverage.MallocInfo(151016207, "./Convex/MLj0c/src/expressions.jl.mem", 106)
Coverage.MallocInfo(301982888, "./Convex/MLj0c/src/expressions.jl.mem", 120)
So this is interesting:
- function has_conic_form(conic_forms::UniqueConicForms, exp::AbstractExpr)
133353998 return haskey(conic_forms.exp_map, (exp.head, exp.id_hash))
- end
Here conic_forms.exp_map is an
OrderedDict{Tuple{Symbol,UInt64},
OrderedDict{UInt64,Tuple{Union{AbstractArray,Number},
Union{AbstractArray,Number}}}}
I would not be surprised if haskey on this type is horrendously inefficient.
@ararslan The fact that the value type of the OrderedDict is not concrete (Union{AbstractArray,Number}) may not be ideal. Can this type be made more specific? (Just throwing a random idea out there.)
Unfortunately not without a huge undertaking to restructure the way Convex uses and expresses types. For reference, Value in Convex is actually Union{AbstractArray,Number}. (As an aside, because of this, Convex actually overwrites a lot of Base methods, since it defines methods on Value...)
> I would not be surprised if haskey on this type is horrendously inefficient.
I can imagine it being slow, but it absolutely should not be allocating much memory. I recommend benchmarking the functions identified here to check whether they are actually problems.
Remember this is total memory allocated, so functions that are called a lot will report higher numbers.
These are also bytes allocated, so 301982888 (the highest number) is ~300 MB, which is not massive.
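A quick unit check on that figure (the numbers in the .mem output are bytes):

```julia
# The largest single line from the .mem output, converted to MB/MiB.
bytes = 301_982_888
mb  = bytes / 10^6            # ≈ 302 decimal megabytes
mib = bytes / 2^20            # ≈ 288 MiB
```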
I'm suspicious that compilation memory usage may be in play.
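A rough stand-in benchmark for the haskey claim. The `exp_map` below only mimics the key/value shape of Convex's dictionary (tuple keys, abstract value type); it is not the real UniqueConicForms:

```julia
# NOT the real Convex.jl structure, just the same key/value shape.
const MyValue = Union{AbstractArray, Number}

exp_map = Dict{Tuple{Symbol,UInt64}, MyValue}()
for i in 1:1_000
    exp_map[(:atom, UInt64(i))] = rand(3)
end
key = (:atom, UInt64(500))

check(d, k) = @allocated haskey(d, k)
check(exp_map, key)           # warm up so JIT allocation is excluded
bytes = check(exp_map, key)   # expected: ~0 bytes per call
```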
@iamed2 It seemed strange to me that my system reported more than 10 GB of allocations while the .mem files reported lower values. So does julia --track-allocation=user miss all the allocations made by the JIT compiler? If so, what can I do to measure compiler allocations?
I'm saying the opposite: JIT allocations are included, so you can't draw conclusions about the code in a function from its allocation numbers unless you filter them out.
From the docs:
In interpreting the results, there are a few important details. Under the user setting, the first line of any function directly called from the REPL will exhibit allocation due to events that happen in the REPL code itself. More significantly, JIT-compilation also adds to allocation counts, because much of Julia's compiler is written in Julia (and compilation usually requires memory allocation). The recommended procedure is to force compilation by executing all the commands you want to analyze, then call Profile.clear_malloc_data() to reset all allocation counters. Finally, execute the desired commands and quit Julia to trigger the generation of the .mem files.
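Following that procedure, the measurement would look roughly like this. Here `workload()` is a placeholder for the real `conic_problem` call, which needs the `.jld2` data:

```julia
using Profile

# Placeholder for the real workload (e.g. Convex.conic_problem(problem)).
workload() = sum(abs2, rand(10_000))

workload()                      # 1. run once to force JIT compilation
Profile.clear_malloc_data()     # 2. reset all allocation counters
workload()                      # 3. only this run is now attributed
# 4. quit julia (started with --track-allocation=user) to write .mem files
```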
I see, I misunderstood then. But then this does not explain why the system reports > 10 GB of allocations.
I ran the code with @profile and interrupted it after a bit, then leafed through the results. Anecdotally it's spending a fair amount of time on sparse matrix operations, but I'm not sure that's a primary offender.
This bug protects itself: it kills your computer before showing itself. I tried Adj = Matrix(Adj) (which should avoid sparse operations), but got the same huge allocations.
I tried that as well, but I'll also note that Convex uses sparse matrices in a number of places regardless of whether any particular input is sparse:
$ sift -e sparse src
src/variable.jl:77: return sparse(1.0I, vec_size, vec_size)
src/variable.jl:83: return im*sparse(1.0I, vec_size, vec_size)
src/conic_form.jl:13:# we store the affine functions in this form for efficient manipulation of sparse affine functions
src/atoms/affine/diagm.jl:64: coeff = sparse(I, J, 1.0, sz * sz, sz)
src/atoms/affine/transpose.jl:63: transpose_matrix = sparse(I, J, 1.0)
src/atoms/affine/transpose.jl:126: transpose_matrix = sparse(I, J, 1.0)
src/atoms/affine/partialtrace.jl:47: a = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:48: b = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:58: a = kron(a, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:59: b = kron(b, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:90: a = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:91: b = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:101: a = kron(a, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:102: b = kron(b, sparse(1.0I, dim, dim))
src/atoms/affine/index.jl:68: index_matrix = sparse(1:sz, J, 1.0, m, n)
src/atoms/affine/index.jl:70: index_matrix = sparse(1:length(x.inds), x.inds, 1.0, m, n)
src/atoms/affine/multiply_divide.jl:85: objective = kron(sparse(1.0I, x.size[2], x.size[2]), x.children[1].value) * objective
src/atoms/affine/multiply_divide.jl:89: objective = kron(x.children[2].value', sparse(1.0I, x.size[1], x.size[1])) * objective
src/atoms/sdp_cone/operatornorm.jl:73: p = minimize(t, [t*sparse(1.0I, m, m) A; A' t*sparse(1.0I, n, n)] ⪰ 0)
I have a local branch of Convex that uses FillArrays and LazyArrays in place of SparseArrays where it makes sense to do so. Running the example in this issue, I'm seeing a peak memory usage of about 15%, which is down from 98% or so using SparseArrays alone.
Edit: It's still running after 12 minutes, and the memory usage has increased to 16%.
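The kind of saving involved can be illustrated with stdlib types alone; here `UniformScaling` plays the role of a FillArrays-style lazy identity (this is only an illustration, not code from the actual branch):

```julia
using LinearAlgebra, SparseArrays

n = 2730                               # size of x in the example above
S_id = sparse(1.0I, n, n)              # materialized: nzval + rowval + colptr
bytes_sparse = Base.summarysize(S_id)  # tens of kilobytes
bytes_lazy   = Base.summarysize(I)     # UniformScaling: a few bytes
```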
Every time I've interrupted the computation, it seems to be in the middle of this: https://github.com/JuliaOpt/Convex.jl/blob/master/src/conic_form.jl#L62-L71. Note in particular the comment that says it's time consuming.