
Massive RAM for moderate problem

cossio opened this issue 6 years ago • 36 comments

I have an example of a moderately sized problem where problem formulation consumes massive amounts of memory. I have no idea why this happens. Is there something inherently difficult about this kind of problem? Or is this a bug in Convex.jl?

Note that this is not an issue with any solver. The RAM spike occurs at the problem formulation step, the line problem = maximize(...) below.

To run this code, you need to load the variables N, Adj from the .jld2 file (using JLD2.jl) here:

https://github.com/cossio/cvx_example_data

N is a 3-dimensional Array of floats, and Adj is a sparse matrix.

WARNING: The following code took so much memory that I had to reboot my laptop. Run this in a contained environment where you can constrain memory consumption. (For example, using https://github.com/pshved/timeout, start julia with timeout -m 4000000 julia to limit memory usage to 4GB.)

x = Variable(2730)
S, V, T = size(N)
lN = log.(N)
M = vec(sum(N[:,:,2:T]; dims=(2,3)))
H = Adj * x
problem = maximize(-sum(logsumexp(lN[:,v,t] - H) for t = 1:T-1, v = 1:V) - dot(M, H), [x ≥ -1e2, x ≤ 1e2])
solve!(problem, ECOSSolver())

UPDATE: I'm now getting the RAM spike at the solve phase, instead of the formulation phase. Both ECOS and SCS have the same problem.

cossio avatar Dec 14 '18 13:12 cossio

@ararslan I'm now seeing the memory spike in the solve phase instead of the formulation phase. I'll close this unless something similar comes up again. I'll open an issue at ECOS or SCS.

cossio avatar Dec 14 '18 23:12 cossio

Okay, thanks for reporting back. Did you see the spike only with a particular solver? If so, it would be good to know which.

ararslan avatar Dec 14 '18 23:12 ararslan

@ararslan I just tested with ECOS and SCS; both consume > 4 GB of RAM (at which point I cancelled to avoid having to reboot my laptop again).

cossio avatar Dec 14 '18 23:12 cossio

If it's both of them, that suggests to me that the issue may still be Convex's fault. Have you by chance done any profiling using the Profile stdlib package?

ararslan avatar Dec 14 '18 23:12 ararslan
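
A minimal sketch of that profiling suggestion (hypothetical; assumes the problem and ECOSSolver from the original post are already defined in the session):

using Profile

@profile solve!(problem, ECOSSolver())  # collect stack samples while solving
Profile.print(maxdepth = 12)            # print a truncated call tree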

@ararslan The problem is that my laptop dies before I can complete the profiling. Any suggestions?

cossio avatar Dec 14 '18 23:12 cossio

Oh, that makes sense. Hm, I don't know what to suggest offhand... I wonder if the work could be put into a Task, and then after a certain amount of time an interrupt could be sent to the task. Not sure if that would work though.

ararslan avatar Dec 15 '18 00:12 ararslan
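
A rough sketch of that Task idea (hypothetical; note that an interrupt may never be delivered if the task is blocked inside the solver's C code):

t = @task solve!(problem, ECOSSolver())
schedule(t)                  # start the work
sleep(30)                    # let it run for a while
# if it hasn't finished, raise an InterruptException inside the task
istaskdone(t) || schedule(t, InterruptException(); error = true)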

I just tried limiting RAM to 10 GB on a server and it also crashes, with both SCS and ECOS. I don't know how to do the Task thing, and I'm really out of ideas on how to diagnose what's going on.

In case it's really an issue with Convex I'll reopen this. Maybe someone else will think of something. But feel free to close it if you think that's best.

cossio avatar Dec 15 '18 00:12 cossio

@ararslan I started julia with a memory constraint of 10 GB using:

timeout -m 10000000 julia --track-allocation=user

This tracked memory allocations across all of Convex.jl as the RAM usage increased, until it was killed at 10GB. The .mem files are at https://github.com/cossio/cvx_example_data/blob/master/convex_allocations.tar.gz.

I ran this example with SCS. The .mem files of SCS.jl are at https://github.com/cossio/cvx_example_data/blob/master/SCS_allocation_analysis.tar.gz

Let me know if that helps, or if I can do anything else.

BTW, is there a utility to explore a directory tree like this with many .mem files to find the highest allocs?

cossio avatar Dec 15 '18 01:12 cossio

BTW, is there a utility to explore a directory tree like this with many .mem files to find the highest allocs?

Coverage can do this, actually.

julia> using Coverage

julia> mem = analyze_malloc(".");

julia> mem[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
 Coverage.MallocInfo(9785470, "./variable.jl.mem", 22)
 Coverage.MallocInfo(10224350, "./constant.jl.mem", 45)
 Coverage.MallocInfo(14402371, "./constant.jl.mem", 54)
 Coverage.MallocInfo(15131666, "./atoms/affine/add_subtract.jl.mem", 124)
 Coverage.MallocInfo(22798259, "./atoms/affine/add_subtract.jl.mem", 87)
 Coverage.MallocInfo(23872808, "./utilities/show.jl.mem", 20)
 Coverage.MallocInfo(31561657, "./constant.jl.mem", 22)
 Coverage.MallocInfo(47125096, "./expressions.jl.mem", 65)
 Coverage.MallocInfo(120645411, "./conic_form.jl.mem", 112)
 Coverage.MallocInfo(145535180, "./expressions.jl.mem", 106)
 Coverage.MallocInfo(916530186, "./solution.jl.mem", 12)

Looks like the source of the allocations in solution.jl is:

        - function solve!(problem::Problem,
        -                 s::MathProgBase.AbstractMathProgSolver;
        -                 kwargs...)
        -     # TODO: warn about wiping out old model if warmstart=true?
916530186     problem.model = MathProgBase.ConicModel(s)
        0     return solve!(problem; kwargs...)
        - end

ararslan avatar Dec 15 '18 04:12 ararslan

@ararslan Thanks for the suggestion, Coverage helps.

I inspected the .mem files for MathProgBase (https://github.com/cossio/cvx_example_data/blob/master/mpb_allocation.tar.gz) and they show no allocations. Then I realized the solvers implement their own versions of ConicModel(solver).

In ECOS.jl there is this line that stands out. Seems like this is an issue with the solvers after all.

        - function ver()
100391444     ver_ptr = ccall((:ECOS_ver, ECOS.ecos), Ptr{UInt8}, ())
        -     return unsafe_string(ver_ptr)
        - end

It is interesting though; this function just returns the ECOS version?

cossio avatar Dec 17 '18 14:12 cossio

Hm, okay, that's bizarre. You said you encountered the same problem with SCS? Do you have memory allocations for that?

ararslan avatar Dec 17 '18 20:12 ararslan

@ararslan Yes, https://github.com/cossio/cvx_example_data/blob/master/scs_allocation.tar.gz.

julia> mem[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
 Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 387)      
 Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 388)      
 Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 389)      
 Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 397)      
 Coverage.MallocInfo(0, "./SCS.jl.mem", 22)              
 Coverage.MallocInfo(0, "./SCS.jl.mem", 23)              
 Coverage.MallocInfo(1248, "./MPBWrapper.jl.mem", 36)    
 Coverage.MallocInfo(1440, "./SCS.jl.mem", 21)           
 Coverage.MallocInfo(129136, "./MPBWrapper.jl.mem", 31)  
 Coverage.MallocInfo(4402990, "./SCS.jl.mem", 20)        
 Coverage.MallocInfo(79146051, "./c_wrapper.jl.mem", 124)

Again a version call:

        - function SCS_version()
 79146051     return unsafe_string(ccall((:scs_version, SCS.direct), Cstring, ()))
        - end

Does julia --track-allocation count the allocations made by ccalls?

cossio avatar Dec 17 '18 21:12 cossio

I'm not sure, actually, though it would appear so. Are these version checks actually called somewhere during the solving process?

ararslan avatar Dec 17 '18 21:12 ararslan

@ararslan I think these solvers print a version banner before starting the iterations. But I'm not sure if that's what we are seeing.

cossio avatar Dec 17 '18 22:12 cossio

If I'm not mistaken, printing banners is done by the C code itself rather than by the Julia wrappers, so I don't think that would show up in the allocations Julia tracks for the version-querying Julia functions.

ararslan avatar Dec 17 '18 22:12 ararslan

If you're trying to isolate Convex.jl's processing from the solver, you can call conic_problem to generate the input without solving, e.g. https://github.com/JuliaOpt/ConicBenchmarkUtilities.jl/blob/4267bfedeace99fcc4a03256eef8d44151f38b95/src/convex_to_cbf.jl#L4.

mlubin avatar Dec 18 '18 01:12 mlubin

Thanks @mlubin! I'm seeing the allocations again, just from the call to conic_problem. So we are back to this not being an issue with the solvers. Here is what I got.

timeout.pl -m 10000000 julia --track-allocation=user
julia> include("cvx_issue.jl")
julia> con = Convex.conic_problem(problem)   # huge memory alloc

where cvx_issue.jl is this file:

using Convex, FileIO, JLD2, ECOS, SparseArrays, LinearAlgebra
@load "cvx_example.jld2"
x = Variable(2730)
S, V, T = size(N)
lN = log.(N)
M = vec(sum(N[:,:,2:T]; dims=(2,3)))
H = Adj * x
problem = maximize(-sum(logsumexp(lN[:,v,t] - H) for t = 1:T-1, v = 1:V) - dot(M, H), [x ≥ -1e2, x ≤ 1e2])

and cvx_example.jld2 is here: https://github.com/cossio/cvx_example_data.

Inspecting the resulting .mem files from all my packages (using Coverage), gives this:

julia> analyze_malloc(".")[end-20:end]
21-element Array{Coverage.MallocInfo,1}:
 Coverage.MallocInfo(13571105, "./JLD2/KjBIK/src/groups.jl.mem", 92)                      
 Coverage.MallocInfo(15278217, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 124)
 Coverage.MallocInfo(15407963, "./Convex/MLj0c/src/constant.jl.mem", 54)                  
 Coverage.MallocInfo(15721748, "./Convex/MLj0c/src/constant.jl.mem", 22)                  
 Coverage.MallocInfo(15917995, "./JLD2/KjBIK/src/JLD2.jl.mem", 334)                       
 Coverage.MallocInfo(19283630, "./JLD2/KjBIK/src/loadsave.jl.mem", 2)                     
 Coverage.MallocInfo(20910662, "./JLD2/KjBIK/src/JLD2.jl.mem", 352)                       
 Coverage.MallocInfo(22778265, "./JLD2/KjBIK/src/data.jl.mem", 70)                        
 Coverage.MallocInfo(24140413, "./Convex/MLj0c/src/constraints/constraints.jl.mem", 143)  
 Coverage.MallocInfo(26433226, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 87) 
 Coverage.MallocInfo(43264136, "./Compat/HVYNa/src/Compat.jl.mem", 310)                   
 Coverage.MallocInfo(46994104, "./Convex/MLj0c/src/expressions.jl.mem", 65)               
 Coverage.MallocInfo(61731328, "./JLD2/KjBIK/src/data.jl.mem", 252)                       
 Coverage.MallocInfo(79002811, "./ECOS/e0Lr6/src/ECOS.jl.mem", 27)                        
 Coverage.MallocInfo(88962353, "./JLD2/KjBIK/src/JLD2.jl.mem", 399)                       
 Coverage.MallocInfo(95562654, "./JLD2/KjBIK/src/loadsave.jl.mem", 57)                    
 Coverage.MallocInfo(107061059, "./JLD2/KjBIK/src/JLD2.jl.mem", 288)                      
 Coverage.MallocInfo(133353998, "./Convex/MLj0c/src/conic_form.jl.mem", 112)              
 Coverage.MallocInfo(151016207, "./Convex/MLj0c/src/expressions.jl.mem", 106)             
 Coverage.MallocInfo(301982888, "./Convex/MLj0c/src/expressions.jl.mem", 120)             
 Coverage.MallocInfo(351788459, "./JLD2/KjBIK/src/data.jl.mem", 7)                        

I ignore JLD2 since that's presumably just from loading the data. Here are the most interesting lines from expressions.jl:

        - ### User-defined Unions
        - const Value = Union{Number, AbstractArray}
        - const ValueOrNothing = Union{Value, Nothing}
        - const AbstractExprOrValue = Union{AbstractExpr, Value}
        -
151016207 convert(::Type{AbstractExpr}, x::Value) = Constant(x)
        - convert(::Type{AbstractExpr}, x::AbstractExpr) = x
        -
        - function size(x::AbstractExpr, dim::Integer)
        -     if dim < 1
        -         error("dimension out of range")
        -     elseif dim > 2
        -         return 1
        -     else
        -         return size(x)[dim]
        -     end
        - end
        -
        - ndims(x::AbstractExpr) = 2
301982888 get_vectorized_size(x::AbstractExpr) = reduce(*, size(x))
        - lastindex(x::AbstractExpr) = get_vectorized_size(x)

and from conic_form.jl

        - function has_conic_form(conic_forms::UniqueConicForms, exp::AbstractExpr)
133353998     return haskey(conic_forms.exp_map, (exp.head, exp.id_hash))
        - end

I'm not familiar enough with this code to figure out what the issue might be. @ararslan any ideas?

cossio avatar Dec 18 '18 15:12 cossio

The top allocation sites just from Convex.jl:

julia> analyze_malloc("./Convex/")[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
 Coverage.MallocInfo(9620766, "./Convex/MLj0c/src/variable.jl.mem", 22)                   
 Coverage.MallocInfo(10386046, "./Convex/MLj0c/src/utilities/show.jl.mem", 20)            
 Coverage.MallocInfo(15278217, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 124)
 Coverage.MallocInfo(15407963, "./Convex/MLj0c/src/constant.jl.mem", 54)                  
 Coverage.MallocInfo(15721748, "./Convex/MLj0c/src/constant.jl.mem", 22)                  
 Coverage.MallocInfo(24140413, "./Convex/MLj0c/src/constraints/constraints.jl.mem", 143)  
 Coverage.MallocInfo(26433226, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 87) 
 Coverage.MallocInfo(46994104, "./Convex/MLj0c/src/expressions.jl.mem", 65)               
 Coverage.MallocInfo(133353998, "./Convex/MLj0c/src/conic_form.jl.mem", 112)              
 Coverage.MallocInfo(151016207, "./Convex/MLj0c/src/expressions.jl.mem", 106)             
 Coverage.MallocInfo(301982888, "./Convex/MLj0c/src/expressions.jl.mem", 120)             

cossio avatar Dec 18 '18 15:12 cossio

So this is interesting:

        - function has_conic_form(conic_forms::UniqueConicForms, exp::AbstractExpr)
133353998     return haskey(conic_forms.exp_map, (exp.head, exp.id_hash))
        - end

Here conic_forms.exp_map is an

OrderedDict{Tuple{Symbol,UInt64},
            OrderedDict{UInt64,Tuple{Union{AbstractArray,Number},
                                     Union{AbstractArray,Number}}}}

I would not be surprised if haskey on this type is horrendously inefficient.

ararslan avatar Dec 18 '18 17:12 ararslan

@ararslan The fact that the value-type of the OrderedDict is not concrete (Union{AbstractArray,Number}) may not be ideal. Can this type be more specific? (just throwing a random idea out there)

cossio avatar Dec 18 '18 22:12 cossio

Unfortunately not without a huge undertaking to restructure the way Convex uses and expresses types. For reference, Value in Convex is actually Union{AbstractArray,Number}. (As an aside, because of this, Convex actually overwrites a lot of Base methods, since it defines methods on Value...)

ararslan avatar Dec 18 '18 22:12 ararslan
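
A tiny illustration (hypothetical, not Convex's actual code) of what a non-concrete value type costs: lookups return a boxed value the compiler cannot specialize on.

d_union    = Dict{Int, Union{AbstractArray,Number}}(1 => 1.0)
d_concrete = Dict{Int, Float64}(1 => 1.0)

v1 = d_union[1]     # inferred as Union{AbstractArray,Number}: boxed
v2 = d_concrete[1]  # inferred as Float64: fully typed downstream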

I would not be surprised if haskey on this type is horrendously inefficient.

I can imagine it being slow but it absolutely should not be allocating much memory. I recommend benchmarking the functions identified here to check if they are actually problems.

Remember this is total memory allocated, so functions that are called a lot will report higher numbers.

These are also bytes allocated, so 301982888 (the highest number) is ~300 MB, which is not massive.

I'm suspicious that compilation memory usage may be in play.

iamed2 avatar Dec 19 '18 21:12 iamed2
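
One way to check the haskey suspicion directly is a micro-benchmark; a hedged sketch using BenchmarkTools, with the same key type Convex uses but a simplified value type:

using BenchmarkTools, DataStructures

# toy exp_map with the key type (Symbol, UInt64) from UniqueConicForms
exp_map = OrderedDict{Tuple{Symbol,UInt64}, Int}()
for i in 1:10_000
    exp_map[(:atom, hash(i))] = i
end

@btime haskey($exp_map, (:atom, hash(1234)))  # expect nanoseconds and ~0 allocations per call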

@iamed2 It seemed strange to me that my system reported more than 10 GB of allocations while the .mem files were reporting lower values. So julia --track-allocation=user misses all the allocations made by the JIT compiler? If that's so, what can I do to measure compiler allocs?

cossio avatar Dec 20 '18 13:12 cossio

I'm saying the opposite; JIT allocations are included so you can't draw conclusions about the code in a function from the allocations number unless you filter them out.

From the docs:

In interpreting the results, there are a few important details. Under the user setting, the first line of any function directly called from the REPL will exhibit allocation due to events that happen in the REPL code itself. More significantly, JIT-compilation also adds to allocation counts, because much of Julia's compiler is written in Julia (and compilation usually requires memory allocation). The recommended procedure is to force compilation by executing all the commands you want to analyze, then call Profile.clear_malloc_data() to reset all allocation counters. Finally, execute the desired commands and quit Julia to trigger the generation of the .mem files.

iamed2 avatar Dec 20 '18 17:12 iamed2
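
In concrete terms, the procedure the docs describe would look roughly like this (hypothetical session, with julia started as --track-allocation=user):

using Profile

include("cvx_issue.jl")        # first run: forces compilation (JIT allocations)
Profile.clear_malloc_data()    # reset all allocation counters
include("cvx_issue.jl")        # second run: only runtime allocations are counted
exit()                         # .mem files are written when Julia exits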

I see, I misunderstood then. But that does not explain why the system reports > 10 GB of allocations.

cossio avatar Dec 20 '18 19:12 cossio

I ran the code with @profile and interrupted it after a bit, then leafed through the results. Anecdotally it's spending a fair amount of time on sparse matrix operations, but I'm not sure that's a primary offender.

ararslan avatar Dec 20 '18 21:12 ararslan

This bug protects itself. It kills your computer before showing itself. I tried Adj = Matrix(Adj) (which should avoid sparse operations), but got the same huge allocations.

cossio avatar Dec 20 '18 21:12 cossio

I tried that as well, but I'll also note that Convex uses sparse matrices in a number of places regardless of whether any particular input is sparse:

$ sift -e sparse src 
src/variable.jl:77:    return sparse(1.0I, vec_size, vec_size)
src/variable.jl:83:        return im*sparse(1.0I, vec_size, vec_size)
src/conic_form.jl:13:# we store the affine functions in this form for efficient manipulation of sparse affine functions
src/atoms/affine/diagm.jl:64:        coeff = sparse(I, J, 1.0, sz * sz, sz)
src/atoms/affine/transpose.jl:63:        transpose_matrix = sparse(I, J, 1.0)
src/atoms/affine/transpose.jl:126:        transpose_matrix = sparse(I, J, 1.0)
src/atoms/affine/partialtrace.jl:47:            a = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:48:            b = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:58:                    a = kron(a, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:59:                    b = kron(b, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:90:            a = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:91:            b = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:101:                    a = kron(a, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:102:                    b = kron(b, sparse(1.0I, dim, dim))
src/atoms/affine/index.jl:68:            index_matrix = sparse(1:sz, J, 1.0, m, n)
src/atoms/affine/index.jl:70:            index_matrix = sparse(1:length(x.inds), x.inds, 1.0, m, n)
src/atoms/affine/multiply_divide.jl:85:            objective = kron(sparse(1.0I, x.size[2], x.size[2]), x.children[1].value) * objective
src/atoms/affine/multiply_divide.jl:89:            objective = kron(x.children[2].value', sparse(1.0I, x.size[1], x.size[1])) * objective
src/atoms/sdp_cone/operatornorm.jl:73:        p = minimize(t, [t*sparse(1.0I, m, m) A; A' t*sparse(1.0I, n, n)] ⪰ 0)

ararslan avatar Dec 20 '18 21:12 ararslan

I have a local branch of Convex that uses FillArrays and LazyArrays in place of SparseArrays where it makes sense to do so. Running the example in this issue, I'm seeing a peak memory usage of about 15%, which is down from 98% or so using SparseArrays alone.

Edit: It's still running after 12 minutes, and the memory usage has increased to 16%.

ararslan avatar Dec 21 '18 21:12 ararslan
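
For a sense of the substitution (a hypothetical illustration, not the actual branch): FillArrays can represent structured matrices like the identity without materializing any entries.

using FillArrays, SparseArrays, LinearAlgebra

n = 2730
sp_eye   = sparse(1.0I, n, n)   # SparseMatrixCSC: allocates n stored entries plus index vectors
lazy_eye = Eye(n)               # FillArrays identity: O(1) storage

v = randn(n)
sp_eye * v ≈ lazy_eye * v       # same result; the lazy version skips the construction cost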

Every time I've interrupted the computation, it seems to be in the middle of this: https://github.com/JuliaOpt/Convex.jl/blob/master/src/conic_form.jl#L62-L71. Note in particular the comment that says it's time consuming.

ararslan avatar Dec 21 '18 21:12 ararslan