
memory is not being released aggressively enough during parallel processing

CodeLenz opened this issue on Mar 11, 2016 · 32 comments

I have a large codebase that runs without any problems in 0.4.0, but shows symptoms of a memory leak when running in 0.4.3.

Basically, the memory usage increases after a call to @sync @parallel for. It seems that some of the memory allocated for the parallel processing is not freed. A call to gc() at the end of the loop does not fix the problem.

The figure below shows the memory allocation for the same script using 0.4.0 (first half of the graphic) and 0.4.3, where it can be seen that the memory increases after each call to @sync @parallel for.

[figure: memory usage over time for 0.4.0 (first half) vs 0.4.3 (second half)]

This is the first time I file an issue, so if I am not using the proper words/methodology, I apologize.

CodeLenz avatar Mar 11 '16 23:03 CodeLenz

Can you provide a self-contained runnable example that reproduces the problem?

tkelman avatar Mar 12 '16 03:03 tkelman

Hi.

Sorry, but my first concern was whether this behavior had already been reported by someone else.

Actually, I think it is something related to the way the GC is dealing with the variables. As you can see in the picture, the memory is steadily increasing, leading to the undesirable "hang" associated with it. When using a smaller system of equations, we observed that the memory does indeed increase, but after some iterations it drops again (I saw this after filing the issue).

We were not aware of it when using 0.4.0, since the GC must have been cleaning up the memory after the parallel processing. 0.5-dev behaves like 0.4.3.

I am trying to make a simple prototype, since there are a lot of calls and intermediate allocations inside the parallel loop. The base algorithm is something along the lines of:

const PARALLEL = true
const NPROC    = 3

if PARALLEL
    npp = nprocs()
    if npp == 1
        println("\n Adding.... ")
        addprocs(NPROC)
    end
    # Let's avoid openblas using our cores....
    blas_set_num_threads(1)
end

function MAIN()
    # Number of "external" iterations
    NE = 10

    # Main dimension
    N = 5000

    # Loop
    @inbounds for e = 1:NE

        println("\n Iteration ", e)

        # Random R.H.S. vectors (6 cols)
        F = rand(N, 6)

        # Coefficient matrix
        K1 = sprand(N, N, 0.2)

        # Solve each one of the R.H.S. vectors, using parallel processes..

        # Diagonal scaling
        d   = diag(K1)
        P   = sqrt(d)
        Pi  = 1 ./ P
        AUX = spdiagm(Pi)
        K   = AUX*K1*AUX
        U   = zeros(N, 6)
        if PARALLEL
            U = @sync @parallel (hcat) for load = 1:6
                V    = Pi.*F[:,load]
                resp = K*V
                Pi.*resp
            end
        else
            for load = 1:6
                V    = Pi.*F[:,load]
                resp = K*V
                U[:,load] = Pi.*resp
            end
        end

        # wait a little bit...
        sleep(5)

    end #e

end

# Let's go...
MAIN()

CodeLenz avatar Mar 12 '16 19:03 CodeLenz

Not too long ago I had a memory leak somewhere in a similar situation. It got fixed in a major rewrite, so I wasn't worried enough about it to open an issue. It was also on 0.4.3. I think I still have the code for it, but I haven't had the time to reduce it into something small.

For posterity, in case someone runs into a similar problem and finds this issue: periodic @everywhere gc(true) worked for me as a stopgap during the few days of having to deal with the leak. Nothing else helped. Try it; it might work for you.
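A minimal sketch of that stopgap (0.4-era syntax; NBATCHES and process_batch are hypothetical placeholders, not from my actual code):

# Force a full collection on the master and on every worker after
# each round of parallel work.
for batch = 1:NBATCHES
    results = pmap(process_batch, 1:6)   # any parallel workload
    @everywhere gc(true)                 # full collection everywhere
end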

kswietli avatar Mar 14 '16 10:03 kswietli

Hi.

I think that a memory leak is not the case, since after some iterations the memory is reduced. But, indeed, the behaviour with 0.4.3 is different when compared to 0.4.0, so it may be a regression in the way gc() deals with memory allocation and parallel processing.

The following code reproduces the problem and is simpler than the previous code.

# Dimension - depends on available RAM
const N = 5000

# Number of iterations
const NI = 10

# Time to wait inside the loop body
const tw = 5

# Number of procs - depends on the available number of cores
const NP = 3

# If needed, add the worker processes
if nprocs() == 1
    addprocs(NP)
end

# Main loop...
for i = 1:NI

    println("\n Iteration ", i)

    # Generate a random sparse matrix
    K = sprand(N, N, 0.5)

    # Allocate a 6-column array
    F = rand(N, 6)

    U = @sync @parallel (hcat) for j = 1:6

        # To make the visualization easier
        sleep(tw)

        # In my real code there are a lot of calls here, with many
        # local allocations... this is why the graphic has larger hills
        # when compared to this example.
        K*F[:,j]

    end #@parallel

end # i

CodeLenz avatar Mar 14 '16 11:03 CodeLenz

I just tested with the above code on

Julia Version 0.4.4-pre+43
Commit 147d6d0* (2016-03-07 00:34 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)

After all iterations finished, I required 2 executions of @everywhere gc() to release all memory. There is no leak. This is the same issue as https://github.com/JuliaLang/Distributed.jl/issues/35: the generational gc is holding onto objects a bit longer.
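For reference, the release pattern was simply (the second pass collects objects that survived the first, generational pass):

@everywhere gc()
@everywhere gc()   # second pass frees survivors promoted by the first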

@carnaval, @yuyichao: if we could specify the maximum heap size (like Java's -Xmx startup argument), users could better control how much a Julia process is allowed to grow.

amitmurthy avatar Mar 14 '16 12:03 amitmurthy

Changed title to reflect the same.

amitmurthy avatar Mar 14 '16 12:03 amitmurthy

> @kswietli: Periodic @everywhere gc(true) worked for me as a stopgap

I can confirm that @everywhere gc() works (but the code is slightly slower).

[pao: quoting]

CodeLenz avatar Mar 14 '16 12:03 CodeLenz

It was a genuine leak in my case. Unless "brings the machine down without explicit calls to gc(true)" is a normal situation. Memory consumption was 1% with repeated calls to gc(true), and the same after the rewrite that was done; 20% without explicit gc(true). This is per process. The system would reliably hit swap towards the end of some of the longer runs of the program. I also think calling just gc() wasn't doing the trick, but it may just be my mind playing tricks on me. I can check, if there is interest in that.

kswietli avatar Mar 14 '16 13:03 kswietli

Yes please.

It is not a leak in the sense of losing references to objects without the memory being freed by the system. But the need to call gc() explicitly means that we do have an issue that needs to be fixed. It would be great if you could provide code that can simulate the problem.

amitmurthy avatar Mar 14 '16 13:03 amitmurthy

Is the problem that the gc does not know what size your things are, since they are remote and the remote ref looks really small? The gc's trigger is (mainly) "how many bytes you allocated since last time".

There are accounting functions that are used for C malloc; you might be able to do more or less the same thing for remote objects (except without doing any allocation, just the accounting)?

See https://github.com/JuliaLang/julia/blob/master/src/gc.c#L2681
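To illustrate the point (a sketch in 0.4-era syntax, assuming a worker with id 2 exists): the local handle is tiny no matter how much memory it pins remotely, so it barely moves the master's allocation counter.

# remotecall(worker_id, f, args...) returns a RemoteRef (0.4 argument order).
rr = remotecall(2, rand, 10_000, 10_000)   # ~800 MB allocated on worker 2
# rr itself is just a few integer fields (where, whence, id), so the
# master's gc sees almost no new allocation and has little reason to run.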

carnaval avatar Mar 14 '16 16:03 carnaval

What changed between 0.4.0 and 0.4.3 that is causing such different behaviour? I was looking for something gc-related in the changelogs and could not find it.

CodeLenz avatar Mar 14 '16 19:03 CodeLenz

We should really figure out what changed here since ideally point upgrades shouldn't break people's programs that used to work (unless they depended on buggy behavior).

StefanKarpinski avatar Mar 14 '16 22:03 StefanKarpinski

I'll do a bisect.

Is the problem that the gc does not know what size your things are since they are remote and the remote ref looks really small ?

Yes and no. Because they are small they are not being finalized quickly enough (finalization sends a message to the remote side to actually release the resource). But we saw in JuliaLang/Distributed.jl#35 that, even with the "release message" issue being addressed (by manually finalizing), a slow, steady increase in memory persisted. I think it is an issue of a full gc not being performed in time.
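The manual finalization mentioned above looks roughly like this (a sketch, 0.4-era API):

rr = remotecall(2, rand, 10_000, 10_000)
x  = fetch(rr)   # copy the result to the master
finalize(rr)     # send the release message to worker 2 immediately,
                 # instead of waiting for the master's gc to find the handle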

Gc's trigger is (mainly) "how many bytes you allocated since last time".

While running a full gc too frequently will slow down programs, I feel it is imperative that we support a configurable max heap size: a hard limit that triggers a full gc whenever it is reached.
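For comparison, Java's startup flag caps the heap hard; Julia eventually gained a related (softer) option long after this thread, the --heap-size-hint flag in Julia 1.9:

# Java: hard-cap the heap at 2 GiB
java -Xmx2g MyApp

# Julia 1.9+: ask the runtime to keep the heap near a target size
julia --heap-size-hint=2G script.jl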

amitmurthy avatar Mar 15 '16 03:03 amitmurthy

It appears this is a Windows issue.

I just tested on OSX for 1000 iterations of both the original and the revised samples (without a sleep) and could not detect a leak. The residual steady-state allocations (after 1000 iterations) were approx. 500MB for the master and 400MB for the workers. These were released only after a couple of @everywhere gc() calls, when they dropped to 170MB and 110MB respectively. But this is expected behavior if GC triggering only depends on new allocation requests.

amitmurthy avatar Mar 15 '16 05:03 amitmurthy

I'm about a third done reducing my case. Gonna be a few more days, I'm pretty busy.

I am running Arch Linux, so it isn't a Windows-specific problem, at least with my code.

kswietli avatar Mar 15 '16 07:03 kswietli

It is also happening on Linux (Fedora 23, with 0.4.3...)

CodeLenz avatar Mar 15 '16 15:03 CodeLenz

Could you post your versioninfo()? I am unable to detect a leak on Ubuntu 15.10 with the official http://julialang.org/ 0.4.3 release:

julia> versioninfo()
Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

I have been running

# Dimension - depends on available RAM
const N = 5000

# Number of iterations
const NI = 1000

# Time to wait inside the loop body
const tw = 5

# Number of procs - depends on the available number of cores
const NP = 3

# If needed, add the worker processes
if nprocs() == 1
    addprocs(NP)
end

# Main loop...
for i = 1:NI

    println("\n Iteration ", i)

    # Generate a random sparse matrix
    K = sprand(N, N, 0.5)

    # Allocate a 6-column array
    F = rand(N, 6)

    U = @sync @parallel (hcat) for j = 1:6

        # In my real code there are a lot of calls here, with many
        # local allocations... this is why the graphic has larger hills
        # when compared to this example.
        K*F[:,j]

    end #@parallel

end # i

top shows a max memory usage of 3.2% for the workers.

[screenshot: top output showing steady worker memory usage, 2016-03-15]

amitmurthy avatar Mar 15 '16 15:03 amitmurthy

This is a generic linux image from https://julialang.s3.amazonaws.com/bin/linux/x64/0.4/julia-0.4.3-linux-x86_64.tar.gz

amitmurthy avatar Mar 15 '16 15:03 amitmurthy

I have tried that reduced code; it works fine on my Linux. That probably means that my problem is distinct from this one.

I hope to have some breathing room over the weekend to reduce my stuff to something postable.

kswietli avatar Mar 15 '16 20:03 kswietli

Thanks for helping track this down, @kswietli and @CodeLenz.

StefanKarpinski avatar Mar 16 '16 00:03 StefanKarpinski

@StefanKarpinski Thank you for such amazing software and such an open community. All my students are already using Julia in their theses and undergraduate research projects. This issue happened with one of my graduate students when upgrading from 0.4.0 to 0.4.3.

@amitmurthy Right now I don't have access to the Linux machine, but as soon as possible I will post the versioninfo().

CodeLenz avatar Mar 16 '16 10:03 CodeLenz

julia> versioninfo()
Julia Version 0.4.3
Commit a2f713d* (2016-01-12 21:37 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

CodeLenz avatar Mar 16 '16 12:03 CodeLenz

Hmmm, I just ran the test script on a Fedora 23 on AWS with the generic binary and could not detect a leak.

The processes reached a steady state quickly enough.

[screenshot: process memory at steady state, 2016-03-17]


julia> versioninfo()
Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

I wonder if the errors you are seeing are due to the system running out of memory. How much RAM do the machines have?

Looking forward to @kswietli's script to debug this further.

amitmurthy avatar Mar 17 '16 02:03 amitmurthy

Good news: I got it down to a very reasonable size. Bad news: it hits swap even without anything parallel on 0.4.3. You can change the pmap to a map and it will still swap.

The master from a short while back appears to cope much better, though I gave it 15 minutes and it still kept growing. 0.4.3 runs out pretty much immediately. The fact that it keeps on growing still unnerves me.

type Inner
    data :: Int64
end

type Outer
    inners :: Vector{Inner}
end

function generate_outer()
    inners = Array(Inner, 1000)

    for position in 1:1000
        inners[position] = Inner(0)
    end

    return Outer(inners)
end

function loop(unused)
    outers = Array(Outer, 1000)

    for position in 1:1000
        outers[position] = generate_outer()
    end

    return
end    

function sweep()
    while true
        dontcares = Array(Int64, 100)
        pmap(loop, dontcares)

        #Uncomment this to not hit swap on 0.4.3.
        #gc()
    end
end

sweep()

julia> versioninfo()
Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1226 v3 @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libm
  LLVM: libLLVM-3.3

julia> versioninfo()
Julia Version 0.5.0-dev+3131
Commit b3b1411 (2016-03-13 10:42 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1226 v3 @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

EDIT: Sorry about the layout. GitHub is not being nice to my code for some reason, so the attached code and the versioninfos look scrambled.

edit: fixed formatting - Amit.

kswietli avatar Mar 17 '16 10:03 kswietli

Thanks @kswietli , I can confirm the same behavior on OSX.

Removing the "parallel" label, as the underlying issue does not depend on parallel execution: map has the same "leak".
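Concretely, swapping the pmap call in sweep() above for its serial equivalent reproduces the same growth:

map(loop, dontcares)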

amitmurthy avatar Mar 17 '16 10:03 amitmurthy

Truly my pleasure. I'm very glad to help make my favourite language better. Figured this issue is a good start. Let me know if you need anything else.

kswietli avatar Mar 17 '16 10:03 kswietli

The same issue occurs on 0.4.0 too with the above code. So the problem that @CodeLenz is seeing as a regression between 0.4.0 and 0.4.3 may be different.

amitmurthy avatar Mar 17 '16 11:03 amitmurthy

@kswietli, I have created a new issue at JuliaLang/julia#15543 for the problem you identified, since it may be different from the one reported by @CodeLenz.

amitmurthy avatar Mar 17 '16 14:03 amitmurthy

I'm not sure if this is the same issue, but I posted a question to the julia-users mailing list about memory usage AFTER calling pmap: https://groups.google.com/forum/#!topic/julia-users/GlnokJuz_54.

I wrote a simple, self-contained MWE here: https://gist.github.com/tcovert/3b9b982ba8b13f5984bebb63887b6def.

The TLDR is that after pmap() has done the work it's supposed to do, all of the worker processes still have memory allocations (according to top) well in excess of what I was expecting.
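A sketch of one way to check those numbers from within Julia rather than top (0.4-era syntax, assuming a Linux-like system where ps is available):

# Print the resident set size of each worker process.
for p in workers()
    pid    = remotecall_fetch(p, getpid)                      # 0.4 argument order
    rss_kb = parse(Int, strip(readall(`ps -o rss= -p $pid`)))
    println("worker $p (pid $pid): $(div(rss_kb, 1024)) MB resident")
end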

tcovert avatar May 17 '16 15:05 tcovert

Closing as this is related to a very old version of Julia.

CodeLenz avatar Aug 03 '20 22:08 CodeLenz