Distributed.jl

RemoteRef memory leak when serialized to a different worker

Open amitmurthy opened this issue 10 years ago • 4 comments

Scenario:

  • The master (pid 1) creates a RemoteRef on worker 2.
  • The RemoteRef is sent to worker 3 as part of a message, but is not used, assigned, or stored on 3.
  • The reference to the RemoteRef is removed on 1 and gc() is called everywhere.
  • The reference continues to exist on 2, since 2 has not received a "del msg" from 3.

julia> addprocs(2)
2-element Array{Int64,1}:
 2
 3

julia> # create a reference on pid 2
       rr = RemoteRef(2)
RemoteRef(2,1,4)

julia> # See if anything has actually been created on worker 2
       Base.remote_do(2, ()->println(keys(Base.PGRP.refs)))

        From worker 2:  Any[]

julia> # Nope, nothing
       put!(rr, :OK)
RemoteRef(2,1,4)

julia> # Now, see again
       Base.remote_do(2, ()->println(keys(Base.PGRP.refs)))

        From worker 2:  Any[(1,4)]

julia> # It exists.

       # Let us send this reference to a 3rd worker.
       Base.remote_do(3, x->nothing, rr)

julia> # Check which workers are supposed to hold references to this RemoteRef
       Base.remote_do(2, ()->println(Base.PGRP.refs[(1,4)].clientset))

        From worker 2:  IntSet([1, 3])

julia> # 2 believes that 1 and 3 hold a reference

julia> # Clear locally and run gc()
       rr=nothing

julia> @everywhere gc()
julia> @everywhere gc()
julia> @everywhere gc()

julia> # 1 is cleared, but worker 2 believes that 3 continues to hold a reference
       Base.remote_do(2, ()->println(Base.PGRP.refs[(1,4)].clientset))

      From worker 2:  IntSet([3])

I have tracked it down to finalizers not being called on the RemoteRef. The finalizer sends a del_msg to the processes actually holding the value.
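
To make the bookkeeping concrete, here is a toy, single-process model of what the owning worker tracks. It is only a sketch based on the behaviour described in this issue, not the actual Base implementation: `ToyRemoteValue`, `refs`, `add_client!` and `del_client!` are hypothetical names, and it is written in current Julia syntax rather than the 0.4-era syntax used elsewhere in this thread.

```julia
# Toy model of the owner's bookkeeping. All names here are hypothetical.
struct ToyRemoteValue
    clientset::Set{Int}     # pids the owner believes hold a reference
end

# Stands in for Base.PGRP.refs on the owning worker (worker 2 in the scenario).
const refs = Dict{Tuple{Int,Int},ToyRemoteValue}()

# The owner learns that `pid` holds a copy of the reference `id`.
function add_client!(id, pid)
    rv = get!(() -> ToyRemoteValue(Set{Int}()), refs, id)
    push!(rv.clientset, pid)
    return
end

# The owner receives a "del msg" from `pid` (sent by the finalizer on that pid).
function del_client!(id, pid)
    rv = get(refs, id, nothing)
    rv === nothing && return
    delete!(rv.clientset, pid)
    # Once no client is left, the entry (and the stored value) can be dropped.
    isempty(rv.clientset) && delete!(refs, id)
    return
end

id = (1, 4)
add_client!(id, 1)      # the master created the reference
add_client!(id, 3)      # the reference was serialized to worker 3
del_client!(id, 1)      # the master's finalizer ran after rr = nothing and gc()

# Worker 3's finalizer never ran, so no del msg arrived and the entry survives.
leaked = haskey(refs, id)
remaining = sort(collect(refs[id].clientset))
```

In this model the entry is only released once `del_client!(id, 3)` is called, which corresponds to the del msg that worker 3 never sends in the transcript above.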

Finalizers are also not being called for regular objects when they are serialized to a remote worker.

julia> addprocs(2)
2-element Array{Int64,1}:
 2
 3

julia> # creates workers with pids 2 and 3

       @everywhere begin

       function finalize_foo(f)
           v = f.foo
           @schedule println("FOO finalized $v")
       end

       type Foo
           foo
           Foo(x) = (f=new(x); finalizer(f, finalize_foo); f)
       end

       function Base.serialize(s::SerializationState, f::Foo)
           invoke(serialize, Tuple{SerializationState, Any}, s, f)
       end

       function Base.deserialize(s::SerializationState, t::Type{Foo})
           f = invoke(deserialize, Tuple{SerializationState, DataType}, s, t)
           Foo(myid())
       end

       end

julia> Base.remote_do(3, x->nothing, Foo(0))
RemoteRef(3,1,6)

julia> @everywhere gc()
FOO finalized 0

julia> @everywhere gc()
julia> @everywhere gc()

As can be seen, only the master's copy (foo = 0) was finalized; the Foo deserialized on worker 3 was never finalized.
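
For readers on current Julia, the transcript above uses 0.4-era syntax (`type`, `finalizer(obj, f)`, `gc()`). The following is a rough, local-only translation of the `Foo` type, assuming Julia 1.x. The `FINALIZED` log is a hypothetical addition so that finalization can be observed without printing from inside a finalizer. It does not reproduce the distributed bug; it only shows how to check deterministically that a finalizer was registered, using `finalize`.

```julia
# Hypothetical log of finalized values; not part of the original example.
const FINALIZED = Any[]

mutable struct Foo
    foo
    function Foo(x)
        f = new(x)
        # Julia 1.x argument order: finalizer(function, object)
        finalizer(obj -> push!(FINALIZED, obj.foo), f)
        return f
    end
end

f = Foo(0)
# Run the registered finalizer immediately instead of waiting for a GC cycle.
finalize(f)
```

Recording into a log like this on each worker, instead of printing, would be one way to check whether a deserialized object was ever finalized there.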

cc: @carnaval , @JeffBezanson

amitmurthy, Jun 17 '15 05:06

Some progress:

addprocs(2)
rr = RemoteRef(2)
put!(rr, :OK)
Base.remote_do(3, x->nothing, rr)
rr=nothing
@everywhere gc()
@everywhere gc()
Base.remote_do(2, ()->println(Base.PGRP.refs[(1,4)].clientset))

# Execute a dummy remote_do again. This collects the previous ref
Base.remote_do(3, myid)
@everywhere gc()
Base.remote_do(2, ()->println(Base.PGRP.refs))

The second remote_do results in the reference finally being collected.

I tried changing https://github.com/JuliaLang/julia/blob/dbe94d156bbb07f0c30af6b49a42ab09416f5df7/base/multi.jl#L838-L846 to

            elseif is(msg, :do)
                f = deserialize(r_stream)
                args = deserialize(r_stream)
                #print("got args: $args\n")
                let f=f, args=args
                    @schedule begin
                        run_work_thunk(RemoteValue(), ()->f(args...))
                        f = nothing
                        args = nothing
                    end
                end
                f = nothing
                args = nothing

but that doesn't help.

Do let blocks also keep references? How do we clear them?

amitmurthy, Jun 17 '15 07:06

Simpler example:

function foo(rr)
    while true
        b=take!(rr)
        let b=b
            f = x->nothing
            @schedule ()->f(b)
            b = nothing
        end
        b=nothing
    end
end

rr = RemoteRef()
@schedule foo(rr)

put!(rr, ones(10^8));
gc()
gc()
gc()

A reference to the array is held until the loop body runs again, for example after a put!(rr, :OK). The remote ref itself no longer holds the array, as evidenced by:

julia> isready(rr)
false

julia> Base.PGRP.refs
Dict{Any,Any} with 3 entries:
  (1,0) => Base.RemoteValue(false,nothing,Condition(Any[Task (waiting) @0x00007f8d3187f850]),Condition(Any[]),IntSet([1]),0)
  (1,2) => Base.RemoteValue(false,nothing,Condition(Any[Task (waiting) @0x00007f8d325fb6c0]),Condition(Any[]),IntSet([1]),0)
  (1,1) => Base.RemoteValue(false,nothing,Condition(Any[]),Condition(Any[]),IntSet([1]),0)

Removing the let statement makes the problem go away.

amitmurthy, Jun 17 '15 13:06

Forgot to add a comment: as we discussed yesterday, this seems to be because the value is stored in a temporary gensym. @vtjnash?

carnaval, Jun 21 '15 17:06

@yuyichao this - https://github.com/JuliaLang/Distributed.jl/issues/25 - is still an issue. Any ideas?

amitmurthy, May 22 '17 05:05