RemoteRef memory leak when serialized to a different worker
Scenario:
- Master (pid 1) creates a RemoteRef on worker 2
- The RemoteRef is sent to worker 3 as part of a message, but is not used, assigned, or stored on 3
- The reference to the RemoteRef is removed on 1 and gc() is called everywhere
- The reference continues to exist on 2, since 2 has not received a "del msg" from 3.
julia> addprocs(2)
2-element Array{Int64,1}:
2
3
julia> # create a reference on pid 2
rr = RemoteRef(2)
RemoteRef(2,1,4)
julia> # See if anything has actually been created on worker 2
Base.remote_do(2, ()->println(keys(Base.PGRP.refs)))
From worker 2: Any[]
julia> # Nope, nothing
put!(rr, :OK)
RemoteRef(2,1,4)
julia> # Now, see again
Base.remote_do(2, ()->println(keys(Base.PGRP.refs)))
From worker 2: Any[(1,4)]
julia> # It exists.
# Let us send this reference to a 3rd worker.
Base.remote_do(3, x->nothing, rr)
julia> # Check which workers are supposed to hold references to this RemoteRef
Base.remote_do(2, ()->println(Base.PGRP.refs[(1,4)].clientset))
From worker 2: IntSet([1, 3])
julia> # 2 believes that 1 and 3 hold a reference
julia> # Clear locally and run gc()
rr=nothing
julia> @everywhere gc()
julia> @everywhere gc()
julia> @everywhere gc()
julia> # 1 is cleared, but worker 2 believes that 3 continues to hold a reference
Base.remote_do(2, ()->println(Base.PGRP.refs[(1,4)].clientset))
From worker 2: IntSet([3])
julia>
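To make the bookkeeping in the transcript above concrete, here is a minimal sketch of owner-side client tracking. It is not the actual Base implementation: the names add_client! and del_client! and the plain Dict/Set layout are hypothetical, chosen only to mirror the clientset shown above. It assumes the owner adds a pid whenever the reference is serialized to that process and removes it only when that process sends its "del msg".

```julia
# Hypothetical model of the owner-side bookkeeping; names are illustrative,
# not Base internals.
refs = Dict{Tuple{Int,Int},Set{Int}}()   # ref id => pids believed to hold it

# Record that `pid` received a copy of the reference `id`.
function add_client!(refs, id, pid)
    if !haskey(refs, id)
        refs[id] = Set{Int}()
    end
    push!(refs[id], pid)
    return refs
end

# Handle a "del msg" from `pid`; release the entry once no clients remain.
function del_client!(refs, id, pid)
    haskey(refs, id) || return refs
    delete!(refs[id], pid)
    if isempty(refs[id])
        delete!(refs, id)
    end
    return refs
end

id = (1, 4)
add_client!(refs, id, 1)    # master created the reference
add_client!(refs, id, 3)    # reference was serialized to worker 3
del_client!(refs, id, 1)    # master's finalizer ran and sent its del msg
println(haskey(refs, id))   # true: worker 3 has not sent its del msg
del_client!(refs, id, 3)    # what should happen once 3 finalizes its copy
println(haskey(refs, id))   # false: entry released
```

In this model the leak corresponds to the last del_client! call never happening, which leaves the entry alive on the owner indefinitely.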
I have tracked it down to finalizers not being called on the RemoteRef. The finalizer is what sends a del_msg to the process actually holding the value.
Finalizers are also not being called for regular objects when they are serialized to a remote worker.
julia> addprocs(2)
2-element Array{Int64,1}:
2
3
julia> # creates workers with pids 2 and 3
@everywhere begin
    function finalize_foo(f)
        v = f.foo
        @schedule println("FOO finalized $v")
    end
    type Foo
        foo
        Foo(x) = (f=new(x); finalizer(f, finalize_foo); f)
    end
    function Base.serialize(s::SerializationState, f::Foo)
        invoke(serialize, Tuple{SerializationState, Any}, s, f)
    end
    function Base.deserialize(s::SerializationState, t::Type{Foo})
        f = invoke(deserialize, Tuple{SerializationState, DataType}, s, t)
        Foo(myid())
    end
end
julia> Base.remote_do(3, x->nothing, Foo(0))
RemoteRef(3,1,6)
julia> @everywhere gc()
FOO finalized 0
julia> @everywhere gc()
julia> @everywhere gc()
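As can be seen, only the master's copy, Foo(0), was finalized. The copy created by deserialize on worker 3, which would have printed FOO finalized 3, was never finalized.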
cc: @carnaval, @JeffBezanson
Some progress:
addprocs(2)
rr = RemoteRef(2)
put!(rr, :OK)
Base.remote_do(3, x->nothing, rr)
rr=nothing
@everywhere gc()
@everywhere gc()
Base.remote_do(2, ()->println(Base.PGRP.refs[(1,4)].clientset))
# Execute a dummy remote_do again. This collects the previous ref
Base.remote_do(3, myid)
@everywhere gc()
Base.remote_do(2, ()->println(Base.PGRP.refs))
The second remote_do results in the reference finally being collected.
I tried changing https://github.com/JuliaLang/julia/blob/dbe94d156bbb07f0c30af6b49a42ab09416f5df7/base/multi.jl#L838-L846 to
elseif is(msg, :do)
    f = deserialize(r_stream)
    args = deserialize(r_stream)
    #print("got args: $args\n")
    let f=f, args=args
        @schedule begin
            run_work_thunk(RemoteValue(), ()->f(args...))
            f = nothing
            args = nothing
        end
    end
    f = nothing
    args = nothing
but that doesn't help.
Do let blocks also keep references? How do we clear them?
Simpler example:
function foo(rr)
    while true
        b=take!(rr)
        let b=b
            f = x->nothing
            @schedule ()->f(b)
            b = nothing
        end
        b=nothing
    end
end
rr = RemoteRef()
@schedule foo(rr)
put!(rr, ones(10^8));
gc()
gc()
gc()
A reference to the array is held until the loop body runs again, for example after a put!(rr, :OK). The RemoteRef itself no longer holds the value, as evidenced by:
julia> isready(rr)
false
julia> Base.PGRP.refs
Dict{Any,Any} with 3 entries:
(1,0) => Base.RemoteValue(false,nothing,Condition(Any[Task (waiting) @0x00007f8d3187f850]),Condition(Any[]),IntSet([1]),0)
(1,2) => Base.RemoteValue(false,nothing,Condition(Any[Task (waiting) @0x00007f8d325fb6c0]),Condition(Any[]),IntSet([1]),0)
(1,1) => Base.RemoteValue(false,nothing,Condition(Any[]),Condition(Any[]),IntSet([1]),0)
Removing the let statement makes the problem go away.
Forgot to add a comment: as we discussed yesterday, this seems to be because the value is stored in a temporary gensym. @vtjnash?
@yuyichao this - https://github.com/JuliaLang/Distributed.jl/issues/25 - is still an issue. Any ideas?