Mamba.jl icon indicating copy to clipboard operation
Mamba.jl copied to clipboard

parallel MCMC on 0.5

Open tpapp opened this issue 7 years ago • 23 comments

Currently pmap seems to be disabled for v"0.5.0". I am wondering if there is a workaround, eg running the samples separately and then somehow merging. Usually run 3–5 chains and the speed gains would be significant.

tpapp avatar Nov 29 '16 14:11 tpapp

Issue is with Julia's pmap, we are waiting on some updates to it: https://github.com/JuliaLang/julia/issues/19000

bdeonovic avatar Nov 29 '16 18:11 bdeonovic

What exactly is the issue? While the keyword arguments to pmap have changed, the behavior has not.

amitmurthy avatar Nov 30 '16 03:11 amitmurthy

I think @brian-j-smith will be better able to answer this question, but if I recall the issue was that not all anonymous functions were being sent out to the workers.

bdeonovic avatar Nov 30 '16 03:11 bdeonovic

AFAIK the issue is with capturing globals in the anonymous function. Which existed prior to 0.5 too. A workaround for that issue is to wrap the closure creation in a let block.

julia> a=1
1

julia> remotecall_fetch(()->a, 2)
ERROR: On worker 2:
UndefVarError: a not defined
 in #1 at ./REPL[3]:1
.....

julia> let a=a; remotecall_fetch(()->a, 2); end
1

amitmurthy avatar Nov 30 '16 03:11 amitmurthy

The pmap errors we are getting now in 0.5 do not occur in 0.4. Below is the 0.5 error, for what its worth. When I get some free time, I'll try to come up with a stand-alone test case that reproduces the error.

ERROR: LoadError: On worker 2:
UndefVarError: ##298#299 not defined
 in deserialize_datatype at .\serialize.jl:823
 in handle_deserialize at .\serialize.jl:571
 in deserialize at .\serialize.jl:881
 in deserialize_datatype at .\serialize.jl:835
 in handle_deserialize at .\serialize.jl:571
 in deserialize at .\serialize.jl:894
 in deserialize_datatype at .\serialize.jl:835
 in handle_deserialize at .\serialize.jl:571
 in deserialize at .\serialize.jl:881
 in deserialize_datatype at .\serialize.jl:835
 in handle_deserialize at .\serialize.jl:571
 in deserialize_array at .\serialize.jl:709
 in handle_deserialize at .\serialize.jl:569
 in deserialize at .\serialize.jl:541
 in ntuple at .\tuple.jl:65
 in handle_deserialize at .\serialize.jl:562
 in deserialize_msg at .\multi.jl:120
 in message_handler_loop at .\multi.jl:1317
 in process_tcp_streams at .\multi.jl:1276
 in #620 at .\event.jl:68
 in #remotecall_fetch#608(::Array{Any,1}, ::Function, ::Function, ::Base.Worker,
 ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\multi.jl:1070
 in remotecall_fetch(::Function, ::Base.Worker, ::Array{Any,1}, ::Vararg{Array{A
ny,1},N}) at .\multi.jl:1062
 in #remotecall_fetch#611(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Arr
ay{Any,1}, ::Vararg{Array{Any,1},N}) at .\multi.jl:1080
 in remotecall_fetch(::Function, ::Int64, ::Array{Any,1}, ::Vararg{Array{Any,1},
N}) at .\multi.jl:1080
 in #remotecall_pool#691(::Array{Any,1}, ::Function, ::Function, ::Function, ::W
orkerPool, ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\workerpool.jl:93
 in remotecall_pool(::Function, ::Function, ::WorkerPool, ::Array{Any,1}, ::Vara
rg{Array{Any,1},N}) at .\workerpool.jl:91
 in #remotecall_fetch#694(::Array{Any,1}, ::Function, ::Function, ::WorkerPool,
::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\workerpool.jl:124
 in remotecall_fetch(::Function, ::WorkerPool, ::Array{Any,1}, ::Vararg{Array{An
y,1},N}) at .\workerpool.jl:124
 in (::Base.###699#700#702{WorkerPool,Mamba.#mcmc_worker!})(::Array{Any,1}, ::Fu
nction, ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\workerpool.jl:151
 in (::Base.##699#701)(::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\workerpool
.jl:151
 in macro expansion at .\asyncmap.jl:63 [inlined]
 in (::Base.##757#759{Base.AsyncCollector,Base.AsyncCollectorState})() at .\task
.jl:360

brian-j-smith avatar Nov 30 '16 03:11 brian-j-smith

This is the reduced case:

julia> c = x->x
(anonymous function)

julia> d = x->c(x)
(anonymous function)

On 0.4:

julia> pmap(d, 1:2)
2-element Array{Any,1}:
 1
 2

On 0.5:

julia> pmap(d, 1:2)
ERROR: On worker 2:
UndefVarError: c not defined
 in #3 at ./REPL[2]:1

amitmurthy avatar Nov 30 '16 04:11 amitmurthy

Ref: https://github.com/JuliaLang/julia/issues/19456

amitmurthy avatar Nov 30 '16 05:11 amitmurthy

Great! Thanks for coming up with a reduced case. I'll keep an eye on your other thread and possible developments in julia to address this. A work-around on our package side may be challenging since some anonymous functions are user-supplied.

brian-j-smith avatar Nov 30 '16 14:11 brian-j-smith

@amitmurthy still having pmap issues in v0.6

julia> include("../deonovib\\.julia/v0.6\\Mamba\\doc/examples/rats.jl")
MCMC Simulation of 10000 Iterations x 2 Chains...

ERROR: LoadError: On worker 2:
UndefVarError: ##332#333 not defined
deserialize_datatype at .\serialize.jl:969
handle_deserialize at .\serialize.jl:674
deserialize at .\serialize.jl:634
handle_deserialize at .\serialize.jl:681
deserialize at .\serialize.jl:1075
handle_deserialize at .\serialize.jl:687
deserialize at .\serialize.jl:1088
handle_deserialize at .\serialize.jl:687
deserialize at .\serialize.jl:1075
handle_deserialize at .\serialize.jl:687
deserialize_array at .\serialize.jl:871
handle_deserialize at .\serialize.jl:672
deserialize at .\serialize.jl:634
ntuple at .\tuple.jl:108
handle_deserialize at .\serialize.jl:664
deserialize_msg at .\distributed\messages.jl:98
message_handler_loop at .\distributed\process_messages.jl:161
process_tcp_streams at .\distributed\process_messages.jl:118
#99 at .\event.jl:73
Stacktrace:
 [1] #570 at .\asyncmap.jl:178 [inlined]
 [2] foreach(::Base.##570#572, ::Array{Any,1}) at .\abstractarray.jl:1731
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Array{Array{Any,1},1}, ::Vararg{Array{Array{Any,1},1},N} where N) at .\asyncmap.jl:178
 [4] wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Base.Distributed.##204#207{WorkerPool}, ::Function, ::Array{Array{Any,1},1}, ::Vararg{Array{Array{Any,1},1},N} where N) at .\asyncmap.jl:154
 [5] #async_usemap#555(::Function, ::Void, ::Function, ::Base.Distributed.##188#190, ::Array{Array{Any,1},1}, ::Vararg{Array{Array{Any,1},1},N} where N) at .\asyncmap.jl:103

 [6] (::Base.#kw##async_usemap)(::Array{Any,1}, ::Base.#async_usemap, ::Function, ::Array{Array{Any,1},1}, ::Vararg{Array{Array{Any,1},1},N} where N) at .\<missing>:0
 [7] (::Base.#kw##asyncmap)(::Array{Any,1}, ::Base.#asyncmap, ::Function, ::Array{Array{Any,1},1}) at .\<missing>:0
 [8] #pmap#203(::Bool, ::Int64, ::Void, ::Array{Any,1}, ::Void, ::Function, ::WorkerPool, ::Function, ::Array{Array{Any,1},1}) at .\distributed\pmap.jl:126
 [9] pmap(::WorkerPool, ::Function, ::Array{Array{Any,1},1}) at .\distributed\pmap.jl:101
 [10] #pmap#213(::Array{Any,1}, ::Function, ::Function, ::Array{Array{Any,1},1}) at .\distributed\pmap.jl:156
 [11] pmap2(::Function, ::Array{Array{Any,1},1}) at C:\Users\deonovib\.julia\v0.6\Mamba\src\utils.jl:97
 [12] mcmc_master!(::Mamba.Model, ::UnitRange{Int64}, ::Int64, ::Int64, ::UnitRange{Int64}, ::Bool) at C:\Users\deonovib\.julia\v0.6\Mamba\src\model\mcmc.jl:52
 [13] #mcmc#29(::Int64, ::Int64, ::Int64, ::Bool, ::Function, ::Mamba.Model, ::Dict{Symbol,Any}, ::Array{Dict{Symbol,Any},1}, ::Int64) at C:\Users\deonovib\.julia\v0.6\Mamba\src\model\mcmc.jl:32
 [14] (::Mamba.#kw##mcmc)(::Array{Any,1}, ::Mamba.#mcmc, ::Mamba.Model, ::Dict{Symbol,Any}, ::Array{Dict{Symbol,Any},1}, ::Int64) at .\<missing>:0
 [15] include_from_node1(::String) at .\loading.jl:569
 [16] include(::String) at .\sysimg.jl:14
while loading C:\Users\deonovib\.julia\v0.6\Mamba\doc\examples\rats.jl, in expression starting on line 121

bdeonovic avatar Jul 17 '17 18:07 bdeonovic

Does your code serialize the same anonymous function across multiple workers? i.e., it is initially defined on wrkr_1 which sends it to wrkr_2 which sends it to wrkr_3 and so on (see code example https://github.com/JuliaLang/julia/pull/22675#issuecomment-313000815). If so, this should be fixed by https://github.com/JuliaLang/julia/pull/22675 which is backport pending onto 0.6.

amitmurthy avatar Jul 18 '17 03:07 amitmurthy

I don't think that is the issue. The relevant code seems like a straight forward use of pmap

  lsts = [
    Any[m, states[k], window, burnin, thin, ChainProgress(frame, k, N)]
    for k in chains
  ]
  results = pmap(mcmc_worker!, lsts)

where m is a Type that contains anonymous functions. I think m is the real issue. it is:

  type Model
    nodes::Dict{Symbol, Any}
    samplers::Vector{Sampler}
    states::Vector{ModelState}
    iter::Int
    burnin::Int
    hasinputs::Bool
    hasinits::Bool
  end

and Sampler is

  type Sampler{T}
    params::Vector{Symbol}
    eval::Function
    tune::T
    targets::Vector{Symbol}
  end

where a Sampler might look something like:

function RWM{T<:Real}(params::ElementOrVector{Symbol},
                      scale::ElementOrVector{T}; args...)
  samplerfx = function(model::Model, block::Integer)
    block = SamplingBlock(model, block, true)
    v = SamplerVariate(block, scale; args...)
    sample!(v, x -> logpdf!(block, x))
    relist(block, v)
  end
  Sampler(params, samplerfx, RWMTune())
end

so I think we are having trouble passing m to workers because it has quite a heavy nesting of types, named functions, and anonymous functions.

bdeonovic avatar Jul 18 '17 13:07 bdeonovic

Is there a sample code that I can test with locally? I assume all dependent packages are publicly available?

amitmurthy avatar Jul 18 '17 17:07 amitmurthy

Currently we have it hacked such that only sequential processing is possible. Simply Pkg.add("Mamba") and then modify the file .julia/v0.6/Mamba/src/utils.jl by adding comment before Version hack.

...
  if (nprocs() > 1) #& (VERSION < v"0.5-")
...

Run any of the examples that utilize multiple chains you will get the error.

e.g.

 include(".julia/v0.6/Mamba/doc/examples/rats.jl")

or after doing the include you can also replicate error with

remotecall_fetch(fieldnames, 2, model)

which is a little more direct and shows that the issue is sending model out to the different workers.

bdeonovic avatar Jul 18 '17 19:07 bdeonovic

On OSX I had trouble installing Mamba (Julia 0.6) . Gadfly related errors - does not work even with a checkout of Gadfly master. Will try again after a few days.

amitmurthy avatar Jul 19 '17 16:07 amitmurthy

What was the error? Seems strange since Travis shows it installing and all tests pass on OSX https://travis-ci.org/brian-j-smith/Mamba.jl

bdeonovic avatar Jul 19 '17 17:07 bdeonovic

Fails with

WARNING: could not import Multimedia.@try_display into Gadfly
ERROR: LoadError: UndefVarError: @try_display not defined
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:569
 [2] include(::String) at ./sysimg.jl:14
 [3] anonymous at ./<missing>:2
while loading /Users/amitm/.julia/v0.6/Gadfly/src/Gadfly.jl, in expression starting on line 1052
ERROR: LoadError: Failed to precompile Gadfly to /Users/amitm/.julia/lib/v0.6/Gadfly.ji.
Stacktrace:
 [1] compilecache(::String) at ./loading.jl:703
 [2] _require(::Symbol) at ./loading.jl:490

amitmurthy avatar Jul 20 '17 03:07 amitmurthy

I'll delete the v0.6 packages directory and try again.

amitmurthy avatar Jul 20 '17 04:07 amitmurthy

The immediate culprit appears to be model.nodes[:s2_alpha].eval with

julia> typeof(model.nodes[:s2_alpha].eval)
Mamba.##338#339

It is not being serialized as it is defined under Mamba. Where and how is it being initialized?

amitmurthy avatar Jul 20 '17 05:07 amitmurthy

julia> addprocs(2)
julia> @everywhere module Foo
           type Bar
               x
               Bar() = new(()->1) 
           end
       end

WARNING: deprecated syntax "type" at REPL[2]:2.
Use "mutable struct" instead.

julia> bar = Foo.Bar()
Foo.Bar(Foo.#1)

julia> remotecall_fetch(typeof, 2, bar.x)
Foo.##1#2

However, if the anonymous function is explicitly created under module Foo

julia> bar.x = eval(Foo, :(()->1))
(::#3) (generic function with 1 method)

julia> remotecall_fetch(typeof, 2, bar.x)
ERROR: On worker 2:
UndefVarError: ##3#4 not defined
deserialize_datatype at ./serialize.jl:986
handle_deserialize at ./serialize.jl:687
deserialize at ./serialize.jl:647

amitmurthy avatar Jul 20 '17 05:07 amitmurthy

julia> addprocs(2);

julia> @everywhere module Foo
           mutable struct Bar
               x
               Bar() = new(()->1) 
           end

           make_bar() = (b=Bar(); b.x=()->1; b)
       end;


julia> bar = Foo.Bar();

julia> typeof(bar.x)
Foo.##1#2

julia> remotecall_fetch(typeof, 2, bar.x); # OK

julia> bar = Foo.make_bar();

julia> typeof(bar.x)
Foo.##3#4

julia> remotecall_fetch(typeof, 2, bar.x); # OK

julia> bar.x = eval(Foo, :(()->1));

julia> typeof(bar.x)
Foo.##5#6

julia> remotecall_fetch(typeof, 2, bar.x); # Not OK
ERROR: On worker 2:
UndefVarError: ##5#6 not defined
deserialize_datatype at ./serialize.jl:986
handle_deserialize at ./serialize.jl:687

Only when the anonymous function is explicitly defined under Foo does there seem to be a problem.

amitmurthy avatar Jul 20 '17 07:07 amitmurthy

Thanks for the extensive analysis @amitmurthy what is the next step? Is this something that needs to be addressed in our package or is this something on the julialang side that could be addressed?

bdeonovic avatar Jul 20 '17 13:07 bdeonovic

Both I think.

You can check if you can avoid an eval under module Mamba. I'll open an issue on Julialang for addressing this.

amitmurthy avatar Jul 20 '17 14:07 amitmurthy

You can follow the discussion on the linked issue above.

Meanwhile, I could run rats.jl in parallel on 0.6 by eval'ing into Main.

diff --git a/src/utils.jl b/src/utils.jl
index 438dbd3..aee2a00 100644
--- a/src/utils.jl
+++ b/src/utils.jl
@@ -7,7 +7,7 @@ end
 function modelfxsrc(literalargs::Vector{Tuple{Symbol, DataType}}, f::Function)
   args = Expr(:tuple, map(arg -> Expr(:(::), arg[1], arg[2]), literalargs)...)
   expr, src = modelexprsrc(f, literalargs)
-  fx = eval(Expr(:function, args, expr))
+  fx = eval(Main, Expr(:function, args, expr))
   (fx, src)
 end
 
@@ -92,7 +92,7 @@ end
 ## called and will apply its error processing.
 
 function pmap2(f::Function, lsts::AbstractArray)
-  if (nprocs() > 1) & (VERSION < v"0.5-")
+  if (nprocs() > 1) #& (VERSION < v"0.5-")
     @everywhere importall Mamba
     pmap(f, lsts)
   else
diff --git a/src/variate.jl b/src/variate.jl
index 8875ad0..1c6b91c 100644
--- a/src/variate.jl
+++ b/src/variate.jl
@@ -112,7 +112,7 @@ const BinaryScalarMethods = [
 ]
 
 for op in BinaryScalarMethods
-  @eval ($op)(x::ScalarVariate, y::ScalarVariate) = ($op)(x.value, y.value)
+  @eval Main ($op)(x::Mamba.ScalarVariate, y::Mamba.ScalarVariate) = ($op)(x.value, y.value)
 end
 
 const RoundScalarMethods = [
@@ -123,8 +123,8 @@ const RoundScalarMethods = [
 ]
 
 for op in RoundScalarMethods
-  @eval ($op)(x::ScalarVariate) = ($op)(x.value)
-  @eval ($op){T}(::Type{T}, x::ScalarVariate) = ($op)(T, x.value)
+  @eval Main ($op)(x::Mamba.ScalarVariate) = ($op)(x.value)
+  @eval Main ($op){T}(::Type{T}, x::Mamba.ScalarVariate) = ($op)(T, x.value)
 end
 
 const UnaryScalarMethods = [
@@ -142,5 +142,5 @@ const UnaryScalarMethods = [
 ]
 
 for op in UnaryScalarMethods
-  @eval ($op)(x::ScalarVariate) = ($op)(x.value)
+  @eval Main ($op)(x::Mamba.ScalarVariate) = ($op)(x.value)
 end

amitmurthy avatar Jul 21 '17 04:07 amitmurthy