Pathfinder.jl

Non-reproducible behaviour of `pathfinder` when run in parallel (even if every task is passed its own RNG)

Open nsiccha opened this issue 7 months ago • 8 comments

As per the title. I have been unable to reproduce some results in a project I'm working on, and it seems to be due to pathfinder.

The following code results in output like the one below:

using Pathfinder, LogDensityProblems, Random, ForwardDiff, LogDensityProblemsAD
begin
    # Standard-normal target of dimension `dim`.
    struct StdNormal
        dim::Int64
    end
    LogDensityProblems.dimension(p::StdNormal) = p.dim
    LogDensityProblems.logdensity(::StdNormal, x) = -.5sum(abs2, x)
    # Funnel-shaped target of dimension `dim`.
    struct Funnel
        dim::Int64
    end
    LogDensityProblems.dimension(p::Funnel) = p.dim
    LogDensityProblems.logdensity(::Funnel, x) = -.5*x[1]^2-x[1]-.5sum(abs2, x[2:end]./exp(x[1]))

    n_rep = 100
    dim = 100
    rv = Vector{Any}(missing, n_rep)
    # Parallel runs on the normal target: every task gets its own identically
    # seeded RNG, and we record the post-run RNG state.
    Threads.@threads for i in 1:n_rep
        rng = Xoshiro(1)
        problem = ADgradient(:ForwardDiff, StdNormal(dim))
        pathfinder(problem; rng)
        rv[i] = rng
    end
    # If the runs were reproducible, all post-run RNG states would be
    # identical and the unique count would be 1.
    display("Parallel normal"=>rv |> unique |> length)
    # Serial runs on the funnel target, for comparison.
    for i in 1:n_rep
        rng = Xoshiro(1)
        problem = ADgradient(:ForwardDiff, Funnel(dim))
        pathfinder(problem; rng)
        rv[i] = rng
    end
    display("Serial funnel"=>rv |> unique |> length)
    # Parallel runs on the funnel target.
    Threads.@threads for i in 1:n_rep
        rng = Xoshiro(1)
        problem = ADgradient(:ForwardDiff, Funnel(dim))
        pathfinder(problem; rng)
        rv[i] = rng
    end
    display("Parallel funnel"=>rv |> unique |> length)
end
"Parallel normal" => 1
"Serial funnel" => 1
┌ Warning: 2 (10.0%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (4.5%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (5.9%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (5.0%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (4.3%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (6.2%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (5.0%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (5.6%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
"Parallel funnel" => 11

with Pkg.status() output of

(Pathfinder) pkg> status
Status `~/github/nsiccha/mwe/Pathfinder/Project.toml`
⌃ [f6369f11] ForwardDiff v0.10.38
  [6fdf6af0] LogDensityProblems v2.1.2
  [996a588d] LogDensityProblemsAD v1.13.0
  [b1d3bc72] Pathfinder v0.9.16

nsiccha avatar May 12 '25 13:05 nsiccha

Thanks for reporting! It's not obvious to me whether this test should pass. What should be the case is that Pathfinder's own results are the same whenever the initial RNG state is the same. Are you not seeing that?

If the Pathfinder results are different, then to isolate the issue: does your test pass if you replace the pathfinder call inside the loop with code that optimizes using Optim.LBFGS?
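That isolation check might look like the following sketch (hypothetical code, not part of the issue thread; it assumes Optim.jl and LineSearches.jl are installed and constructs a fresh LBFGS in every task, so no optimizer state is shared):

```julia
using Optim, LineSearches, Random

# Deterministic objective: the negative log density of a standard normal.
f(x) = 0.5 * sum(abs2, x)

n_rep, dim = 100, 100
results = Vector{Any}(missing, n_rep)
Threads.@threads for i in 1:n_rep
    x0 = randn(Xoshiro(1), dim)  # identical starting point in every task
    opt = optimize(f, x0, LBFGS(; linesearch=LineSearches.HagerZhang()))
    results[i] = Optim.minimizer(opt)
end
# If plain LBFGS is reproducible across threads, the unique count is 1.
display("Parallel LBFGS" => results |> unique |> length)
```

Because each task builds its own LBFGS object here, any divergence in `results` would implicate Optim itself rather than shared optimizer state.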

sethaxen avatar May 12 '25 13:05 sethaxen

For a simpler MWE, I actually did observe the same returned draw(s), but different post-pathfinder RNG states. For the original problem, I'm unsure whether both the draws/distributions and the post-pathfinder RNG states differ, or maybe just the RNG states.

For the above code, but with rv[i] = pathfinder(problem; rng).draws, I'm getting very similar output:

"Parallel normal" => 2
"Serial funnel" => 1
┌ Warning: 1 (6.2%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (4.5%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 2 (10.0%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (6.2%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (4.8%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (5.9%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (4.5%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
"Parallel funnel" => 30

Let me also check how it goes with only the optimization. I do believe the bug must originate somewhere else: I haven't found any loose rand(...) calls in Pathfinder.jl, if that makes sense.

nsiccha avatar May 12 '25 14:05 nsiccha

Well, I'm not sure what's going on, but things diverge (occasionally) at least as early as after the first line search in Optim.jl. As for why, I have no idea 🤷

nsiccha avatar May 12 '25 16:05 nsiccha

Looks like it's due to the way the default optimizer reuses global objects.

nsiccha avatar May 13 '25 08:05 nsiccha

Sounds like this is very bad. ALL THREADS will repeatedly and concurrently empty and overwrite DEFAULT_LINE_SEARCH's cache.
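A toy sketch of this failure mode (the Cache type below is hypothetical and stands in for a globally shared line-search object; it is not Pathfinder's or Optim's actual code): when a single mutable object is reused by every thread, the value each task reads back depends on thread interleaving.

```julia
# Hypothetical mutable "cache" shared by all tasks, mimicking a global
# default line-search object whose internal state is reused everywhere.
mutable struct Cache
    value::Float64
end
const SHARED = Cache(0.0)

function step!(cache, x)
    cache.value = x          # every task overwrites the shared state...
    sleep(0.001 * rand())    # ...another task may interleave here...
    return cache.value       # ...so the value read back may not be x
end

results = zeros(10)
Threads.@threads for i in 1:10
    results[i] = step!(SHARED, Float64(i))
end
# With the shared cache, results[i] == i is not guaranteed;
# with one Cache per task it would be.
```

The fix is simply to give each task its own `Cache(0.0)` instead of `SHARED`, which is the per-run-optimizer workaround discussed later in this thread.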

nsiccha avatar May 13 '25 09:05 nsiccha

Ah, good catch. This is indeed a pretty bad bug.

I don't think it's up to us to handle the case where a user-provided optimizer has state (i.e. if a user is calling pathfinder while multi-threading with a stateful optimizer, it's up to them to ensure each thread gets its own optimizer; Optimization itself does nothing to handle this), but we need to ensure that if they don't provide an optimizer, we still do the right thing.

Similarly, if they call multipathfinder with a user-provided stateful optimizer and a multi-threading executor, there's really no way for them to ensure the right thing happens. So IMO we should, by default, deepcopy the user-provided optimizer nruns times in multipathfinder so that each run gets its own copy. (We could offer a flag to turn this off to support uses like https://arxiv.org/abs/2201.11926, where the optimizer keeps track of modes it has encountered, but usage of these with Pathfinder is experimental and not officially supported.)
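A minimal sketch of that default (per_run_optimizers is a hypothetical helper, not multipathfinder's actual implementation):

```julia
# Hypothetical helper: give each of the nruns single-path runs its own
# deep copy of the user-provided optimizer, so no mutable state is shared.
# copy_optimizer=false is the proposed opt-out flag for stateful uses.
function per_run_optimizers(optimizer, nruns; copy_optimizer::Bool=true)
    copy_optimizer || return fill(optimizer, nruns)  # deliberately share one object
    return [deepcopy(optimizer) for _ in 1:nruns]
end
```

Mutating one copy then leaves the others untouched, which is exactly the property the shared default line search violates.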

sethaxen avatar May 13 '25 10:05 sethaxen

I agree 100%. This bug has gone from "affects reproducibility" to "affects correctness".

nsiccha avatar May 13 '25 10:05 nsiccha

Sorry that I haven't gotten around to fixing this yet, will probably only manage to do so next week. In the meantime, the obvious workaround is to construct one optimizer per parallel (single-)pathfinder run, e.g. as

mypathfinder(args...; 
    history_length=6,
    optimizer=Pathfinder.Optim.LBFGS(; 
        m=history_length, 
        linesearch=Pathfinder.LineSearches.HagerZhang(), 
        alphaguess=Pathfinder.LineSearches.InitialHagerZhang()
    ),
    kwargs...
) = pathfinder(
    args...; 
    optimizer, kwargs...
)
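With this wrapper, each parallel task constructs its own optimizer, because Julia evaluates default keyword-argument expressions anew on every call. A hypothetical usage, reusing the names from the reproduction snippet above:

```julia
# Every call to mypathfinder builds a fresh LBFGS (with fresh line-search
# state), so parallel tasks no longer share optimizer internals.
Threads.@threads for i in 1:n_rep
    rng = Xoshiro(1)
    problem = ADgradient(:ForwardDiff, Funnel(dim))
    mypathfinder(problem; rng)
    rv[i] = rng
end
# Once no state is shared, the expectation is that this reports 1 again.
display("Parallel funnel"=>rv |> unique |> length)
```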

nsiccha avatar May 21 '25 12:05 nsiccha