Non-reproducible behaviour of `pathfinder` when run in parallel (even if every task gets passed its own RNG)
As per the title: I have been unable to reproduce some results in a project I'm working on, and it seems to be due to `pathfinder`.
The following code produces output like that shown below:
using Pathfinder, LogDensityProblems, Random, ForwardDiff, LogDensityProblemsAD
begin
    struct StdNormal
        dim::Int64
    end
    LogDensityProblems.dimension(p::StdNormal) = p.dim
    LogDensityProblems.logdensity(::StdNormal, x) = -0.5sum(abs2, x)

    struct Funnel
        dim::Int64
    end
    LogDensityProblems.dimension(p::Funnel) = p.dim
    LogDensityProblems.logdensity(::Funnel, x) = -0.5x[1]^2 - x[1] - 0.5sum(abs2, x[2:end] ./ exp(x[1]))

    n_rep = 100
    dim = 100
    rv = Vector{Any}(missing, n_rep)

    Threads.@threads for i in 1:n_rep
        rng = Xoshiro(1)
        problem = ADgradient(:ForwardDiff, StdNormal(dim))
        pathfinder(problem; rng)
        rv[i] = rng
    end
    display("Parallel normal" => rv |> unique |> length)

    for i in 1:n_rep
        rng = Xoshiro(1)
        problem = ADgradient(:ForwardDiff, Funnel(dim))
        pathfinder(problem; rng)
        rv[i] = rng
    end
    display("Serial funnel" => rv |> unique |> length)

    Threads.@threads for i in 1:n_rep
        rng = Xoshiro(1)
        problem = ADgradient(:ForwardDiff, Funnel(dim))
        pathfinder(problem; rng)
        rv[i] = rng
    end
    display("Parallel funnel" => rv |> unique |> length)
end
"Parallel normal" => 1
"Serial funnel" => 1
┌ Warning: 2 (10.0%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (4.5%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (5.9%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (5.0%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (4.3%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (6.2%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (5.0%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (5.6%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
"Parallel funnel" => 11
with `Pkg.status()` output of
(Pathfinder) pkg> status
Status `~/github/nsiccha/mwe/Pathfinder/Project.toml`
⌃ [f6369f11] ForwardDiff v0.10.38
[6fdf6af0] LogDensityProblems v2.1.2
[996a588d] LogDensityProblemsAD v1.13.0
[b1d3bc72] Pathfinder v0.9.16
Thanks for reporting! It's not obvious to me whether this test should pass. What should hold is that the results of Pathfinder itself are identical whenever the initial RNG state is identical. Are you not seeing that?
If the Pathfinder results do differ, then to isolate the issue: if you replace the `pathfinder` call inside the loop with code that optimizes using `Optim.LBFGS`, does your test pass?
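A minimal version of that isolation test might look like the following (a sketch only, assuming the funnel target from the MWE above and Optim.jl's `optimize`/`LBFGS` API; sizes shrunk for speed):

```julia
using Optim, Random

# Funnel negative log-density, matching the MWE above (up to sign).
negll(x) = 0.5 * x[1]^2 + x[1] + 0.5 * sum(abs2, @view(x[2:end]) ./ exp(x[1]))

n_rep, dim = 8, 5  # kept small here; the original MWE uses 100 and 100
results = Vector{Vector{Float64}}(undef, n_rep)
Threads.@threads for i in 1:n_rep
    rng = Xoshiro(1)                    # identical seed on every task...
    x0 = randn(rng, dim)                # ...hence identical starting point
    opt = optimize(negll, x0, LBFGS())  # fresh LBFGS (and line search) per task
    results[i] = Optim.minimizer(opt)
end
# With task-local optimizer state this should be 1; if the pathfinder loop
# instead yields >1, the divergence comes from shared optimizer state.
length(unique(results))
```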
I think for a simpler MWE I actually did observe the same returned draw(s) but different post-pathfinder RNG states. For the original problem, I'm not sure whether both the draws/distributions and the post-pathfinder RNG states differ, or maybe just the RNG states.
For the above code, but with `rv[i] = pathfinder(problem; rng).draws`, I'm getting very similar output:
"Parallel normal" => 2
"Serial funnel" => 1
┌ Warning: 1 (6.2%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (4.5%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 2 (10.0%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (6.2%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (4.8%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (5.9%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
┌ Warning: 1 (4.5%) updates to the inverse Hessian estimate were rejected to keep it positive definite.
└ @ Pathfinder ~/.julia/packages/Pathfinder/Dppsn/src/singlepath.jl:227
"Parallel funnel" => 30
Let me also check how it goes with only optimization. I do believe the bug must originate somewhere else: I haven't found any loose `rand(...)` calls in Pathfinder.jl, if that makes sense.
Well, I have no idea what's going on, but things diverge (occasionally) at least as early as after the first line search in Optim.jl. As for why, I have no idea 🤷
Looks like it's due to the way the default optimizer reuses global objects.
Sounds like this is very bad. ALL THREADS will repeatedly and concurrently empty and overwrite `DEFAULT_LINE_SEARCH`'s cache.
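To make the hazard concrete, here is a toy sketch (hypothetical names; not Pathfinder's or Optim's actual internals) of the shared-cache pattern versus the safe task-local alternative:

```julia
# Toy stand-in for a line search that keeps a mutable cache.
mutable struct ToyCache
    history::Vector{Float64}
end

# Shared pattern (the hazard): if every task called empty!(shared.history) and
# then push!-ed into the same vector, the contents would depend on thread
# interleaving: a data race, hence non-reproducible (or corrupted) results.

# Task-local pattern (the fix): each task constructs its own cache.
n = 8
out = Vector{Int}(undef, n)
Threads.@threads for i in 1:n
    cache = ToyCache(Float64[])  # fresh per task: nothing shared, no race
    push!(cache.history, float(i))
    out[i] = length(cache.history)
end
all(==(1), out)  # deterministic regardless of scheduling
```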
Ah, good catch. This is indeed a pretty bad bug.
I don't think it's up to us to handle the case where a user-provided optimizer has state (i.e. if a user calls pathfinder while multi-threading with a stateful optimizer, it's up to them to ensure each thread gets its own optimizer; Optimization itself does nothing to handle this), but we need to ensure that if they don't provide an optimizer, we still do the right thing.
Similarly, if they call `multipathfinder` with a user-provided stateful optimizer and a multi-threading executor, there's really no way for them to ensure the right thing happens. So IMO we should by default deepcopy the user-provided optimizer `nruns` times in `multipathfinder` so that each run gets its own copy. (We could offer a flag to turn this off to support uses like https://arxiv.org/abs/2201.11926, where the optimizer keeps track of modes it has encountered, but usage of these with Pathfinder is experimental and not officially supported.)
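That default could be sketched as follows (illustrative helper only; `per_run_optimizers` is not an actual Pathfinder function):

```julia
# Illustrative only: one deep copy of a (possibly stateful) optimizer per run,
# so concurrent single-path runs never share mutable line-search state.
function per_run_optimizers(optimizer, nruns; copy_optimizer=true)
    copy_optimizer ? [deepcopy(optimizer) for _ in 1:nruns] : fill(optimizer, nruns)
end
```

With `copy_optimizer=false` every run would receive the very same object, which is only safe for a serial executor or for optimizers that are deliberately shared, as in the mode-tracking use case mentioned above.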
I agree 100%. This bug has gone from "affects reproducibility" to "affects correctness".
Sorry that I haven't gotten around to fixing this yet; I will probably only manage to do so next week. In the meantime, the obvious workaround is to construct one optimizer per parallel (single-path) pathfinder run, e.g. as
mypathfinder(
    args...;
    history_length=6,
    optimizer=Pathfinder.Optim.LBFGS(;
        m=history_length,
        linesearch=Pathfinder.LineSearches.HagerZhang(),
        alphaguess=Pathfinder.LineSearches.InitialHagerZhang()
    ),
    kwargs...
) = pathfinder(args...; optimizer, kwargs...)