Pathfinder.jl

make default_optimizer thread safe

Open nsiccha opened this issue 7 months ago • 16 comments

Fixes #248 for me, I believe; I haven't actually run or added any tests yet.

Also, maybe this shouldn't be just merged into main? How does this go, @sethaxen?

nsiccha avatar May 13 '25 09:05 nsiccha

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 76.19%. Comparing base (20dd77e) to head (6971c21).

:exclamation: There is a different number of reports uploaded between BASE (20dd77e) and HEAD (6971c21).

HEAD has 148 fewer uploads than BASE:

| Flag | BASE (20dd77e) | HEAD (6971c21) |
| ---- | -------------- | -------------- |
|      | 154            | 6              |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #249      +/-   ##
==========================================
- Coverage   82.08%   76.19%   -5.89%     
==========================================
  Files          13       13              
  Lines         586      584       -2     
==========================================
- Hits          481      445      -36     
- Misses        105      139      +34     

:umbrella: View full report in Codecov by Sentry.


codecov[bot] avatar May 13 '25 09:05 codecov[bot]

I'm guessing this makes a bunch of tests fail which rely on the previous behavior?

nsiccha avatar May 13 '25 09:05 nsiccha

Also, maybe this shouldn't be just merged into main? How does this go, @sethaxen?

Pathfinder follows a continuous deployment model, so yes we'll merge this directly into main and immediately register a release. It's a non-breaking bug fix PR so just add a patch version bump.

sethaxen avatar May 13 '25 10:05 sethaxen

I'm guessing this makes a bunch of tests fail which rely on the previous behavior?

Seems to only be 2 tests: https://github.com/mlcolab/Pathfinder.jl/blob/20dd77e5514c2d5d65a2a5edfb2cf72e1dfb9758/test/singlepath.jl#L34-L35 https://github.com/mlcolab/Pathfinder.jl/blob/20dd77e5514c2d5d65a2a5edfb2cf72e1dfb9758/test/multipath.jl#L38-L39

Not certain if an equality check would work here, but if not, just checking that the optimizer is an LBFGS with the same m, linesearch, and linesearch-init types would be fine.
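
The type-level check could look roughly like this; a toy struct stands in for Optim.LBFGS, and same_config is a hypothetical helper, not the actual test code:

```julia
# Toy sketch: compare optimizer configuration by value and field type rather
# than by object identity. ToyOptimizer stands in for Optim.LBFGS; the real
# check would compare the m, linesearch, and linesearch-init fields.
struct ToyOptimizer{L,I}
    m::Int
    linesearch::L
    alphaguess::I
end

same_config(a, b) =
    a.m == b.m &&
    typeof(a.linesearch) === typeof(b.linesearch) &&
    typeof(a.alphaguess) === typeof(b.alphaguess)

same_config(ToyOptimizer(6, sin, cos), ToyOptimizer(6, sin, cos))  # true
same_config(ToyOptimizer(5, sin, cos), ToyOptimizer(6, sin, cos))  # false
```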

sethaxen avatar May 13 '25 10:05 sethaxen

The docs build failure is unrelated; I'll fix it in a separate PR.

sethaxen avatar May 13 '25 10:05 sethaxen

Ah, okay! Will do. The new test would probably have to check that constructing a default_optimizer before and after a pathfinder run results in identical-in-value but different-in-memory objects, if that makes sense. Will probably add this later today :+1:
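
In Julia terms, that check separates value equality from object identity; a minimal stand-in sketch, with plain vectors in place of the optimizer objects:

```julia
# Two freshly constructed default optimizers should be equal in value but
# not be the same object in memory. Plain vectors stand in for the
# optimizer instances here:
a = [1.0, 2.0]
b = [1.0, 2.0]
a == b    # true:  same values/configuration
a === b   # false: distinct objects, so no shared mutable state
```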

nsiccha avatar May 13 '25 10:05 nsiccha

The new test would probably have to check that constructing a default_optimizer before and after a pathfinder run results in identical-in-value but different-in-memory objects, if that makes sense.

I think that makes sense for testing that this particular approach we're using now works, but it would be even better to have a test that was independent of our default_optimizer but instead directly tested our invariants (things that we should be able to guarantee are true). Here our invariants would be:

  1. if you didn't pass a stateful optimizer/log-density function and if your RNG is thread-safe, then pathfinder should be thread-safe.
  2. multipathfinder with a thread-safe RNG and a multithreading executor should be thread-safe.

The only way I can think of to test thread-safety is with reproducibility.

I'm thinking 2 tests:

  1. Call pathfinder multiple times in a multi-threaded loop with an identically seeded thread-safe RNG for a nontrivial model (maybe the banana) without specifying the optimizer. Verify that the results (e.g. trace, trace gradient, log-density, and draws) are numerically identical (with ==).
  2. Call multipathfinder with a user-constructed LBFGS (so the runs share state), a thread-safe RNG, and a PreferParallel executor. Reseed identically and re-run. The results should be identical.
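
The reproducibility pattern in test 1 could be sketched like this, with a cheap stand-in computation in place of an actual pathfinder call (fake_pathfinder and the seed are illustrative):

```julia
using Random

# Pattern for test 1: each task gets its own, identically seeded RNG and
# runs the same computation; if nothing mutable is shared across tasks,
# every result must be bitwise identical. fake_pathfinder is a cheap
# stand-in for a real pathfinder call on a nontrivial model.
fake_pathfinder(rng) = cumsum(randn(rng, 100))  # stand-in "trace"

function reproducibility_check(n)
    results = Vector{Vector{Float64}}(undef, n)
    Threads.@threads for i in 1:n
        rng = Xoshiro(42)  # same seed in every task
        results[i] = fake_pathfinder(rng)
    end
    return all(r -> r == results[1], results)
end

reproducibility_check(8)  # true: the stand-in shares no state across tasks
```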

We do have reproducibility tests here: https://github.com/mlcolab/Pathfinder.jl/blob/20dd77e5514c2d5d65a2a5edfb2cf72e1dfb9758/test/multipath.jl#L67-L82. My guess is that these are currently passing because the log-density is so trivial that very little time is spent in the linesearch, so the runs don't interfere with each other often.

Will probably add this later today 👍

I really appreciate it! Let me know if you'd like help with any of this.

sethaxen avatar May 13 '25 12:05 sethaxen

My guess is that these are currently passing because the log-density is so trivial that very little time is spent in the linesearch, so the runs don't interfere with each other often.

Right, there was also no issue in my example in the github issue for the parallel standard normal run.

nsiccha avatar May 13 '25 13:05 nsiccha

@nsiccha anything I can do to help out with this PR?

sethaxen avatar Jun 16 '25 09:06 sethaxen

@sethaxen, right, so sorry, I've been a lot busier than expected. I'll try to do it now :)

nsiccha avatar Jul 08 '25 09:07 nsiccha

Ah, I've been trying to construct a test that fails for multipathfinder, but was unable to. The reason is that you already catch the multipathfinder + stateful-optimizer case; see https://github.com/mlcolab/Pathfinder.jl/blob/f4ca90dc3d91f077f479d13904a2b6bf99e8ee25/src/multipath.jl#L181-L182

Which is probably why few if any other people have run into this issue...

I'll still finish the changes, but will only add a new test for a parallel pathfinder call without an explicitly passed optimizer, alright, @sethaxen?

nsiccha avatar Jul 08 '25 14:07 nsiccha

Ah, I've been trying to construct a test that fails for multipathfinder, but was unable to. The reason is that you already catch the multipathfinder + stateful-optimizer case; see

Ah, yes, it seems we do catch that case! Okay good, this bug wasn't as severe as it initially looked.

I'll still finish the changes, but will only add a new test for a parallel pathfinder call without an explicitly passed optimizer, alright, @sethaxen?

Yes, that sounds like the right approach. Thanks!

sethaxen avatar Jul 14 '25 08:07 sethaxen

this bug wasn't as severe as it initially looked.

Yeah, indeed. In retrospect, if it had affected everyone, I guess it would have been discovered earlier.

BTW, the reason I was even using the parallel (single) pathfinder approach was that I wanted to initialize several chains for the same posterior in parallel, but I did not want all eventual initialization points to come from a single approximation, which AFAICT inevitably happens in high dimensions with importance resampling. I wanted to do something slightly more clever, but for that I'd have needed to be able to match each draw to the approximation that generated it, and AFAICT that wasn't possible with multipathfinder. I guess in the end I could have used the fit_distributions from the PathfinderResult; I'm unsure why I didn't. Or actually, IIRC the multipathfinder method was for some reason much slower than my parallel pathfinder implementation?

I'm unsure, I eventually just stuck with the simple parallel pathfinder approach, which worked well for me (except for the race condition).

nsiccha avatar Jul 14 '25 08:07 nsiccha

BTW, the reason I was even using the parallel (single) pathfinder approach was that I wanted to initialize several chains for the same posterior in parallel, but I did not want all eventual initialization points to come from a single approximation, which AFAICT inevitably happens in high dimensions with importance resampling. I wanted to do something slightly more clever, but for that I'd have needed to be able to match each draw to the approximation that generated it, and AFAICT that wasn't possible with multipathfinder

The multi-pathfinder result object stores the individual single-path results, each of which stores the draws that path generated, so you can always access those or, as you said, use the fit distribution for each path. You can also disable importance resampling with importance=false. It still resamples with replacement but does not use importance weights to do so. The draw_component_ids field then stores, for each draw in draws, the index of the single-path run that generated that specific draw.
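
A toy sketch of that bookkeeping, with plain vectors standing in for the per-path draws (resample_with_ids is illustrative, not Pathfinder API):

```julia
using Random

# Toy version of the bookkeeping described above: resample draws uniformly
# across several "paths" (no importance weights, with replacement) while
# recording which path produced each resampled draw, analogous to the
# draw_component_ids field.
function resample_with_ids(rng, per_path_draws, ndraws)
    npaths = length(per_path_draws)
    ids = rand(rng, 1:npaths, ndraws)                   # uniform over paths
    draws = [rand(rng, per_path_draws[i]) for i in ids] # draw from that path
    return draws, ids
end

paths = [randn(Xoshiro(i), 10) for i in 1:4]  # 4 "paths" of 10 draws each
draws, ids = resample_with_ids(Xoshiro(0), paths, 20)
```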

sethaxen avatar Jul 24 '25 08:07 sethaxen

Makes sense! I was in the end also affected by https://github.com/julia-vscode/julia-vscode/issues/3853, but wasn't aware at the time.

I also wrapped the (single) pathfinder calls in another loop, retrying pathfinder until NUTS initialization worked for the returned (initialization) draw, which mainly means until the gradient evaluation does not error.

Maybe this is actually something that Pathfinder should (optionally) check? That the log density (gradient) can be evaluated for the returned draws?
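
Such a retry wrapper might look roughly like this; get_draw and the log-density function here are stand-ins, not Pathfinder API:

```julia
# Sketch of the retry wrapper described above: keep drawing candidate
# initializations until the log density and gradient evaluate without
# error and are finite.
function retry_until_valid(get_draw, logdensity_and_gradient; max_tries=100)
    for _ in 1:max_tries
        x = get_draw()
        lp, grad = try
            logdensity_and_gradient(x)
        catch
            continue  # evaluation errored; draw again
        end
        isfinite(lp) && all(isfinite, grad) && return x
    end
    error("no valid initialization found after $max_tries tries")
end

# Toy usage: a standard-normal log density that errors in a "bad" region.
x0 = retry_until_valid(() -> randn(2),
                       x -> x[1] < 0 ? error("bad region") : (-sum(abs2, x) / 2, -x))
```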

nsiccha avatar Jul 24 '25 08:07 nsiccha

Maybe this is actually something that Pathfinder should (optionally) check? That the log density (gradient) can be evaluated for the returned draws?

Oh that's interesting, can you open an issue for this feature?

sethaxen avatar Jul 24 '25 09:07 sethaxen