root icon indicating copy to clipboard operation
root copied to clipboard

[RF] LikelihoodJob improved precision

Open egpbos opened this issue 2 years ago • 16 comments

This Pull request:

Changes or fixes:

Triggered by a failure to run an ATLAS Higgs combination fit using the MultiProcess-parallelized LikelihoodJob, this PR changes the way offsetting is handled in the TestStatistics::LikelihoodWrapper family of classes. This in turn improves precision of evaluations, making many parallelized likelihood evaluations now bit-wise exactly equal to non-parallel evaluations. This allows us to better debug the Higgs fit. After this PR, the public Higgs workspace used in rootbench can be fit with LikelihoodJob enabled.

In detail, this PR changes the following:

  • Increased precision:

    • Per-component offsets: instead of one offset for the total LikelihoodWrapper, we switched to a vector of offsets: one for each likelihood component. This makes a difference only for RooSumL fits, i.e. simultaneous PDF fits or fits with constraint or global observable terms. This brings the results of these fits closer to the old-style RooNLLVar fits, because those also use per-component offsets (per-RooNLLVar in a RooAddition to be exact).
    • In LikelihoodJob::evaluate, the result_ KahanSum is no longer initialized to zero, but is initialized to the first value in the results_ array, both sum and carry term. This sometimes makes a difference: adding a term with a small but non-zero carry term to an existing sum with a zero sum and zero carry term can make the small non-zero carry term disappear.
    • Due to these changes and the earlier KahanSum updates, we were able to tighten the tolerance of tests in testLikelihoodSerial, testLikelihoodJob and testLikelihoodGradientJob, with many tests now passing EXPECT_EQ.
    • testLikelihoodGradientJob adds offsetting to the parameterized test matrices of the LikelihoodGradientJobTest cases to test all the above (and below) changes.
  • Offset synchronization:

    • LikelihoodWrapper and LikelihoodGradientWrapper now store a shared_ptr to the offsets instead of raw offsets. At construction time within a MinuitFcnGrad, they get passed the same single offset object so that it is always kept synced between the different likelihood calculators. If it isn't synced and the likelihood gets different offsets at different times, the minimizer understandably gets very confused. This was the case before this commit, which was, in fact, a bug.
    • For LikelihoodJob and LikelihoodGradientJob, the offsets are also synchronized to all workers via update_state. For this, we simply check before evaluation whether offsets have changed since the previous evaluation and if so they are sent over the MultiProcess::Messenger. Note that while the LikelihoodGradientJob doesn't itself use the offset (it doesn't calculate the likelihood), it must still synchronize offsets, because during the gradient calculation the LikelihoodSerial objects on the workers are used to calculate the likelihoods there, so for them the offsets must be up to date.
    • The LikelihoodJob contains a LikelihoodSerial member as well now. This allows the LikelihoodJob to trigger calculating the offsets on the master process before sending them to the workers.
    • LikelihoodWrapper has some added private helper functions for managing offsets.
  • Other miscellaneous changes:

    • LikelihoodWrapper::setApplyWeightSquared was implemented properly for RooSumL likelihoods as well, passing along the call to component RooUnbinnedLs. Note, however, that it is currently not reachable for users because there is no interface to pass this along from the RooMinimizer, which contains the MinuitFcnGrad, which contains the LikelihoodWrapper. A comment in MinuitFcnGrad.h refers to this, reminding future devs to also flip offsets_reset_ when (un)setting squared weights.
    • LikelihoodWrapper now holds the likelihood_type. This cleans up some code duplication in LikelihoodSerial and LikelihoodJob, which previously used the type only in their ctors, and avoids dynamic_casts later, e.g. on when calculating offsets.
    • A RooSubsidiaryL can now also be computed with LikelihoodSerial and LikelihoodJob; this case was still missing in their evaluation functions.
    • The LikelihoodSerial, LikelihoodJob and LikelihoodGradientJob "ConstrainedAndOffset" test cases used the wrong argument for the constrained parameter. These are now corrected from alpha_bkg_obs_A to become alpha_bkg_A.
    • In LikelihoodJobTest, the added test case "UnbinnedGaussian1DSelectedParameterValues" shows where splitting over events can lead to bit-wise differences. This test will be useful in the future if further precision enhancements are made.

Checklist:

  • [x] tested changes locally
  • [ ] updated the docs (if necessary)

egpbos avatar Feb 01 '23 14:02 egpbos

Hi @egpbos, thanks for the PR! It looks good, but can you please update such that is passes on all CI nodes? Thanks!

guitargeek avatar Feb 14 '23 11:02 guitargeek

Yes, will do. I have been busy fixing a bug and with some other obligations. I will get back to this soon.

egpbos avatar Feb 14 '23 12:02 egpbos

I have rebased this commit on master of last Friday. Let's see whether it now passes CI. @guitargeek is it ok with you to postpone the changes from your review comment to a separate issue? It seems like a cleanup task not directly related to the purpose of this PR.

egpbos avatar Oct 11 '23 10:10 egpbos

Test Results

    10 files      10 suites   1d 18h 59m 5s :stopwatch:  2 631 tests  2 631 :white_check_mark: 0 :zzz: 0 :x: 24 812 runs  24 812 :white_check_mark: 0 :zzz: 0 :x:

Results for commit 4e3f16bd.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Oct 11 '23 10:10 github-actions[bot]

@phsft-bot build

egpbos avatar Apr 22 '24 16:04 egpbos

Hi @egpbos, thank you so much for picking this up again!

Your @phsft-bot build command will not do anything though :slightly_smiling_face: We have retired the Jenkins CI and the bot.

Our new CI is ran automatically, and you can check it's status at the bottom of the PR thread.

Once are platforms are finished running, the GitHub actions bot will also post a comment with a link to a summary, where you can see at one glance which tests fail on which platform and how.

guitargeek avatar Apr 22 '24 19:04 guitargeek

I figured as much, old habits :)

egpbos avatar Apr 22 '24 19:04 egpbos

I think this can finally be reviewed, CI was passing (just removed a stray printf in the last force push).

egpbos avatar Apr 23 '24 20:04 egpbos

The latest commit changes only formatting and now all CI checks pass (see the CI checks of the previous commit).

egpbos avatar Apr 24 '24 07:04 egpbos

The ROOT CI / mac14 failure is not due to this PR (see the successful build in the previous commit).

egpbos avatar Apr 24 '24 08:04 egpbos

@guitargeek While addressing your comments, I started thinking that there's perhaps a better design.

  1. I don't think we should have a default parameter at all, but only one ctor with the offset parameters. In real life current cases, these classes are always used from MinuitFcnGrad and they always share offsets with one or two other Likelihood* objects. The only cases where no shared offsets are necessary is in unit tests, but in those cases it's easy to just create a dummy offset vector.
  2. All the offset management is distracting a bit from the core Likelihood logic. I think it may be nicer to make a simple wrapper class LikelihoodOffset that holds the shared_ptrs to the vectors, does the swapping, and a few other things. This also makes it easier to create, because the unwieldy types that you also rightly point out would then only be contained within LikelihoodOffset.

I will start working on point 1 at least, and in the meantime will consider (and allow you to comment on ;) ) point 2.

egpbos avatar May 02 '24 19:05 egpbos

I have also started on point 2 now.

egpbos avatar May 03 '24 20:05 egpbos

Rebased on master to make the CI pass on clang-format (✅) and mac14 which also mysteriously fails (I think it's not due to this PR, but still).

egpbos avatar May 04 '24 14:05 egpbos

Failing test on ubuntu20 (1697:roottest-python-numba-numba) is unrelated to this PR.

The mac14 run already fails on configuration (some LibXml2 path failure). I don't see how this could be caused by this PR, but it's strange that this failure doesn't seem to be present in other PRs...

egpbos avatar May 05 '24 10:05 egpbos

Ah, I see this PR also had the LibXml2 failure in the mac14 workflow: https://github.com/root-project/root/actions/runs/8956800442/job/24599041512

So indeed, CI failures unrelated to this PR.

egpbos avatar May 05 '24 10:05 egpbos

I overlooked the review comment of Feb last year above. I accidentally almost already addressed it by removing all cloning and copying support. The only thing left was to remove the comments, which I now did as well. I amended the previous commit with that. The only difference with before is the removed comments, so behaviorally no change.

egpbos avatar May 05 '24 11:05 egpbos