
[hist] Improve precision of TAxis::FindFixBin / FindBin.

Open hageboeck opened this issue 6 months ago • 8 comments

The bin computation in TAxis can suffer from floating-point uncertainties, returning bins such that this assertion breaks:

int bin = axis.FindBin(x);
assert(axis.GetBinLowEdge(bin) <= x && x < axis.GetBinUpEdge(bin));

This is of course surprising for users, and it sometimes sends data into the overflow bin even though the values fall within the valid range of the axis. This PR adds two correction terms that fix these floating-point errors.
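To picture the failure mode and the kind of fix, here is a minimal self-contained sketch, not the code of this PR: LinearAxis is a made-up stand-in for a fixed-bin TAxis, the naive formula is the usual linear lookup, and the repair shown is a post-hoc re-check against the computed edges. The PR itself instead folds correction terms into the formula, which turned out to be more robust across the included tests.

struct LinearAxis {
   int nbins;
   double xmin, xmax;

   double BinLowEdge(int bin) const { return xmin + (bin - 1) * (xmax - xmin) / nbins; }
   double BinUpEdge(int bin) const { return xmin + bin * (xmax - xmin) / nbins; }

   int FindFixBin(double x) const
   {
      if (x < xmin)
         return 0;          // underflow
      if (x >= xmax)
         return nbins + 1;  // overflow (x == xmax counts as overflow, as in ROOT)
      int bin = 1 + static_cast<int>(nbins * (x - xmin) / (xmax - xmin));
      // The naive formula can be off by one when x sits right at a bin edge.
      // Shift the result so that BinLowEdge(bin) <= x < BinUpEdge(bin) holds.
      if (x < BinLowEdge(bin))
         --bin;
      else if (x >= BinUpEdge(bin))
         ++bin;
      return bin;
   }
};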

On a recent AMD CPU, there is no runtime difference, since the latency of the virtual call overshadows the additional computations for the correction. For 10M calls, running 100 times with perf, we arrive at:

             Uncorrected          Corrected
Instructions 250M                 360M
Cycles       216M                 218M
Run time     0.0495 +- 0.0003 s   0.0493 +- 0.0002 s
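As a hedged sketch of how such a measurement could be reproduced (the axis parameters, random range, and harness below are my own choices, not the setup used for the numbers above):

// bench_findbin.cxx: illustrative micro-benchmark, not the one behind the table above.
#include <TAxis.h>
#include <TRandom3.h>

int main()
{
   TAxis axis(1000, -3.0, 3.0); // arbitrary example axis
   TRandom3 rng(42);
   long long sum = 0;           // accumulate the results so the loop is not optimised away
   for (int i = 0; i < 10000000; ++i)
      sum += axis.FindBin(rng.Uniform(-3.5, 3.5));
   return sum == 0;             // use sum so the compiler cannot drop the loop
}

Built against ROOT, e.g. with g++ bench_findbin.cxx $(root-config --cflags --libs), the instruction and cycle counts can then be averaged over 100 runs with perf stat -r 100 ./a.out.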

Note: Several attempts at reordering the equations for better precision failed; there were always a few of the included test cases that didn't pass, so applying the correction seems to be the best solution.

This PR fixes the issues brought up here:

  • https://root-forum.cern.ch/t/floating-point-rounding-error-when-filling-the-histogram/35189
  • https://root-forum.cern.ch/t/bug-or-feature-in-ttree-draw/62862/16
  • 1703c54
  • #14091

hageboeck avatar Jun 13 '25 12:06 hageboeck

Thanks a lot Stephan! Some questions from my side:

  • Does this supersede https://github.com/root-project/root/pull/14105 ? Maybe it can be closed.
  • Does this fix https://github.com/root-project/root/issues/14091 ? Maybe it can be linked.
  • Does this new implementation affect runtime performance when filling a histogram? If yes, maybe it's better to have a separate function FindBinHighRes or similar. One could even think of using long double.
  • Is it certain that the resulting bin is only off by one? I can imagine (probably very weird) cases where the limited floating-point precision makes it off by two.

ferdymercury avatar Jun 13 '25 13:06 ferdymercury

Thanks a lot Stephan! Some questions from my side:

Good questions, I'm checking! 🙂

hageboeck avatar Jun 13 '25 14:06 hageboeck

Test Results

20 files  20 suites   3d 18h 31m 13s ⏱️  3 016 tests  3 015 ✅  0 💤  1 ❌  58 678 runs  58 677 ✅  0 💤  1 ❌

For more details on these failures, see this check.

Results for commit dbe40974.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jun 13 '25 17:06 github-actions[bot]

  • Does this supersede https://github.com/root-project/root/pull/14105 ? Maybe it can be closed.

It can be closed, indeed.

  • Does this fix https://github.com/root-project/root/issues/14091 ? Maybe it can be linked.

It fixes the problem described there. I linked the issue.

  • Does this new implementation affect runtime performance when filling a histogram? If yes, maybe it's better to have a separate function FindBinHighRes or similar. One could even think of using long double.

It does not affect run time on a modern CPU, even though more instructions are issued. When you run the bare code to find the bin (not as part of a virtual function), you have the following behaviour:

             Uncorrected   Corrected
Instructions 8             19
Cycles       33            31

It doesn't matter how many instructions are issued. What's interesting is how many cycles they need to complete. Although that depends on the CPU in use, you see that the two versions are very close.

The killer argument, however, is that all of this happens inside a virtual function, which of course has its own latencies. Plugging the code into a virtual function and benchmarking 10M calls, we arrive at:

             Uncorrected          Corrected
Instructions 250M                 360M
Cycles       216M                 218M
Run time     0.0495 +- 0.0003 s   0.0493 +- 0.0002 s

So you see that more instructions are issued, but because the CPU is superscalar, these additional instructions have no noteworthy effect on the run time: the correction terms can be evaluated in parallel with the original computation.

This means that in TAxis, where the bin lookup is part of a virtual function, we should not see a run-time difference.

  • Is it certain that the resulting bin is only off by one? I can imagine (probably very weird) cases where the limited floating-point precision makes it off by two.

I did not manage to provoke a difference of more than one bin. I didn't set out to prove it mathematically, but I had a lot of difficulty even crafting a case that made an "off by -1" error.

hageboeck avatar Jun 17 '25 12:06 hageboeck

Awesome analysis, thanks a lot!! Does this mean that I can close https://github.com/root-project/root/pull/17689 if you implement the same strategy in that part of the code? And I'm not sure if you want to review this alleviation strategy: https://github.com/root-project/root/pull/17896

ferdymercury avatar Jun 17 '25 12:06 ferdymercury

Btw, this PR reminds me of this issue: https://its.cern.ch/jira/browse/ROOT-10179

ferdymercury avatar Jun 17 '25 13:06 ferdymercury

Awesome analysis, thanks a lot!! Does this mean that I can close #17689 if you implement the same strategy in that part of the code?

@ferdymercury I think for this PR, the x == max --> overflow problem remains. So you could probably just use:

- max = currentMax;
+ max = std::nextafter(currentMax, inf);

but that's to be tested.
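For completeness, a standalone sketch of what the std::nextafter trick buys (plain C++, not ROOT code): widening the stored maximum by one ULP means a value exactly equal to the user-supplied maximum no longer trips an x >= max overflow test.

#include <cmath>
#include <cstdio>
#include <limits>

int main()
{
   const double currentMax = 3.0; // example user-requested axis maximum
   const double max = std::nextafter(currentMax, std::numeric_limits<double>::infinity());

   // max is the smallest double strictly greater than currentMax, so an entry
   // at x == currentMax now satisfies x < max and stays inside the last bin.
   std::printf("max - currentMax = %g\n", max - currentMax); // one ULP, ~4.4e-16 at 3.0
   std::printf("currentMax < max : %d\n", currentMax < max); // prints 1
   return 0;
}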

Btw, this PR reminds me of this issue: https://its.cern.ch/jira/browse/ROOT-10179

Was this the correct link?

hageboeck avatar Jun 17 '25 14:06 hageboeck

Was this the correct link?

It was about the question of whether performance was impacted. Maybe more the sub-issue https://its.cern.ch/jira/browse/ROOT-10185, which would have been easier for you if rootbench could be triggered from a PR.

max = std::nextafter(currentMax, inf);

I had tried that and it didn't work. I really had to add xmax = std::max(xmax + 1e-15*(xmax - xmin), std::nextafter(xmax,INFINITY));

For most cases, the 1e-15*(xmax - xmin) term is what really counts. Or do you mean that it should be fine with just nextafter, 'after' your PR gets merged?
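To illustrate why the relative term usually dominates (the axis limits below are arbitrary examples, not taken from either PR): one ULP of xmax is typically an order of magnitude smaller than 1e-15*(xmax - xmin), so the std::max picks the relative widening.

#include <algorithm>
#include <cmath>
#include <cstdio>

int main()
{
   const double xmin = 0.0, xmax = 10.0; // illustrative axis limits

   const double ulpStep = std::nextafter(xmax, INFINITY) - xmax; // ~1.8e-15 at 10.0
   const double relStep = 1e-15 * (xmax - xmin);                 // 1e-14
   const double widened = std::max(xmax + relStep, std::nextafter(xmax, INFINITY));

   std::printf("one ULP of xmax      : %g\n", ulpStep);
   std::printf("1e-15 * (xmax - xmin): %g\n", relStep);
   std::printf("widened xmax - xmax  : %g\n", widened - xmax); // the relative term wins here
   return 0;
}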

Related: https://github.com/root-project/root/pull/17896

ferdymercury avatar Jun 17 '25 14:06 ferdymercury

Btw, this PR reminds me of this issue: https://its.cern.ch/jira/browse/ROOT-10179

Yes, well remembered @ferdymercury. RootBench should first of all be revived and, as a start, regularly built and run nightly as part of the battery of actions. After that is achieved, we can decide whether the timing resolution we have on the current CI machines is enough.

dpiparo avatar Jun 21 '25 07:06 dpiparo

Awesome analysis, thanks a lot!! Does this mean that I can close #17689 if you implement the same strategy in that part of the code ?

@ferdymercury it looks like your change is still required. See this test (without your PR):

TH1F histo("histo", "Histo", 100, 0, 0);      // xmin == xmax requests automatic binning
histo.Fill(-999); histo.Fill(0);              // the buffered fills define the axis range
histo.GetXaxis()->GetBinLowEdge(101) == 0.;   // Overflow bin starts at 0., so the entry at 0 ends up in overflow!

hageboeck avatar Jun 23 '25 13:06 hageboeck

Right, that's because, as I said in Lorenzo's PR, functions like GetBinLowEdge() have separate implementations:

https://github.com/root-project/root/blob/master/hist/hist/src/TAxis.cxx#L522

Now we are in the situation that some functions (TAxis::FindBin) have this precision optimization with this PR, and others don't. So ROOT is a bit inconsistent with itself now :slightly_smiling_face: That's the reason why I didn't merge Lorenzo's PR: I thought letting the precision of TAxis::FindBin() get out of sync with other algorithms in ROOT was too risky.

But it's been some time since I looked at this and I don't remember the details now :slightly_smiling_face: I just wanted to point out the possible problems as well as I could with the memory I had.

guitargeek avatar Jun 23 '25 13:06 guitargeek

Right, that's because, as I said in Lorenzo's PR, functions like GetBinLowEdge() have separate implementations:

https://github.com/root-project/root/blob/master/hist/hist/src/TAxis.cxx#L522

Now we are in the situation that some functions (TAxis::FindBin) have this precision optimization with this PR, and others don't. So ROOT is a bit inconsistent with itself now 🙂 That's the reason why I didn't merge Lorenzo's PR: I thought letting the precision of TAxis::FindBin() get out of sync with other algorithms in ROOT was too risky.

No, actually I don't think that's what happened in this PR. Only now are GetBinLowEdge and FindBin in sync! 😄 The trick about this change is that we compute the bin in the same old way (with floating-point uncertainties), and the new terms ensure consistency with what GetBinLowEdge returns. So I would argue that now, for the first time in 25 years, TAxis is consistent. The test that I included is supposed to reflect this: the hand-written parts are engineered using GetBinLowEdge, and the externally provided examples frequently compare to GetBinLowEdge.
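As a hedged illustration of that consistency (the axis parameters and probe values below are mine, not the PR's actual test cases), every in-range value should land in a bin whose edges, as reported by GetBinLowEdge/GetBinUpEdge, bracket it:

#include <TAxis.h>
#include <cassert>
#include <cmath>

// Check that FindBin and the edge getters agree for one probe value.
void checkConsistency(TAxis &axis, double x)
{
   const int bin = axis.FindBin(x);
   if (bin >= 1 && bin <= axis.GetNbins()) { // skip under-/overflow
      assert(axis.GetBinLowEdge(bin) <= x);
      assert(x < axis.GetBinUpEdge(bin));
   }
}

void runChecks()
{
   TAxis axis(100, -1.0, 1.0); // example axis
   for (int i = 1; i <= axis.GetNbins() + 1; ++i) {
      const double edge = axis.GetBinLowEdge(i);           // probe values engineered from the edges
      checkConsistency(axis, edge);
      checkConsistency(axis, std::nextafter(edge, -2.0));  // just below the edge
      checkConsistency(axis, std::nextafter(edge, +2.0));  // just above the edge
   }
}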

What the initial comment referred to is a deficiency of the auto-binning feature when you don't give any axis limits. This needs a bit more work, and ferdy was on to it.

hageboeck avatar Jun 23 '25 14:06 hageboeck