[hist] Improve precision of TAxis::FindFixBin / FindBin.
The bin computation in TAxis can suffer from floating-point uncertainties, returning bins such that this assertion breaks:
int bin = axis.FindBin(x);
assert(axis.GetBinLowEdge(bin) <= x && x < axis.GetBinUpEdge(bin));
This is of course surprising for users, and sometimes brings data into the overflow, although they fall into the valid range of the axis. Here, two terms are added, which fix these floating-point errors.
On a recent AMD CPU, there is no runtime difference, since the latency of the virtual call overshadows the additional computations for the correction. For 10M calls, running 100 times with perf, we arrive at:
| Uncorrected | Corrected | |
|---|---|---|
| Instructions | 250M | 360M |
| cycles | 216M | 218M |
| Run time | 0.0495 +- 0.0003 s | 0.0493 +- 0.0002 s |
Note: Several attempts at reordering the equations for better precision failed. There were always a few test cases in the included tests that didn't pass, so applying the correction looks to be the best solution.
This PR fixes the issues brought up here:
- https://root-forum.cern.ch/t/floating-point-rounding-error-when-filling-the-histogram/35189
- https://root-forum.cern.ch/t/bug-or-feature-in-ttree-draw/62862/16
- 1703c54
- #14091
Thanks a lot Stephan! Some questions from my side:
- Does this supersede https://github.com/root-project/root/pull/14105 ? Maybe it can be closed.
- Does this fix https://github.com/root-project/root/issues/14091 ? Maybe it can be linked.
- Does this new implementation affect runtime performance when filling a histogram? If yes, maybe it's better to have a separate function FindBinHighRes or so. One could even think of using long-double.
- Is it sure that the resulting bin is only off by one? I can imagine of (probably very weird cases) where the floating point limited precision might be two?
Thanks a lot Stephan! Some questions from my side:
- Does this supersede [hist] Improve calculation of
TAxis::FindBin(x)when x is at the bin edgesΒ #14105 ? Maybe it can be closed.- Does this fix Bug in TAxis::FindBin (Double_t x) ?Β #14091 ? Maybe it can be linked.
- Does this new implementation affect runtime performance when filling a histogram? If yes, maybe it's better to have a separate function FindBinHighRes or so. One could even think of using long-double.
- Is it sure that the resulting bin is only off by one? I can imagine of (probably very weird cases) where the floating point limited precision might be two?
Good questions, I'm checking! π
Test Results
ββββ20 filesββββββ20 suitesβββ3d 18h 31m 13s β±οΈ β3β016 testsββ3β015 β β0 π€β1 β 58β678 runsββ58β677 β β0 π€β1 β
For more details on these failures, see this check.
Results for commit dbe40974.
:recycle: This comment has been updated with latest results.
- Does this supersede [hist] Improve calculation of
TAxis::FindBin(x)when x is at the bin edgesΒ #14105 ? Maybe it can be closed.
It can be closed, indeed.
- Does this fix Bug in TAxis::FindBin (Double_t x) ?Β #14091 ? Maybe it can be linked.
It fixes the problem described there. I linked the issue.
- Does this new implementation affect runtime performance when filling a histogram? If yes, maybe it's better to have a separate function FindBinHighRes or so. One could even think of using long-double.
It does not affect run time on a modern CPU, even though more instructions are issued. When you run the bare code to find the bin (not as part of a virtual function), you have the following behaviour:
| Uncorrected | Corrected | |
|---|---|---|
| Instructions | 8 | 19 |
| Cycles | 33 | 31 |
It doesn't matter how many instructions are issued. What's interesting is how many cycles they need to complete. Although that depends on the CPU in use, you see that the two versions are very close.
The killer argument, however, is that all of this happens inside a virtual function, and that of course has its own latencies. When plugging the codes into a virtual function, and benchmarking 10M calls, we arrive at:
| Uncorrected | Corrected | |
|---|---|---|
| Instructions | 250M | 360M |
| cycles | 216M | 218M |
| Run time | 0.0495 +- 0.0003 s | 0.0493 +- 0.0002 s |
So you see that more instructions are issued, but due to the CPU being a superscalar CPU, these additional instructions have no noteworthy effect on the run time, because the correction terms can be evaluated in parallel to the original computation.
This means, that in TAxis, where it is part of a virtual function, we should not see a run time difference.
- Is it sure that the resulting bin is only off by one? I can imagine of (probably very weird cases) where the floating point limited precision might be two?
I did not manage to provoke a more-than-one difference. I didn't set out to prove it mathematically, but I had lots of difficulties crafting a case that made an "off by -1" error.
Awesome analysis, thanks a lot!! Does this mean that I can close https://github.com/root-project/root/pull/17689 if you implement the same strategy in that part of the code ? And not sure if you want to review this alleviation strategy: https://github.com/root-project/root/pull/17896
Btw, this PR reminds me of this issue: https://its.cern.ch/jira/browse/ROOT-10179
Awesome analysis, thanks a lot!! Does this mean that I can close #17689 if you implement the same strategy in that part of the code ?
@ferdymercury I think for this PR, the x == max --> overflow problem remains. So you could probably just use:
- max = currentMax;
+ max = std::nextafter(currentMax, inf);
but that's to be tested.
Btw, this PR reminds me of this issue: https://its.cern.ch/jira/browse/ROOT-10179
Was this the correct link?
Was this the correct link?
It was about the question/fact of whether performance was impacted. Maybe more the subissue https://its.cern.ch/jira/browse/ROOT-10185 which would have been easier for you if rootbench could be triggered from a PR.
max = std::nextafter(currentMax, inf);
I had tried that and it didn't work. I really had to add
xmax = std::max(xmax + 1e-15*(xmax - xmin), std::nextafter(xmax,INFINITY));
For most cases, the 1e-15*max_width is what really counts. Or do you mean that that should be fine with just nextafter, 'after' your PR gets merged?
Related: https://github.com/root-project/root/pull/17896
Btw, this PR reminds me of this issue: https://its.cern.ch/jira/browse/ROOT-10179
Yes - well remembered @ferdymercury . RootBench should first of all be revived and regularly built and run as part of the battery of actions, as a start, nightly. After that is achieved, we can decide whether the timing resolution we have on the current machines used for the CI is enough.
Awesome analysis, thanks a lot!! Does this mean that I can close #17689 if you implement the same strategy in that part of the code ?
@ferdymercury it looks like your change is still required. See this test (without your PR):
TH1F histo("histo", "Histo", 100, 0, 0);
histo.Fill(-999); histo.Fill(0);
histo.GetXaxis()->GetBinLowEdge(101) == 0.; // Overflow bin starts at 0.!
Right, that's because as I said in Lorenzos PR, the functions like GetBinLowEdge() have separate implementations:
https://github.com/root-project/root/blob/master/hist/hist/src/TAxis.cxx#L522
Now we are in the situation that some functions (TAxis::FindBin) have this precision optimization with this PR, and others don't. So ROOT is a bit self consistent now :slightly_smiling_face: That's the reason why I didn't merge Lorenzos PR. I thought letting the precision of of TAxis::FindBin() get out of sync with other algorithms in ROOT was too risky.
But it's some time I looked at this and I don't remember the details now :slightly_smiling_face: I just wanted to point out the possible problems as good as I could with the memory I had.
Right, that's because as I said in Lorenzos PR, the functions like
GetBinLowEdge()have separate implementations:https://github.com/root-project/root/blob/master/hist/hist/src/TAxis.cxx#L522
Now we are in the situation that some functions (
TAxis::FindBin) have this precision optimization with this PR, and others don't. So ROOT is a bit self consistent now π That's the reason why I didn't merge Lorenzos PR. I thought letting the precision of ofTAxis::FindBin()get out of sync with other algorithms in ROOT was too risky.
No, actually I don't think that's what happened in this PR. Only now, the GetBinLowEdge and FindBin are in sync! π
The trick about this change is that we compute the bin in the same old way (with fp uncertainties), and the new terms ensure consistency with what GetBinLowEdge returns. So I would argue that now, for the first time in 25 years, TAxis is consistent.
The test that I included is supposed to reflect this. The hand-written parts are engineered using GetBinLowEdge, and the externally provided examples frequently compare to GetBinLowEdge.
What the initial comment referred to is a deficiency of the auto-binning feature when you don't give any axis limits. This needs a bit more work, and ferdy was on to it.