Re-order terms in division to avoid periodic decimal and rounding issues in Centroid#add, fixing out-of-order centroids
Closes #169
Hi @tdunning :wave:
I thought this issue was interesting and would allow me to understand more about the t-digest. Tweaking the values used in the tests, I found that changing the compression to 5 (in the factory() method) and using this would always reproduce the bug.
public void testSorted2() {
final TDigest digest = factory().create();
((AVLTreeDigest)digest).gen.setSeed(84);
int[] weights = new int[] { 7 };
double[] xs = new double[] {
0.8046947099707735 };
assertEquals(weights.length, xs.length);
for (int i = 0; i < weights.length; ++i) {
int w = weights[i];
double x = xs[i];
for (int j = 0; j < w; j++) {
digest.add(x);
}
}
Centroid previous = null;
for (Centroid centroid : digest.centroids()) {
if (previous != null) {
assertTrue(previous.mean() <= centroid.mean());
}
previous = centroid;
}
System.out.println("OK!");
}
Debugging a little more, I realized the issue was happening when the count[] array was updated, and the value of a centroid became 3 (search the line with count = 3, note how the value is out of order).

A little more debugging and I think I found what was happening. Apparently the Centroid#add method could end up having to calculate this centroid += w * (x - centroid) / count;, which would not always return the expected value (see the third/sixth/eleventh lines)
1 * (0.8046947099707735 - 0.0) / 1 = 0.8046947099707735
2 * (0.8046947099707735 - 0.0) / 2 = 0.8046947099707735
3 * (0.8046947099707735 - 0.0) / 3 = 0.8046947099707736
4 * (0.8046947099707735 - 0.0) / 4 = 0.8046947099707735
5 * (0.8046947099707735 - 0.0) / 5 = 0.8046947099707735
6 * (0.8046947099707735 - 0.0) / 6 = 0.8046947099707736
7 * (0.8046947099707735 - 0.0) / 7 = 0.8046947099707735
8 * (0.8046947099707735 - 0.0) / 8 = 0.8046947099707735
9 * (0.8046947099707735 - 0.0) / 9 = 0.8046947099707735
10 * (0.8046947099707735 - 0.0) / 10 = 0.8046947099707735
11 * (0.8046947099707735 - 0.0) / 11 = 0.8046947099707734
I believe it's due to multiplying by 3 to then divide by 3 not being computed in a stable way. I shuffled the terms in the expression, hopefully without changing what it was doing, and I think at least that error should now be fixed? I also tried to add a simple test to reproduce it.
Fixing this should produce centroids with the correct values and order, thus fixing #169
Cheers
Bruno
Not sure if the Windows errors are legit or just an issue with Windows GH actions.
Your change might help, but there is a fundamental problem in AVLTreeDigest that order is not absolutely guaranteed due to an entire class of problems of which the one that you point out is only an example. The problem is far worse when you have several centroids with identical means. In such a case, any change to the mean should result in potentially large changes to the order in the tree, but the code doesn't handle this well. This (and my inability to find enough time to dig in on the problem) is one of the two major reasons that MergingDigest is strongly recommended. The other reason is that AVLTreeDigest hasn't had the serious attention to detail regarding interpolation special cases that MergingDigest has had. Again, scarcity of time is at the root.
In any case, AVLTreeDigest requires dynamic allocation and there is no chance of a zero serialization reader or writer. These are secondary issues, but still weigh in a bit.
Your change might help, but there is a fundamental problem in AVLTreeDigest that order is not absolutely guaranteed due to an entire class of problems of which the one that you point out is only an example.
:+1:
The problem is far worse when you have several centroids with identical means. In such a case, any change to the mean should result in potentially large changes to the order in the tree, but the code doesn't handle this well. This (and my inability to find enough time to dig in on the problem) is one of the two major reasons that MergingDigest is strongly recommended.
Good to know!!
The other reason is that AVLTreeDigest hasn't had the serious attention to detail regarding interpolation special cases that MergingDigest has had. Again, scarcity of time is at the root
In any case, AVLTreeDigest requires dynamic allocation and there is no chance of a zero serialization reader or writer. These are secondary issues, but still weigh in a bit.
I believe that could explain why the other implementations I've seen so far appear to be significantly simpler than both AVL & Merge digests here. Looks to me like FaceBook's C++ impl, Arrow, CamDavidsonPilon/tdigest, the other one for Postgres, etc., have adapted and also simplified the code.
But any way, happy to help with the AVL if there are easy issues, otherwise I will probably start playing with the merge t-digest now :slightly_smiling_face:
Let me know if you'd like me to close this PR, or leave it open (maybe as draft?) for some future work.
Thanks @tdunning !
We need to keep this open if there isn't another issue for this.
Regarding simplicity, a big part of the complexity in the Java implementation comes from the fact that I wanted a vector of structs which is very inefficient due to boxing. Thus, the struct with multiple vectors. This shift required a special sort implementation and makes all of the loops less obvious.
The Julia implementation is the opposite. Julia emits efficient code for an array of structs and the internal sort handles that vector cleanly.
This same would apply to C++ if you use the standard library. For python, you could do something similar and depending on a built in sort would actually be pretty important for performance.
On Tue, May 9, 2023 at 8:43 AM Bruno P. Kinoshita @.***> wrote:
Your change might help, but there is a fundamental problem in AVLTreeDigest that order is not absolutely guaranteed due to an entire class of problems of which the one that you point out is only an example.
👍
The problem is far worse when you have several centroids with identical means. In such a case, any change to the mean should result in potentially large changes to the order in the tree, but the code doesn't handle this well. This (and my inability to find enough time to dig in on the problem) is one of the two major reasons that MergingDigest is strongly recommended.
Good to know!!
The other reason is that AVLTreeDigest hasn't had the serious attention to detail regarding interpolation special cases that MergingDigest has had. Again, scarcity of time is at the root
In any case, AVLTreeDigest requires dynamic allocation and there is no chance of a zero serialization reader or writer. These are secondary issues, but still weigh in a bit.
I believe that could explain why the other implementations I've seen so far appear to be significantly simpler than both AVL & Merge digests here. Looks to me like FaceBook's C++ impl, Arrow, CamDavidsonPilon/tdigest, the other one for Postgres, etc., have adapted and also simplified the code.
But any way, happy to help with the AVL if there are easy issues, otherwise I will probably start playing with the merge t-digest now 🙂
Let me know if you'd like me to close this PR, or leave it open (maybe as draft?) for some future work.
Thanks @tdunning https://github.com/tdunning !
— Reply to this email directly, view it on GitHub https://github.com/tdunning/t-digest/pull/209#issuecomment-1540430988, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAB5E6SKBE64EMG55TIYXWLXFJQZJANCNFSM6AAAAAAXZ75SW4 . You are receiving this because you were mentioned.Message ID: @.***>