datasketches-java

Non-deterministic result when merging an empty KllFloatsSketch with two others

Open thomasrebele opened this issue 1 month ago • 8 comments

Merging an empty KllFloatsSketch with two other KllFloatsSketch instances, containing 1 and 200 items respectively, does not always produce the same result.

I would have expected the following test scenario to pass:

import java.lang.foreign.MemorySegment;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;

import org.apache.datasketches.kll.KllFloatsSketch;

@Test
public void test() throws NoSuchAlgorithmException {

  KllFloatsSketch t1 = KllFloatsSketch.newHeapInstance();
  t1.update(1f);
  byte[] tb1 = t1.toByteArray();

  KllFloatsSketch t2 = KllFloatsSketch.newHeapInstance();
  for (int i = 0; i < 200; i++) {
    t2.update(1f * i);
  }
  byte[] tb2 = t2.toByteArray();

  HashSet<BigInteger> digests = new HashSet<>();
  for (int i = 0; i < 30; i++) {
    KllFloatsSketch start = KllFloatsSketch.newHeapInstance();

    byte[] h1 = Arrays.copyOf(tb1, tb2.length);
    byte[] h2 = Arrays.copyOf(tb2, tb2.length);

    KllFloatsSketch kll1 = KllFloatsSketch.heapify(MemorySegment.ofArray(h1));
    start.merge(kll1);

    KllFloatsSketch kll2 = KllFloatsSketch.heapify(MemorySegment.ofArray(h2));
    start.merge(kll2);

    MessageDigest md5 = MessageDigest.getInstance("MD5");

    BigInteger digest = new BigInteger(md5.digest(start.toByteArray()));
    digests.add(digest);
    System.out.println(digest);
  }
  assertEquals(1, digests.size());
}

The digests are:

115710133967357289505160160937439690295
-100233422360292323003164315945381734567
-100233422360292323003164315945381734567
-100233422360292323003164315945381734567
115710133967357289505160160937439690295
115710133967357289505160160937439690295
...

And therefore the test throws:

java.lang.AssertionError:
Expected :2
Actual   :1

thomasrebele commented Nov 21 '25 16:11

I discovered this issue while debugging HIVE-29334. The Hive code may merge many KllFloatsSketch instances, so the instability accumulates. I've seen that the randomness is on purpose.

A possible workaround: the merge could accept a Random parameter, so that the caller can reuse the same generator for all merge operations.
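For illustration, the caller-side pattern this proposal has in mind might look roughly like the following. The `merge(other, random)` overload shown in the comments is hypothetical, not part of the current datasketches-java API; the runnable part only demonstrates seeded-RNG sharing with plain java.util.Random.

```java
import java.util.Random;

public class SharedRngPattern {
  public static void main(String[] args) {
    // One RNG with a fixed seed, created once and reused for the
    // whole merge sequence.
    Random rng = new Random(42L);

    // Hypothetical overload (not in the current API):
    // KllFloatsSketch target = KllFloatsSketch.newHeapInstance();
    // for (KllFloatsSketch s : inputs) {
    //   target.merge(s, rng);  // caller-supplied RNG
    // }

    // With a fixed seed, the draw sequence is reproducible
    // from run to run:
    long first = new Random(42L).nextLong();
    long again = new Random(42L).nextLong();
    System.out.println(first == again);  // prints "true"
  }
}
```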

thomasrebele commented Nov 21 '25 17:11

Deterministic merges would break the sketch accuracy guarantees.

jmalkin commented Nov 21 '25 20:11

Thank you for the hint, @jmalkin. Indeed, the behavior of the existing methods should not change. I would propose adding another method for callers that prefer deterministic output over error guarantees. I've prepared a PR that implements my suggestion. What do you think about it?

thomasrebele commented Nov 24 '25 16:11

@jmalkin, I've read your comment on the PR. Apache Hive uses KLL for histogram statistics. Due to the non-deterministic behavior of the KLL sketch merge, some estimates change for every query compilation. This is quite irritating for the user. Could you have a look at https://issues.apache.org/jira/browse/HIVE-29334 please, and suggest an alternative or workaround?

thomasrebele commented Nov 24 '25 16:11

How about using an approximate assertion?

AlexanderSaydakov commented Nov 25 '25 22:11

If only the test is adapted, the behavior of the merge stays nondeterministic. For Hive histogram statistics this results in non-deterministic output of the EXPLAIN CBO JOINCOST and plain EXPLAIN commands; see HIVE-29334. It would be nice to add some alternative methods with a Random parameter. The behavior for existing callers would not change. If a caller needs deterministic merge results, they can provide a random number generator with a predefined seed, albeit at the cost of the error guarantees.

thomasrebele commented Nov 25 '25 23:11

well, "loss of the error guarantees" really means that a proven algorithm is not proven anymore, and can potentially lead to catastrophically wrong results. unless someone conducts a study to quantify the degradation, this really means that the algorithm is broken

AlexanderSaydakov commented Nov 26 '25 18:11

@thomasrebele, What both @jmalkin and @AlexanderSaydakov are trying to explain to you is that many sketch algorithms, including KLL, are probabilistic in nature. This means that any result you get is a random draw from a known probability distribution. This is true even if the sketch is fed the exact same input data and in the same order. From the mathematics of the proofs, we can determine what the confidence interval is that bounds the result you get. The accuracy "guarantee" of these sketches is just that -- that the result will be within the confidence interval (bounds) with the stated confidence, which is determined by your choice of "k" that configures the sketch. This is by design.

If you force the sketch to be deterministic by fixing the seed of the hash function, you destroy the probabilistic guarantee, and the results of the sketch will certainly be biased by an unknown amount and in an unknown direction. So it makes no sense to provide an alternate method that "provides a Random number generator with a predefined seed".

What needs to change is not the behavior of the sketch, but your expectation that the results from a probabilistic algorithm will always be exactly the same. It won't be, and it can't be. It will be close, but not exactly the same. Change your code so that if the result is within +/- some percent of some previous value, it is acceptable. Choose a percentage threshold that is 3 or 4X the epsilon of the sketch, given K. This will decrease the probability of failure to almost zero.
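For illustration, such a tolerance check could be sketched as follows (plain Java, no sketch library; the epsilon value is a placeholder for whatever the configured sketch reports, e.g. via getNormalizedRankError in datasketches-java):

```java
public class ApproxAssert {
  // Accept 'actual' if its relative deviation from 'expected' is
  // within the tolerance, e.g. 3-4x the sketch's epsilon for the
  // chosen k, as suggested above.
  static boolean approxEquals(double expected, double actual, double tolerance) {
    if (expected == 0.0) {
      return Math.abs(actual) <= tolerance;
    }
    return Math.abs(actual - expected) / Math.abs(expected) <= tolerance;
  }

  public static void main(String[] args) {
    double epsilon = 0.0165;         // placeholder epsilon for some k
    double tolerance = 4 * epsilon;  // 4x epsilon, per the suggestion
    System.out.println(approxEquals(100.0, 103.0, tolerance)); // true: 3% <= 6.6%
    System.out.println(approxEquals(100.0, 110.0, tolerance)); // false: 10% > 6.6%
  }
}
```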

Cheers, Lee.

leerho commented Nov 27 '25 00:11

Thank you @AlexanderSaydakov and @leerho for the feedback. I understand that KLL is probabilistic in nature. Probabilistic updates (feeding the data to a single KLL sketch) are not my issue; I'm happy with the non-determinism there. However, I still need a deterministic result for my use case: merging n KLL sketches into a single one.

I agree that misusing the random number generator (RNG) might lead to bad sketches: if the same seed is used for every KllSketch#merge(KllSketch, Random) call, then the errors add up and become quite large. However, there's a way around this: when the RNG is initialized with a fixed seed at the beginning and reused for all merge operations, the error seems to be the same as with the original method that uses the RNG KllSketch#random.
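The distinction can be illustrated with plain java.util.Random, independent of the sketch library: reseeding a fresh RNG for every operation replays the same draws each time, while a single seeded generator stays reproducible end-to-end yet keeps producing fresh draws across operations.

```java
import java.util.Random;

public class RngReuse {
  public static void main(String[] args) {
    // Anti-pattern: a fresh RNG with the same seed per operation
    // replays identical draws, so the per-merge coin flips correlate
    // and the errors can add up.
    long flip1 = new Random(7L).nextLong();
    long flip2 = new Random(7L).nextLong();
    System.out.println(flip1 == flip2);  // prints "true": identical draws

    // Proposed pattern: seed once, reuse across all operations.
    // The whole sequence is reproducible from run to run, but
    // successive draws within a run remain distinct.
    Random shared = new Random(7L);
    long a = shared.nextLong();
    long b = shared.nextLong();
    System.out.println(a == b);  // prints "false"
  }
}
```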

I've prepared some experiments based on the proposed PR. It seems to me that the deterministic merge is still good enough to be used in my use case, as the errors are very similar to the errors of the original merge method. Please see https://github.com/thomasrebele/datasketches-java/commit/8abbddfea4e2dddc6849e6fe7c44dcd83dae17a5 for the results and the code. I'm not sure whether I measured the normalized rank error correctly, as I could not find the exact definition. I'm happy to adapt the code if you point me to its definition.

Would you have a look at my experiment, please? I'm happy to extend it if you think more experiments are necessary.

thomasrebele commented Dec 01 '25 17:12

@thomasrebele.

I don't see in your replies that you have explored the possibility of changing the expectation to allow for slight variance in the computed result. What is preventing that? Is this a human expectation or a software tool expectation?

The case that is important to you, "merging n KLL sketches to a single one" is also important to us as it is the most important case where sketching requires the probabilistic behavior to maintain the guarantee of accuracy.

Even if you could fix the seed of the RNG, in a large clustered environment it is hard to control the order in which results come in from the various nodes, and changing the order of the input will also create slight variance in the output. Enforcing a strict order of inputs, with thousands of input nodes, would be very expensive in both processing cost and time.

As long as you accept the probabilistic behavior of results being within a tolerance, sketching is insensitive to order! This means that merging n sketches from thousands of nodes can be performed as fast as the data comes in. No need to sort or control the order.

leerho commented Dec 14 '25 23:12

Thank you for your suggestion. The place where the KLL sketches are merged in Hive is a regular Java function, so the order of the sketches can be enforced. I expect the number of sketches to be merged to be less than one million.

The Hive project also indirectly uses the KLL sketch results in the various EXPLAIN commands in the q.out files, which are compared against an expected version. If the results are not stable, one possibility is to mask them. However, if the purpose is to check whether the statistics have been calculated correctly, masking does not help. Switching to comparisons that allow a certain uncertainty opens many other questions: how to evaluate the uncertainty? How to define the threshold at which the comparison should fail? How to avoid flaky tests? Hive's tests for a PR take several hours on a cluster, and it is quite annoying if a test fails for a reason unrelated to the PR.

I would even go so far as to say it would be nice to have an (additional!) deterministic update method to facilitate testing in Hive.

Please be reassured: I want to KEEP the existing behavior. I just want to add another method that allows third-party libraries to override the RNG when needed. The javadoc of the new methods should clearly state that the caller is responsible for evaluating whether it is safe to provide their own RNG.

thomasrebele commented Dec 18 '25 16:12