
Determining quality

Open devinrsmith opened this issue 3 years ago • 3 comments

After updating from 3.2 to 3.3, some of our tests with hardcoded "quality" checks were failing. I captured my notes and speculated a bit in https://github.com/deephaven/deephaven-core/issues/3204. I'm sure there is a lot of nuance to determining "accuracy and quality", but I'm wondering if there is any general guidance on how one should think / measure quality wrt t-digests.

Maybe related to the note on the README:

describe accuracy using the quality suite

Potentially related to https://github.com/tdunning/t-digest/issues/85

It may be that our test is not indicative of "real life" usage.

I'm sure it's a balancing act, trying to optimize t-digest in general may make some specific datasets perform worse.

devinrsmith avatar Dec 14 '22 18:12 devinrsmith

Can you say more about what your test does?

There were some changes in 3.3 that tended to make the number of retained centroids smaller (i.e. closer to the specified number). This has the effect of degrading accuracy at a fixed compression factor, although accuracy relative to the size of the digest typically improves.

This affected another package where the compression factor for the tests was a relatively small value of 100. In their case, upping the default value to 200 had the desired effect (same memory use, improved accuracy).

The other possible change regards previously incorrect handling of some cases with small amounts of data. This can interact with poor test design. The basic problem stems from the fact that while the cumulative distribution is a function, the inverse (the quantile function) is not. T-digest is supposed to treat unit centroids as impulses in the empirical PDF located at the sample value. This causes the cumulative distribution to take half-steps at the discontinuities at each distinct sample value.

For the inverse cumulative distribution, there is no well-defined value for q within the range of such a discontinuity. This can make it hard to write good tests for the desired behavior.
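
For concreteness, here is a minimal sketch of that half-step convention (just an illustration of the convention, not t-digest's actual interpolation code):

    public class HalfStepCdf {
        // Empirical CDF with half-steps: at a value x that occurs in the data,
        // the CDF is taken halfway up the jump, i.e. (countBelow + 0.5 * countEqual) / n.
        static double empiricalCdf(double[] data, double x) {
            int below = 0, equal = 0;
            for (double v : data) {
                if (v < x) below++;
                else if (v == x) equal++;
            }
            return (below + 0.5 * equal) / data.length;
        }

        public static void main(String[] args) {
            double[] data = {1, 2, 2, 3};
            // cdf(2) = (1 + 0.5 * 2) / 4 = 0.5, halfway up the jump from 0.25 to 0.75.
            System.out.println(empiricalCdf(data, 2));
            // Any q in (0.25, 0.75) maps back to x = 2, so quantile(q) is flat there;
            // a test that pins an exact value inside that band is testing a convention,
            // not a well-defined quantity.
        }
    }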

To figure out which of these phenomena you are seeing, or whether it is something novel, I need to know more about what you are doing in your test.


tdunning avatar Dec 15 '22 04:12 tdunning

The test that passes in 3.2 and fails in 3.3 is essentially a (seeded) uniform random distribution of 10000 doubles in the range [-10000.0, 10000.0]. We check abs((p_x - t_x) / p_x) < 0.005 for x in [75, 99, 99.9], where p_x is the exact percentile and t_x is the t-digest estimate with compression 100. (The test also does the same checks on 4 non-overlapping subsets whose union equals the original distribution.)
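
In sketch form, the check is roughly the following (a reconstruction, not the actual test code; the exact-percentile interpolation here is nearest-rank, which may differ from what our test does):

    import java.util.Arrays;
    import java.util.Random;
    import com.tdunning.math.stats.TDigest;

    public class QuantileCheck {
        public static void main(String[] args) {
            Random rng = new Random(42);                        // seeded
            int n = 10_000;
            double[] data = new double[n];
            TDigest digest = TDigest.createMergingDigest(100);  // compression 100
            for (int i = 0; i < n; i++) {
                data[i] = -10_000.0 + 20_000.0 * rng.nextDouble();
                digest.add(data[i]);
            }
            Arrays.sort(data);
            for (double pct : new double[]{75, 99, 99.9}) {
                double q = pct / 100.0;
                // exact percentile via nearest rank; the real test may interpolate differently
                double exact = data[(int) Math.ceil(q * n) - 1];
                double est = digest.quantile(q);
                double relErr = Math.abs((exact - est) / exact);
                System.out.printf("p%.1f exact=%.3f est=%.3f relErr=%.5f pass=%b%n",
                        pct, exact, est, relErr, relErr < 0.005);
            }
        }
    }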

Upon seeing these fail in 3.3, I added a single root-mean-square error calculation that accumulates the errors across this distribution, as well as re-seeding the test with 1000 distinct seeds. A bit hand-wavy, but in the end the calculation produces:

3.2 RMSE = 0.00090
3.3 RMSE = 0.00127
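
The RMSE figures were produced by something like the following sketch (again approximate; which percentiles go into the aggregate and how the exact percentile is interpolated are simplifications here):

    import java.util.Arrays;
    import java.util.Random;
    import com.tdunning.math.stats.TDigest;

    public class QuantileRmse {
        public static void main(String[] args) {
            double[] percentiles = {75, 99, 99.9};   // same percentiles as the pass/fail check
            int n = 10_000;
            double sumSq = 0;
            long count = 0;
            for (int seed = 0; seed < 1_000; seed++) {           // 1000 distinct seeds
                Random rng = new Random(seed);
                double[] data = new double[n];
                TDigest digest = TDigest.createMergingDigest(100);
                for (int i = 0; i < n; i++) {
                    data[i] = -10_000.0 + 20_000.0 * rng.nextDouble();
                    digest.add(data[i]);
                }
                Arrays.sort(data);
                for (double pct : percentiles) {
                    double q = pct / 100.0;
                    double exact = data[(int) Math.ceil(q * n) - 1];
                    double relErr = (exact - digest.quantile(q)) / exact;
                    sumSq += relErr * relErr;
                    count++;
                }
            }
            System.out.printf("RMSE of relative error = %.5f%n", Math.sqrt(sumSq / count));
        }
    }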

I suspect I might find

upping the default value to 200 had the desired effect (same memory use, improved accuracy)

so I will look into that.

I think another factor is that I don't have a good idea of what "compression" is, or how to think about whether developers may need to tweak compression from release to release. The javadocs say "100 is a common value for normal uses".

I would have expected a given compression X to be "equivalent" from release to release in one of these two dimensions:

  1. compression X achieves the same memory usage (and hopefully improved "quality")
  2. compression X achieves the same "quality" (and hopefully improved memory usage)

But it seems like compression is a more nuanced value? (i.e., based on your statements and my findings, X=100 uses less memory but also gives less "quality" on 3.3, so it doesn't follow either of the two dimensions above).

I wonder if it would be useful to have user-level abstractions that could be exposed as the two dimensions above?

I.e.,

    public static double qualityToCompression(double quality) {
        return quality * QUALITY_TO_COMPRESS_FACTOR; // maybe it's not a constant factor, but a more complex function
    }

    public static double memoryToCompression(double memory) {
        return memory * MEMORY_TO_COMPRESS_FACTOR; // maybe it's not a constant factor, but a more complex function
    }

Then user-code would be able to try and lock-in on the mode they are trying to optimize for:

TDigest tDigestForTesting = createDigest(qualityToCompression(100.0));
...
TDigest tDigestForProduction = createDigest(memoryToCompression(...));
...

This is a bit long-winded... but I very much appreciate your response, and will get back after I analyze the memory usage a bit more.

devinrsmith avatar Dec 15 '22 20:12 devinrsmith

Sorry to be slow.

The compression factor is an upper bound on the number of centroids.

Half the compression factor is a lower bound on the number of centroids for the K_1 scale function (and is practically true for K_2 and K_3 as well).
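
A quick way to see those bounds empirically is something like this sketch (it assumes the centroids() accessor on TDigest; the exact count will vary a little with the data and the scale function):

    import java.util.Random;
    import com.tdunning.math.stats.TDigest;

    public class CentroidCountCheck {
        public static void main(String[] args) {
            double compression = 100;
            TDigest digest = TDigest.createMergingDigest(compression);
            Random rng = new Random(1);
            for (int i = 0; i < 1_000_000; i++) {
                digest.add(rng.nextGaussian());
            }
            // Expect roughly compression / 2 <= centroids <= compression
            // (exact for K_1; practically true for the default scale function).
            System.out.printf("compression=%.0f retained centroids=%d%n",
                    compression, digest.centroids().size());
        }
    }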


tdunning avatar Dec 23 '22 02:12 tdunning