Decay TDigest
Hi, I am interested in a time-decayed version of TDigest. I saw there was a discussion about it in issue #127. Has there been any progress on this problem?
It's quite possible to build something kind of like a time decayed t-digest, but there are some seriously difficult issues to overcome.
The first one is to decide what you actually mean by the invariant of the scale function. Right now, we allow violation of the scale-function invariant for centroids that hold a single sample. This is reasonable because you don't lose any accuracy in the estimation of quantiles or the cumulative distribution in these cases. But if you allow these single counts to decay to partial counts, then you no longer know which centroids should be exempt, because everything else has to satisfy the invariant. In fact, a lot of the accuracy at the extreme tails that the t-digest gets is precisely because individual samples are preserved near the tails. That means that the question of what to do with partially weighted centroids is particularly vexed.
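For concreteness, here is a minimal sketch of that invariant, using the k_1 scale function from the t-digest paper; the exemption for single-sample centroids is exactly what fractional decayed weights make ill-defined.

```java
// k_1 scale function from the t-digest paper: k_1(q) = (delta / 2*pi) * asin(2q - 1),
// where delta is the compression parameter. A centroid covering the quantile
// range [qLeft, qRight] must satisfy k(qRight) - k(qLeft) <= 1, except that
// single-sample centroids are exempt. Once weights can decay to, say, 0.4,
// there is no longer a crisp notion of "single sample" to hang the exemption on.
static double k1(double q, double delta) {
    return delta / (2 * Math.PI) * Math.asin(2 * q - 1);
}

static boolean satisfiesSizeInvariant(double qLeft, double qRight, double delta) {
    return k1(qRight, delta) - k1(qLeft, delta) <= 1;
}
```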
Interestingly, the situation is very different if you look at the log histogram. In that case, partial samples are no problem at all. The reason for the difference is that the log histogram maintains its invariant in the original measurement space, while the t-digest maintains its invariant in quantile space. This issue is subtle but very thorny.
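To illustrate, here is a minimal, hypothetical sketch of a log-spaced histogram with fractional counts (this class is illustrative, not the library's actual LogHistogram): the invariant is the fixed set of bucket boundaries in measurement space, so decay never threatens it.

```java
// Minimal log-spaced histogram. The bucket boundaries -- the invariant -- live
// in the measurement space, so decaying counts to fractional values changes
// nothing about which bucket a value falls into.
class DecayingLogHistogram {
    private final double min;        // smallest representable value, e.g. 1e-6
    private final double logFactor;  // buckets per unit of log(x)
    private final double[] counts;   // fractional counts are perfectly fine here

    DecayingLogHistogram(double min, double max, int bucketCount) {
        this.min = min;
        this.logFactor = bucketCount / Math.log(max / min);
        this.counts = new double[bucketCount];
    }

    void add(double x) {
        int i = (int) (Math.log(x / min) * logFactor);
        counts[Math.max(0, Math.min(counts.length - 1, i))]++;
    }

    // Exponential decay: multiply every count by alpha in (0, 1).
    // No invariant is threatened because the bucket edges never move.
    void decay(double alpha) {
        for (int i = 0; i < counts.length; i++) counts[i] *= alpha;
    }
}
```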
I am pretty firmly of the opinion that the right answer to this is to simply store a digest for every time period. A common choice of period is every minute or every 5 minutes. If you care to, you can maintain a cumulative digest for the last hour or so as well. This gives you complete flexibility in getting a digest for any time period you want, because you can merge multiple digests together so easily.
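A sketch of that per-period scheme, assuming the com.tdunning.math.stats API (`TDigest.createMergingDigest`, `Centroid.mean()`/`count()`); the class name and the compression of 100 are illustrative choices, not anything the library prescribes.

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

import java.util.ArrayDeque;
import java.util.Deque;

// One digest per period (say, one minute), kept in a bounded deque. Queries
// over any recent window just merge the relevant per-period digests.
class WindowedDigest {
    private final Deque<TDigest> periods = new ArrayDeque<>();
    private final int maxPeriods;  // e.g. 60 one-minute digests = 1 hour
    private TDigest current = TDigest.createMergingDigest(100);

    WindowedDigest(int maxPeriods) { this.maxPeriods = maxPeriods; }

    void add(double x) { current.add(x); }

    // Call at each period boundary, e.g. from a once-a-minute scheduler.
    void roll() {
        periods.addFirst(current);
        if (periods.size() > maxPeriods) periods.removeLast();
        current = TDigest.createMergingDigest(100);
    }

    // Merge the newest n completed periods into a fresh digest and query it.
    // For simplicity this ignores the in-progress period.
    double quantileOverLast(int n, double q) {
        TDigest merged = TDigest.createMergingDigest(100);
        int i = 0;
        for (TDigest d : periods) {
            if (i++ >= n) break;
            for (Centroid c : d.centroids()) {
                merged.add(c.mean(), c.count());
            }
        }
        return merged.quantile(q);
    }
}
```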
Hi Ted, thank you for your answer. I want to make sure I understand you correctly:
- My first idea was, at each time period, to divide the weight of all the centroids by some factor (see the sketch after this list). This way, the older the data, the less influence it has. If I understand you correctly, this may be a bad approach, since in that case we are no longer assured that the centroids in the tails contain/represent a single element, and then it is not clear what accuracy we will get.
- Conversely, we could instead multiply the weight of a new entry (or simulate inserting the same entry several times) in order to give more importance to new data. This way, it would still hold that the centroids at the tails contain a single entry.
- What you're suggesting, though, is to use several instances of TDigest, one per defined time period, and then merge them. Is that correct? If so, I don't see how that would simulate a time decay. Can you enlighten me a little more, please?
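For reference, a minimal sketch of the first idea above, assuming the same com.tdunning.math.stats API; note that `add(mean, weight)` takes an integer weight, so a fractional decay factor forces rounding, which is one face of the partial-sample problem discussed above.

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

// Decay by rebuilding: scale every centroid's weight and re-add it to a fresh
// digest. Because add(mean, weight) takes an integer weight, the fractional
// part is lost to rounding and lightly weighted tail centroids can vanish.
static TDigest decayed(TDigest old, double factor) {
    TDigest fresh = TDigest.createMergingDigest(100);  // compression of 100 is illustrative
    for (Centroid c : old.centroids()) {
        int w = (int) Math.round(c.count() * factor);  // fractional weights don't fit
        if (w > 0) {
            fresh.add(c.mean(), w);
        }
    }
    return fresh;
}
```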
Yes, you have the first point correct. Suppose a centroid holds one sample and its weight decays to 1/2; it then absorbs another sample (weight 3/2) and decays by a factor of 2/3. Now it has a weight of (1/2 + 1) × 2/3 = 1, but it represents two samples. Normally, we would treat a weight-1 centroid specially for interpolation purposes to gain accuracy, but we can't do that any more.
One fix for this would be to keep separate counters for the number of samples and the total weight of a centroid, but once a centroid has at least one sample, it can never decay to the point where it can be merged if it is in the tail, and thus you force other samples to be merged instead.
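A hypothetical sketch of that fix, tracking the sample count and the decayed weight separately (this class is illustrative, not part of the library):

```java
// Track the true sample count separately from the decayed weight. The catch:
// samples never decays, so a tail centroid with samples >= 1 looks like a
// singleton forever, no matter how small its weight gets, and it blocks merges.
class DecayingCentroid {
    double mean;
    long samples;   // true number of samples ever absorbed; never decays
    double weight;  // decayed total weight; fractional values allowed

    void add(double x) {
        samples++;
        weight += 1;
        mean += (x - mean) / weight;  // weighted running mean
    }

    void decay(double alpha) {
        weight *= alpha;              // alpha in (0, 1)
    }
}
```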
Regarding your second point, the most common reason to want exponential decay of a digest is to estimate the distribution over a particular (recent) time period. With short time-range-based digests, you don't have to estimate ... you can just combine the pertinent digests and you have exactly what you want.
If you want some sort of weighting over a time period, you can do that too. When you want to do a quantile or CDF operation, you translate it into a composite operation over the sets of digests that share the same weight.
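A sketch of one such composite query, assuming the standard `cdf(double)` and `size()` methods of `com.tdunning.math.stats.TDigest`: the composite CDF at a point is the weight-scaled average of the per-digest CDFs, and the composite quantile inverts it by bisection.

```java
import com.tdunning.math.stats.TDigest;

// Weighted composite CDF: each digest contributes in proportion to its
// (time-decay) weight times the number of samples it holds.
static double compositeCdf(TDigest[] digests, double[] weights, double x) {
    double num = 0, den = 0;
    for (int i = 0; i < digests.length; i++) {
        double w = weights[i] * digests[i].size();
        num += w * digests[i].cdf(x);
        den += w;
    }
    return num / den;
}

// Composite quantile by bisection over the value range [lo, hi];
// 60 iterations is plenty for double precision.
static double compositeQuantile(TDigest[] digests, double[] weights,
                                double q, double lo, double hi) {
    for (int i = 0; i < 60; i++) {
        double mid = (lo + hi) / 2;
        if (compositeCdf(digests, weights, mid) < q) lo = mid; else hi = mid;
    }
    return (lo + hi) / 2;
}
```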
Regarding your last point, I understand that what you suggest is to have one digest per time period. Then I can decrease the weights of "old" digests when I no longer use them. Say I'm now in time period 2 and I have decreased the weights of the digests from periods 0 and 1. When I want to compute a quantile, I should merge the digests of periods 0, 1, and 2 and then compute the quantile on the merged result. Did I understand you correctly?