t-digest
Questions about merging implementation
Hey, I'm working on a Go port of the MergingDigest. I noticed that you haven't finished updating your paper to cover the merging implementation, which left me with some questions about its behavior that I was hoping you could clarify.
First off, I notice that the cdf and quantile implementations depend on an important assumption detailed in figure 2 of your paper: the samples assigned to a given centroid are approximately uniformly distributed between the midpoints separating it from the adjacent centroids on either side. I generated a similar graph for your Java implementation using deviation-merge.csv, and this is what I got (edit: by the way, this graph contains all 3 distributions, gamma, seq, and uniform, which might explain its slightly uneven shape):
Most of the values are between -0.5 and 0.5, as in your original figure 2, but a significant number of values fall outside that in the -1 to 1 range, and there's also a long tail of values trailing out towards the previous centroid.
Here's the graph from my implementation (I've tried uniform, exponential and normal distributions, they all had the same shape):
It looks narrower than yours, but the shape is similar. What I'm wondering is whether you think the uniform distribution assumed for the other t-digest implementations still holds up for this one. Both your implementation and mine have a distinctly nonuniform shape with a long tail towards the previous centroid. I don't know how much error is introduced by assuming a uniform distribution between the midpoints, but it doesn't really look uniform to me.
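To show concretely where that assumption enters, here is a stripped-down version of the kind of interpolation my port does for quantile(). It isn't your exact code (or mine, really), just a sketch of the scheme; the linear interpolation between adjacent means is only justified if the samples between two means are spread roughly evenly:

```go
package main

import "fmt"

// centroid summarizes a cluster of samples by its mean and weight.
type centroid struct {
	mean  float64
	count float64
}

// quantile is a simplified sketch of the interpolation scheme. It treats
// each centroid's weight as centered on its mean and linearly interpolates
// between adjacent means, which is only accurate if the samples between
// two means are roughly uniformly distributed.
func quantile(cs []centroid, q float64) float64 {
	var total float64
	for _, c := range cs {
		total += c.count
	}
	target := q * total

	cum := 0.0
	for i, c := range cs {
		if cum+c.count/2 >= target {
			if i == 0 {
				return c.mean // below the first mean; no left neighbor
			}
			prev := cs[i-1]
			// Fraction of the way from the previous mean to this one.
			t := (target - (cum - prev.count/2)) / ((prev.count + c.count) / 2)
			return prev.mean + t*(c.mean-prev.mean)
		}
		cum += c.count
	}
	return cs[len(cs)-1].mean
}

func main() {
	cs := []centroid{{1, 10}, {2, 20}, {5, 40}, {9, 20}, {10, 10}}
	fmt.Println(quantile(cs, 0.5)) // 5
}
```

If the samples within a centroid actually pile up towards the previous centroid, as in the graphs above, that interpolation will presumably be biased in the same direction.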
Second, I noticed that your implementation uses a quadratic equation to estimate the size of the temporary centroid buffer, seen here. Could you explain what data this quadratic is derived from? I would like to do some tests of my own to corroborate your equation.
One more thing I was curious about: in section 2.1 of the updated version of your paper, you mention the mapping from quantiles to indices, implemented here. I was wondering what motivated this particular choice, and whether there are other functions with similar shapes that you have considered exploring. I was also wondering how this upper bound was computed, and how it relates to the sinusoidal mapping that was chosen.
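For reference, this is how I've read that mapping for my port. The function names, constants, and normalization below are my interpretation rather than the reference behavior, so they may not match your code exactly:

```go
package main

import (
	"fmt"
	"math"
)

// indexFromQuantile is my reading of the sinusoidal mapping: it sends a
// quantile q in [0, 1] to a notional centroid index in [0, compression].
// The index changes fastest near q = 0 and q = 1, so a single centroid
// at the extremes can only span a tiny quantile range.
func indexFromQuantile(q, compression float64) float64 {
	return compression * (math.Asin(2*q-1) + math.Pi/2) / math.Pi
}

// quantileFromIndex inverts the mapping.
func quantileFromIndex(k, compression float64) float64 {
	return (math.Sin(k*math.Pi/compression-math.Pi/2) + 1) / 2
}

func main() {
	const compression = 100
	for _, q := range []float64{0.001, 0.01, 0.5, 0.99, 0.999} {
		k := indexFromQuantile(q, compression)
		fmt.Printf("q=%.3f -> k=%.3f -> q=%.3f\n", q, k, quantileFromIndex(k, compression))
	}
}
```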
Finally, thanks for developing this data structure, and for providing an open-source reference implementation. I look forward to reading the updated version of your paper with more details on the merging variant.
Thanks for your excellent comments.
I am re-running the distribution tests to see what has happened with the new algorithms, but I don't have answers on that yet.
The quadratic equation was just a hold-over from a previous world and was very wrong in the current world. It is now gone. The merging algorithm has hard bounds on size that are basically just 2 * compression.
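For what it's worth, a port can use that bound directly when sizing its arrays. This is only a sketch under the assumption that the bound is taken as ceil(2 * compression) retained centroids; the names and the incoming-buffer factor here are illustrative, not the constants from the reference code:

```go
package main

import (
	"fmt"
	"math"
)

// newBuffers sizes storage for a merging digest, assuming the hard bound
// of roughly 2 * compression retained centroids mentioned above.
// bufferFactor controls how many unmerged points accumulate before a
// merge pass; it is an illustrative choice, not the reference value.
func newBuffers(compression float64, bufferFactor int) (means, weights, merge []float64) {
	retained := int(math.Ceil(2 * compression))
	means = make([]float64, 0, retained)
	weights = make([]float64, 0, retained)
	merge = make([]float64, 0, retained*bufferFactor)
	return
}

func main() {
	means, weights, merge := newBuffers(100, 5)
	fmt.Println(cap(means), cap(weights), cap(merge)) // 200 200 1000
}
```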
Also, if you check out the latest master version of the paper, you should have a decent description of the algorithm. It is vastly simpler than before.
No progress on this, but there have been a number of changes that could improve behavior.
The distribution of values in a bin is critical to understand since a good characterization of this could improve accuracy, possibly substantially.
Tommy,
I just wanted to thank you for your observation. I have been working on the quality assessment in anticipation of the 4.0 release, and the issue you have highlighted is a big deal that compromises tail accuracy significantly.
This can be seen quite dramatically by first using sorted test data. Here are the data points for the first few centroids of a typical run:
The samples that contribute to each centroid have a nice uniform distribution, and the centroids have fewer and fewer samples as we get to the left end. All is as we would want.
On the other hand, if you look at what happens with randomly ordered data (but still with a really big sort buffer):
With a more reasonably sized sort buffer, the tails of each centroid become elongated:
Even worse than just being elongated, these clusters are asymmetrical.
I am investigating merge strategies that might avoid this problem.
Progress report on this.
The issue that Tommy has highlighted is actually quite serious and has a bad effect on accuracy. In investigating it, I have learned a lot about different size-limiting strategies as well. I will describe that side learning elsewhere, but the money graph is this one.
It is clear that the update strategy used in the tree digest (attach points to the nearest cluster if space is available) is doing a much better job, and I attribute the improvement to the fact that that update strategy isn't biased. In the merging digest, the updates are always done by a linear scan, and that makes new data get added to centroids mostly from one side (either the left side, or the outer side if we alternate scan direction). This causes the smearing seen in these graphs.

The improvement in the graph above was obtained by changing the size limit from sqrt(q(1-q)) to q(1-q). There are some normalizing factors in there, but the effect is that clusters near the limits are kept smaller for longer, hopefully limiting the effect of smear. This makes errors about 5x smaller for extreme q, but doesn't really solve the core problem yet.
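To make the shape difference concrete (leaving the normalizing factors out, as noted), compare the two limit functions near the tails; this is just an illustration of the shapes, not the actual code:

```go
package main

import (
	"fmt"
	"math"
)

// The two cluster-size limit shapes discussed above, with normalizing
// constants omitted. Near q = 0 or q = 1, q(1-q) falls off much faster
// than sqrt(q(1-q)), so clusters at the extremes are held smaller for longer.
func limitSqrt(q float64) float64   { return math.Sqrt(q * (1 - q)) }
func limitLinear(q float64) float64 { return q * (1 - q) }

func main() {
	for _, q := range []float64{0.0001, 0.001, 0.01, 0.1, 0.5} {
		fmt.Printf("q=%.4f  sqrt(q(1-q))=%.5f  q(1-q)=%.5f\n", q, limitSqrt(q), limitLinear(q))
	}
}
```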
Improving this situation and making it easier to run tests like this is a major priority for me right now. Of course, I have only minutes per week to work on this, so progress is lamentably slow.