stream-lib
stream-lib copied to clipboard
TDigest percentiles are off on the last centroid (and close by percentiles have inverted values)
I have a serialized TDigest in Base64: AAAAAkBZAAAAAAAAAAABLkkzm5BGwGAARy0nAEYp0ABGlTYARLegAEU48ABHBuMARTgQAESsIABFSPAARDqAAEaLFgBEpKAARYYwAEQVgABE9mAARTdgAESuYABD3QAARiY4AEWVgABDMwAARTeAAEXByABEYcAARajAAEQwwABF5RgARShwAENWAABF/WAARMAgAEQxAABFGEAAQsoAAEUZMABF4wAARaVgAEV54ABFbuAARVYgAEQqQABGLPwARKngAEQwgABEhOAARIzgAER7gABEoaAARPOgAEXFoABFLkAARHWAAEVDwABE2WAAQ3kAAESfIABFdbAARX6gAESHAABELoAARXxQAETDgABDNQAARa5wAEU6oABFIkAARcg4AESlIABEpDAARcM0AEWy+ABFZlgARPSwAEVRcABE+aAAQmQAAESAgABF22wARZh4AESmUABEbkAAQ+yAAEQ24ABFb3AARdYoAEXyNABD70AARgjMq0WGQqtFxTgARQDAAEUBAABCNAAARRCQAEU/aABEy5AARPlAAETwkABFL1gARa0wAESiYABFR/AARJaVVUUmLVVF5uQARbnSq0Wc7VVFXVAARaTwAEWFtABFFpgARGJAAESQ0ABFMbAAQ08qqkW40qtFgIgARZYYAEWNUABFmOgARD2VVUWYM1VFyWYARWqiq0UGkABE3JqrRWFIAESeAABF+34ARUR0AETgwABElVAARIagAEUfcABFhZFVRKyqq0YD1qtFbPqqRMTdVUPG4ABFNOVVRMUqq0UNiqtE0AAARZe6q0XLUKtFaBAARDLQAEVwSqtFpQarRYb8AEXYSABFNlAARI5gAET+iABFgt4ARYlQAEVdkABEv3AARU5sAEWchgBFPHAARSGgAEW6iABFH/AARQFQAESnqABEZbAARNlwAETB5VVEpVVVRNoFVUUA2ABFVNVVRQ1qq0TMeABFTC6rRcuNVUTVVVVFX6AAROG1VUVlAABF2h1VRYf5VUUarVVFrKVVRMxAAETy9VVE6cqrRNtwAEXxbABFEeVVRTcwAEUOGqpEynVVRTU1VUWxWABEv0AARQ3YAEUN4qtGDBQAROLKq0VnIABFWcAARfqYAEWSaABFhSgAReyiq0XmhVVGEVFVRYohVUUOqABE72AARRvgAEYUigBFVTgARP8QAETH0ABGHsAARSs4AEUgSABFfLAARQ8gAEWXuABFTVAARl1AAEUBQABDuIAAQ6uAAEUDkABD/4AARbl4AEVbEABFEOAAReMIAEXSkABGL8AARW8wAEYHTABFFbAAQ0gAAEUBsABFt4AARY8wAEY0FABGDwwARJWAAEVIoABFSZAARZVoAEYOsABFNVAARgWwAEU5EABForgARSAQAEXb8ABFStAARaa4AEPTgABErMAARU8gAEQVwABE7CAARh8UAET1gABGN+AARn9sAEQlAABFOsAARZ1oAEPlgABGIoQARhb8AEW8GABD8oAARhOMAEVXgABE7AAARtkeAEZZ/ABGl7gARtqgAEW9qABGybIARnG0AEUs4ABGiKoAQ9eAAEeglQBGCMwARnf0AEbJvgBGfmgAR0fpAEgXB4BHsy4ASAMpAEgYdkBMBzAdAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQECAQEBAQECAgEBAQICAgIBAgEBAQICAQECAgICAgIDAQMBAgEBAgEBAgMCAQIDAgIDAwMBAgMBAgIDAgEDAwEDBAIDAwICAwQBAgICAgMDAwMEAQMDAwMDBAQBAwQEAgEEBAIEAwIEAgQBAgMBBAICAwMCAgMDBAMDAQMDAwMCAwMCAwMCAQMDAgMCAgMCAwMBAgIDAQMDAQMCAQECAgECAQICAQECAgEBAQEBAQEBAQECAQICAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQE=
To get it back into a TDigest tree: byte[] data = DatatypeConverter.parseBase64Binary(encoded); ByteBuffer bb = ByteBuffer.wrap(data); TDigest td = TDigest.fromBytes(bb);
where encode is the base64 string above.
when you query for quantiles it happens that: td.quantile(0.999d) > td.quantile(0.9999d) which should never happen.
I believe part of th eproblem is on https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java line 320 (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L320). I believe in that line (center.count() - (q - t)) should be instead (q - t) Symmetrically line 317 (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L317) should move from (q - t) to (center.count() - (q - t)) restablishing the symmetry.
The same holds true for lines 300 and 303 (theu have to have the (q-t) and (center.count() - (q - t)) swapped around. I tried that td.quantile(0.5d) > td.quantile(0.5001d).
I am not sure it is the right fix but the invariants should be respected: td.quantile(0.999d)<=td.quantile(0.9999d) as well as td.quantile(0.5d)<=td.quantile(0.5001d)
Thanks
Interesting bug!
could you write up a test that reproduces this?
On Tue, Sep 16, 2014 at 6:18 PM, phoenixdownita [email protected] wrote:
I have a serialized TDigest in Base64: AAAAAkBZAAAAAAAAAAABLkkzm5BGwGAARy0nAEYp0ABGlTYARLegAEU48ABHBuMARTgQAESsIABFSPAARDqAAEaLFgBEpKAARYYwAEQVgABE9mAARTdgAESuYABD3QAARiY4AEWVgABDMwAARTeAAEXByABEYcAARajAAEQwwABF5RgARShwAENWAABF/WAARMAgAEQxAABFGEAAQsoAAEUZMABF4wAARaVgAEV54ABFbuAARVYgAEQqQABGLPwARKngAEQwgABEhOAARIzgAER7gABEoaAARPOgAEXFoABFLkAARHWAAEVDwABE2WAAQ3kAAESfIABFdbAARX6gAESHAABELoAARXxQAETDgABDNQAARa5wAEU6oABFIkAARcg4AESlIABEpDAARcM0AEWy+ABFZlgARPSwAEVRcABE+aAAQmQAAESAgABF22wARZh4AESmUABEbkAAQ+yAAEQ24ABFb3AARdYoAEXyNABD70AARgjMq0WGQqtFxTgARQDAAEUBAABCNAAARRCQAEU/aABEy5AARPlAAETwkABFL1gARa0wAESiYABFR/AARJaVVUUmLVVF5uQARbnSq0Wc7VVFXVAARaTwAEWFtABFFpgARGJAAESQ0ABFMbAAQ08qqkW40qtFgIgARZYYAEWNUABFmOgARD2VVUWYM1VFyWYARWqiq0UGkABE3JqrRWFIAESeAABF+34ARUR0AETgwABElVAARIagAEUfcABFhZFVRKyqq0YD1qtFbPqqRMTdVUPG4ABFNOVVRMUqq0UNiqtE0AAARZe6q0XLUKtFaBAARDLQAEVwSqtFpQarRYb8AEXYSABFNlAARI5gAET+iABFgt4ARYlQAEVdkABEv3AARU5sAEWchgBFPHAARSGgAEW6iABFH/AARQFQAESnqABEZbAARNlwAETB5VVEpVVVRNoFVUUA2ABFVNVVRQ1qq0TMeABFTC6rRcuNVUTVVV VFX6AARO G1VUVlAABF2h1VRYf5VUUarVVFrKVVRMxAAETy9VVE6cqrRNtwAEXxbABFEeVVRTcwAEUOGqpEynVVRTU1VUWxWABEv0AARQ3YAEUN4qtGDBQAROLKq0VnIABFWcAARfqYAEWSaABFhSgAReyiq0XmhVVGEVFVRYohVUUOqABE72AARRvgAEYUigBFVTgARP8QAETH0ABGHsAARSs4AEUgSABFfLAARQ8gAEWXuABFTVAARl1AAEUBQABDuIAAQ6uAAEUDkABD/4AARbl4AEVbEABFEOAAReMIAEXSkABGL8AARW8wAEYHTABFFbAAQ0gAAEUBsABFt4AARY8wAEY0FABGDwwARJWAAEVIoABFSZAARZVoAEYOsABFNVAARgWwAEU5EABForgARSAQAEXb8ABFStAARaa4AEPTgABErMAARU8gAEQVwABE7CAARh8UAET1gABGN+AARn9sAEQlAABFOsAARZ1oAEPlgABGIoQARhb8AEW8GABD8oAARhOMAEVXgABE7AAARtkeAEZZ/ABGl7gARtqgAEW9qABGybIARnG0AEUs4ABGiKoAQ9eAAEeglQBGCMwARnf0AEbJvgBGfmgAR0fpAEgXB4BHsy4ASAMpAEgYdkBMBzAdAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQECAQEBAQECAgEBAQICAgIBAgEBAQICAQECAgICAgIDAQMBAgEBAgEBAgMCAQIDAgIDAwMBAgMBAgIDAgEDAwEDBAIDAwICAwQBAgICAgMDAwMEAQMDAwMDBAQBAwQEAgEEBAIEAwIEAgQBAgMBBAICAwMCAgMDBAMDAQMDAwMCAwMCAwMCAQMDAgMCAgMCAwMBAgIDAQMDAQMCAQECAgECAQICAQECAgEBAQEBAQEBAQECAQICAQEBAQEBAQEBAQEBAQEBAQEBAQEBA QEBAQEBA QEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQE=
To get it back into a TDigest tree: byte[] data = DatatypeConverter.parseBase64Binary(encoded); ByteBuffer bb = ByteBuffer.wrap(data); TDigest td = TDigest.fromBytes(bb);
where encode is the base64 string above.
when you query for quantiles it happens that: td.quantile(0.999d) > td.quantile(0.9999d) which should never happen.
I believe part of th eproblem is on https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java line 320 ( https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L320 ). I believe in that line (center.count() - (q - t)) should be instead (q - t) Symmetrically line 317 ( https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L317 ) should move from (q - t) to (center.count() - (q - t)) restablishing the symmetry.
The same holds true for lines 300 and 303 (theu have to have the (q-t) and (center.count() - (q - t)) swapped around. I tried that td.quantile(0.5d) > td.quantile(0.5001d).
I am not sure it is the right fix but the invariants should be respected: td.quantile(0.999d)<=td.quantile(0.9999d) as well as td.quantile(0.5d)<=td.quantile(0.5001d)
Thanks
— Reply to this email directly or view it on GitHub https://github.com/addthis/stream-lib/issues/83.
You can literally: String encoded = "..." <-- the big Base64 String I attached above form the first A to the last = byte[] data = DatatypeConverter.parseBase64Binary(encoded); //this is a javax class available on openJDK so it should be available on your preferred JVM/JDK combo ByteBuffer bb = ByteBuffer.wrap(data); TDigest td = TDigest.fromBytes(bb); //TDigest from stream-lib
then: double td1= td.quantile(0.999d); double td2 = td.quantile(0.9999d);
and you'll see td1 > td2.
Those 2 values 0.999d and 0.9999d are pretty useful in this case because the land the algo in the last of the centroids. While debugging I could see all other variables in that function were identical at line 320, just q for 0.9999 was bigger than q for 0.999.
If instead you'd want the actual raw data that created the TDigest in the first place, I have to say I don't have a repro for it that way. Assuming there are no bugs in the serialization of the TDigest (that is how I got the base64 string in the first place, I used the smallbytes version), deserializing from the base64 string should be as good as I can get it.
I applied to my own "copy" of the TDigest the inversion of (q-t) with (center.count() - (q - t)) in those 4 lines involved and literally used that base64 in a UTest to assert that even with small delta quantiles (say 50% and 50.000001%) it respects the invariant that the value of the bigger quantile should be bigger or equal to the value of the lesser one and it seems to work.
I am not too familiar with the code or theory behind it although I skimmed thru the paper, I just use it as is to compute quantiles in a distributed fashion, but I believe that the piece of code I suggest to "fix" has to do with interpolating the value based on how many elements are contained in the centroid and if the center of said centroid is to the left or the right of the sough after percentile. If that intuition is right then I believe I can say that (center.count() - (q - t)) gets smaller (closer to the center) and not bigger as q gets bigger than t (but not enough to need to go in the next centroid). So I believe that when on the right side of the centroid (q-t) is appropriate as even slightly bigger values of q yield bigger values of the interpolated percentile (line 320). Line 317 I did for symmetry, thinking that maybe the coder inadvertently swapped the 2 interpolation functions. I do not have an example to prove 317 is required. Lines 300 and 303 had to be "swapped" as well as they follow the same logic but for the case in which the requested percentile falls in an intermediate centroid and not just the last ... in this case I had tests that hit both conditions and proved the swap "fixed" the invariant.
I am not saying that the swap of code snippets that I propose is the right fix, just that with it values do not break the invariant, but my "fix" may as well be the wrong fix in terms of actual values that should be returned (not just their relative order).
Does it make sense? I hope with the info here you can repro the situation and either justify the behavior (maybe serialization/deserialization breaks some assumptions) or work on a the real fix, mine was literally done on a hunch ... so much for math proofing it.
On Wed, Sep 17, 2014 at 5:18 PM, phoenixdownita [email protected] wrote:
Does it make sense? I hope with the info here you can repro the situation and either justify the behavior (maybe serialization/deserialization breaks some assumptions) or work on a the real fix, mine was literally done on a hunch ... so much for math proofing it.
Yeah... I can go with the serialized image, but since there are several implementations, it is kind of hairy thinking about that. Having data that causes the problem might be nicer.
But I think you are right that it is conflicting conventions and assumptions at work.
I have a source data example:
int maxVal = 31;
TDigest td = new TDigest(100);
for(int i = 1; i <= maxVal; i++) {
td.add(i);
}
double tp999 = td.quantile(0.999d);
double tp9999 = td.quantile(0.9999d);
assertTrue(tp9999 >= tp999);
The assert will trigger because of the inversion mentioned.
Also they show an imprecision of the algo, namely they both are > 31 (as even for the last centroid the assumption is that half the count is one side the other on the other side [even when the count is 1]). Not sure if we can amend that for the last centroid in all cases (or only in the case of count=1) to enforce that no percentile can be bigger then the biggest element seen (which I believe is precise only when the count = 1).
Anyway hope this helps you in "fixing" the issue or vetting my solution of replacing (q - t) with (center.count() - (q - t)) and viceversa.
Thanks. Excellent input on this.
We changed the computation of quantiles previously to address a different issue. That means that I have to actually think about this for a bit.
It is always possible that this is a matter of definitions. For that matter, your fix may be the continuation of the previous fix with each being on one side or the other of the quantile computation (x -> q or q -> x).
I have just been swamped lately which has made it difficult to think hard about a problem like this for a few hours.
On Thu, Sep 25, 2014 at 2:28 PM, phoenixdownita [email protected] wrote:
I have a source data example:
int maxVal = 31; TDigest td = new TDigest(100); for(int i = 1; i <= maxVal; i++) { td.add(i); } double tp999 = td.quantile(0.999d); double tp9999 = td.quantile(0.9999d); assertTrue(tp9999 >= tp999);
The assert will trigger because of the inversion mentioned.
Also they show an imprecision of the algo, namely they both are > 31 (as even for the last centroid the assumption is that half the count is one side the other on the other side [even when the count is 1]). Not sure if we can amend that for the last centroid in all cases (or only in the case of count=1) to enforce that no percentile can be bigger then the biggest element seen (which I believe is precise only when the count = 1).
Anyway hope this helps you in "fixing" the issue or vetting my solution of replacing (q - t) with (center.count() - (q - t)) and viceversa.
— Reply to this email directly or view it on GitHub https://github.com/addthis/stream-lib/issues/83#issuecomment-56886872.
This is the code I have in place as a fix (I just put it here in case, but I make no assertion for it to be the correct fix in terms of the values that are returned). it inverts the (q-t) with (center.count() - (q - t)) AND in case of a single centroid rather than failing it returns the value of it as any quantile (a single value is the collapse of all quantiles anyway).
public double quantile(double q) {
GroupTree values = summary;
//Preconditions.checkArgument(values.size() > 1);
//fix to allow single value
Preconditions.checkArgument(values.size() >= 1);
Iterator<Group> it = values.iterator();
Group center = it.next();
//fix for single value return the value
if(!it.hasNext()) {
//only 1 centroid (with 1 value)
return center.mean();
}
Group leading = it.next();
if (!it.hasNext()) {
// only two centroids because of size limits
// both a and b have to have just a single element
double diff = (leading.mean() - center.mean()) / 2;
if (q > 0.75) {
return leading.mean() + diff * (4 * q - 3);
} else {
return center.mean() + diff * (4 * q - 1);
}
} else {
q *= count;
double right = (leading.mean() - center.mean()) / 2;
// we have nothing else to go on so make left hanging width same as right to start
double left = right;
double t = center.count();
while (it.hasNext()) {
if (t + center.count() / 2 >= q) {
// left side of center
//return center.mean() - left * 2.0d * (q - t) / center.count();
//fix
return center.mean() - left * 2.0d * (center.count() - (q - t)) / center.count();
} else if (t + leading.count() >= q) {
// right of b but left of the left-most thing beyond
//return center.mean() + right * 2.0 * (center.count() - (q - t)) / center.count();
//fix
return center.mean() + right * 2.0d * (q - t) / center.count();
}
t += center.count();
center = leading;
leading = it.next();
left = right;
right = (leading.mean() - center.mean()) / 2;
}
// ran out of data ... assume final width is symmetrical
center = leading;
left = right;
if (t + center.count() / 2 >= q) {
// left side of center
//return center.mean() - left * 2.0d * (q - t) / center.count();
//fix
return center.mean() - left * 2.0d * (center.count() - (q - t)) / center.count();
} else if (t + leading.count() >= q) {
// right of center but left of leading
//return center.mean() + right * 2.0d * (center.count() - (q - t)) / center.count();
//fix
return center.mean() + right * 2.0d * (q - t) / center.count();
} else {
// shouldn't be possible
return 1;
}
}
}
I decided to actually take the impl of this functions from: https://github.com/tdunning/t-digest/blob/master/src/main/java/com/tdunning/math/stats/TreeDigest.java as it seems to work much better (and no issues of inversions etc...) It takes very minor changes (Group vs Centroid naming) so it wasn't an hassle at all.
I guess this impl was left behind ... no biggie, now I know.