cascading_ext

Can I use TDigest instead of QDigest?

Open ahmadpriatama opened this issue 9 years ago • 2 comments

I'm calculating quantiles as described in the LiveRamp blog post, but running it on the production server outputs:

Caused by: java.lang.IllegalArgumentException: Can only accept values in the range 0..4611686018427387903, got 9223372036854775807
    at com.clearspring.analytics.stream.quantile.QDigest.offer(QDigest.java:125)
    at com.liveramp.cascading_ext.combiner.lib.QuantileExactAggregator.partialAggregate(QuantileExactAggregator.java:38)
    at com.liveramp.cascading_ext.combiner.lib.QuantileExactAggregator.partialAggregate(QuantileExactAggregator.java:17)
    at com.liveramp.cascading_ext.combiner.CombinerFunctionContext.combineAndEvict(CombinerFunctionContext.java:130)
    at com.liveramp.cascading_ext.combiner.CombinerFunction.operate(CombinerFunction.java:130)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
    ... 11 more

tdunning said that I should use TDigest instead of QDigest, but cascading_ext depends on a stream_lib version that does not include TDigest. Any idea how I can use TDigest? I updated the stream_lib dependency to the latest version, which includes TDigest, but apparently cascading_ext has no ExactAggregator that supports TDigest (QDigest uses QuantileExactAggregator). What should I do?
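For context on the exception: the upper bound in the message, 4611686018427387903, equals `Long.MAX_VALUE / 2`, and the rejected value, 9223372036854775807, is `Long.MAX_VALUE` itself. The snippet below is only a sketch that re-creates the range check from the numbers in the message; the class and method names are made up for illustration and it is not QDigest's actual source.

```java
class QDigestRangeCheck {
  // Upper bound quoted in the exception message (assumed to be QDigest's limit).
  static final long REPORTED_MAX = 4611686018427387903L;

  /** True if the value lies in the range the exception message says is accepted. */
  static boolean inReportedRange(long value) {
    return value >= 0 && value <= REPORTED_MAX;
  }

  public static void main(String[] args) {
    // The reported bound is exactly half of the largest long (integer division).
    System.out.println(REPORTED_MAX == Long.MAX_VALUE / 2); // prints "true"
    // The value from the stack trace is Long.MAX_VALUE, which is out of range.
    System.out.println(inReportedRange(Long.MAX_VALUE));    // prints "false"
  }
}
```

So any input column that can contain `Long.MAX_VALUE` (or anything above half of it) will trigger this error, regardless of how many rows are involved.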

ahmadpriatama, Apr 22 '15 04:04

The fastest way to get up and running using TDigest is going to be implementing your own ExactAggregator - you can use QuantileExactAggregator as a guide, and I don't think you'll have too much trouble with it. Once you have the Aggregator, you can pass it to a Combiner the same way as QuantileExactAggregator and you should get the TDigest object you want at the end. If you have any specific issues doing that let us know and we can help.
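A rough, self-contained sketch of the shape such an aggregator might take is below. Everything in it is a stand-in: `ListDigest` is a hypothetical placeholder where a real job would hold a TDigest, and `DigestAggregator` only imitates the initialize / partial-aggregate / final-merge pattern. Only the name `partialAggregate` comes from the stack trace above; the other method names and all signatures are assumptions, so check QuantileExactAggregator for the real interface, which works on Cascading tuples rather than plain doubles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DigestAggregatorSketch {

  /** Stand-in for a mergeable quantile sketch; a real job would use TDigest here. */
  static final class ListDigest {
    private final List<Double> values = new ArrayList<>();

    void add(double value) { values.add(value); }

    void merge(ListDigest other) { values.addAll(other.values); }

    /** Nearest-rank quantile over everything seen so far; q is expected in [0, 1]. */
    double quantile(double q) {
      if (values.isEmpty()) throw new IllegalStateException("no values added");
      List<Double> sorted = new ArrayList<>(values);
      Collections.sort(sorted);
      int index = (int) Math.ceil(q * sorted.size()) - 1;
      return sorted.get(Math.max(0, Math.min(sorted.size() - 1, index)));
    }
  }

  /** Hypothetical aggregator: the method names are guesses at the real interface. */
  static final class DigestAggregator {
    /** Start each group with an empty digest. */
    ListDigest initialize() { return new ListDigest(); }

    /** Map side: fold one input value into the running digest. */
    ListDigest partialAggregate(ListDigest aggregate, double nextValue) {
      aggregate.add(nextValue);
      return aggregate;
    }

    /** Reduce side: merge a partial digest into the final one. */
    ListDigest finalAggregate(ListDigest aggregate, ListDigest partial) {
      aggregate.merge(partial);
      return aggregate;
    }
  }

  /** Simulates per-partition partial aggregation followed by a final merge. */
  static double medianOf(double[][] partitions) {
    DigestAggregator aggregator = new DigestAggregator();
    ListDigest total = aggregator.initialize();
    for (double[] partition : partitions) {
      ListDigest partial = aggregator.initialize();
      for (double value : partition) {
        partial = aggregator.partialAggregate(partial, value);
      }
      total = aggregator.finalAggregate(total, partial);
    }
    return total.quantile(0.5);
  }
}
```

For example, `DigestAggregatorSketch.medianOf(new double[][] {{1, 2}, {3, 4, 5}})` returns `3.0`. Because the stand-in works on doubles, a value like `Long.MAX_VALUE` passes through without the range restriction that QDigest enforces.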

TDigest seems pretty interesting - my guess is that @matthagy will want to have a built-in aggregator for it at some point. I think we were blocked internally on upgrading our version of stream_lib here, but maybe we can take a second look at that.

pwestling, Apr 22 '15 16:04

Yeah, we didn't upgrade the stream lib version because we internally have a lot of long-term persisted structs, and it's unclear whether some of the changes between 2.4 and master have broken backward compatibility of serialization (we would need to do more careful testing).

But that shouldn't block you from using the newest version of stream lib with cascading_ext in your own project and implementing a new ExactAggregator like Porter mentioned. If you do end up making one, we'd be happy to merge it in here once we manage to upgrade.

bpodgursky, Apr 22 '15 16:04