OnlineStats.jl icon indicating copy to clipboard operation
OnlineStats.jl copied to clipboard

Probabilistic Data Structures

Open joshday opened this issue 7 years ago • 11 comments
trafficstars

  • [x] HyperLogLog
  • [x] HyperLogLog ++
  • [x] Count Min Sketch
  • [ ] Bloom Filter
  • [ ] MinHash

joshday avatar Nov 22 '17 20:11 joshday

Interested in a t-digest as well?

https://www.sciencedirect.com/science/article/pii/S2665963820300403#:~:text=The%20t%2Ddigest%20is%20an,of%20data%20with%20arbitrary%20distribution.

tdunning avatar May 23 '22 23:05 tdunning

Sure! That looks very interesting. I'll give the paper a closer look, but I'm not sure if I'll have the bandwidth to work on implementing it for a while.

joshday avatar May 24 '22 01:05 joshday

Let me see if I can get somewhere quickly from my Java version.

tdunning avatar May 24 '22 01:05 tdunning

So I have a working version for the cdf function. The quantile function isn't there yet.

It needs more tests. And it needs to conform to your API expectations. And I would like to add in a logHistogram implementation as well.

How did you imagine this should work? Right now I have a fit!(MergingDigest, Vector{<:Number}) method (and a non vector parallel) that adds data to the digest and a cdf(MergingDigest, value::Number) call to compute the empirical CDF of a particular value. I plan to add a quantile(MergingDigest, q::Number) version.

I currently have a length(MergingDigest) function, but that seems to be inconvenient on balance because it force a user to wrap a digest with Ref whenever broadcasting is used. I may retreat to a samples(MergingDigest) method or just document the idiom length(m.sketch).

Is this what you are expecting?

tdunning avatar May 27 '22 14:05 tdunning

How did you imagine this should work? Right now I have a fit!(MergingDigest, Vector{<:Number}) method (and a non vector parallel) that adds data to the digest and a cdf(MergingDigest, value::Number) call to compute the empirical CDF of a particular value. I plan to add a quantile(MergingDigest, q::Number) version.

Check out the OnlineStat interface in the README here: https://github.com/joshday/OnlineStatsBase.jl. For quantile, just make sure you are extending the Statistics.quantile method. For cdf, we can't export it in order to avoid a conflict with Distributions.cdf, but we can point out that it exists as OnlineStats.cdf in the docs.

I currently have a length(MergingDigest) function, but that seems to be inconvenient on balance because it force a user to wrap a digest with Ref whenever broadcasting is used. I may retreat to a samples(MergingDigest) method or just document the idiom length(m.sketch).

As long as MergingDigest is a subtype of OnlineStat, you shouldn't need to worry about broadcasting/Refs, since this lives in OnlineStatsBase.jl:

Broadcast.broadcastable(o::OnlineStat) = Ref(o)

joshday avatar May 27 '22 15:05 joshday

That's very helpful. In the short-term, even before I step up to integrating I will add the broadcastable snippet directly.

Your comment about cdf being a collision makes me wonder if I shouldn't make a MergingDigest be a Distribution somehow. There is no real reason I couldn't support sampling from the empirical distribution that way.

tdunning avatar May 27 '22 16:05 tdunning

In fact, anything that supports a quantile operation can be sampled.

tdunning avatar May 27 '22 16:05 tdunning

In fact, anything that supports a quantile operation can be sampled.

A package that implements this would be pretty cool. e.g.

x = QuantileSampler(thing_with_quantile_method)

rand(x)

But for all I know this already exists in Distributions.jl or elsewhere.

Edit: Maybe never mind. You're thinking of this?:

quantile(thing, rand())

joshday avatar May 27 '22 17:05 joshday

See https://github.com/tdunning/TDigest/ for a beginning.

I think this does everything needed, but it isn't tested hard enough to make me super confident of that.

tdunning avatar May 29 '22 06:05 tdunning

Comments and criticisms are VERY welcome

tdunning avatar May 29 '22 06:05 tdunning

There are some issues still on the TDigest.jl largely around sort stability. Secondary problem is lack of mind share on my part due to day job.

Not surprisingly, the Julia implementation is simpler than the Java version.

tdunning avatar Jul 28 '22 15:07 tdunning