OnlineStats.jl
OnlineStats.jl copied to clipboard
Probabilistic Data Structures
- [x] HyperLogLog
- [x] HyperLogLog ++
- [x] Count Min Sketch
- [ ] Bloom Filter
- [ ] MinHash
Interested in a t-digest as well?
https://www.sciencedirect.com/science/article/pii/S2665963820300403#:~:text=The%20t%2Ddigest%20is%20an,of%20data%20with%20arbitrary%20distribution.
Sure! That looks very interesting. I'll give the paper a closer look, but I'm not sure if I'll have the bandwidth to work on implementing it for a while.
Let me see if I can get somewhere quickly from my Java version.
So I have a working version for the cdf function. The quantile function isn't there yet.
It needs more tests. And it needs to conform to your API expectations. And I would like to add in a logHistogram implementation as well.
How did you imagine this should work? Right now I have a fit!(MergingDigest, Vector{<:Number}) method (and a non vector parallel) that adds data to the digest and a cdf(MergingDigest, value::Number) call to compute the empirical CDF of a particular value. I plan to add a quantile(MergingDigest, q::Number) version.
I currently have a length(MergingDigest) function, but that seems to be inconvenient on balance because it force a user to wrap a digest with Ref whenever broadcasting is used. I may retreat to a samples(MergingDigest) method or just document the idiom length(m.sketch).
Is this what you are expecting?
How did you imagine this should work? Right now I have a fit!(MergingDigest, Vector{<:Number}) method (and a non vector parallel) that adds data to the digest and a cdf(MergingDigest, value::Number) call to compute the empirical CDF of a particular value. I plan to add a quantile(MergingDigest, q::Number) version.
Check out the OnlineStat interface in the README here: https://github.com/joshday/OnlineStatsBase.jl. For quantile, just make sure you are extending the Statistics.quantile method. For cdf, we can't export it in order to avoid a conflict with Distributions.cdf, but we can point out that it exists as OnlineStats.cdf in the docs.
I currently have a length(MergingDigest) function, but that seems to be inconvenient on balance because it force a user to wrap a digest with Ref whenever broadcasting is used. I may retreat to a samples(MergingDigest) method or just document the idiom length(m.sketch).
As long as MergingDigest is a subtype of OnlineStat, you shouldn't need to worry about broadcasting/Refs, since this lives in OnlineStatsBase.jl:
Broadcast.broadcastable(o::OnlineStat) = Ref(o)
That's very helpful. In the short-term, even before I step up to integrating I will add the broadcastable snippet directly.
Your comment about cdf being a collision makes me wonder if I shouldn't make a MergingDigest be a Distribution somehow. There is no real reason I couldn't support sampling from the empirical distribution that way.
In fact, anything that supports a quantile operation can be sampled.
In fact, anything that supports a quantile operation can be sampled.
A package that implements this would be pretty cool. e.g.
x = QuantileSampler(thing_with_quantile_method)
rand(x)
But for all I know this already exists in Distributions.jl or elsewhere.
Edit: Maybe never mind. You're thinking of this?:
quantile(thing, rand())
See https://github.com/tdunning/TDigest/ for a beginning.
I think this does everything needed, but it isn't tested hard enough to make me super confident of that.
Comments and criticisms are VERY welcome
There are some issues still on the TDigest.jl largely around sort stability. Secondary problem is lack of mind share on my part due to day job.
Not surprisingly, the Julia implementation is simpler than the Java version.