stream-lib
Threadsafing HLL
Made all of the mutating operations of RegisterSet atomic so that HLLs can safely be updated by multiple threads. Tests pass on multiple runs, and the performance impact on the HLL appears to be negligible in the single-threaded case.
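The core of the change is a compare-and-set retry loop on the register words. A minimal sketch of the idea (not the actual patch; it is simplified to one register per array slot, whereas the real RegisterSet packs several registers into each int word, but the CAS pattern is the same):

import java.util.concurrent.atomic.AtomicIntegerArray;

// Simplified sketch of the CAS-based register update.
class AtomicRegisters {
    private final AtomicIntegerArray registers;

    AtomicRegisters(int count) {
        this.registers = new AtomicIntegerArray(count);
    }

    // Atomically raise register 'pos' to 'value' if 'value' is greater.
    boolean updateIfGreater(int pos, int value) {
        while (true) {
            int current = registers.get(pos);   // fresh read on every attempt
            if (current >= value) {
                return false;                   // nothing to update
            }
            if (registers.compareAndSet(pos, current, value)) {
                return true;                    // CAS won the race
            }
            // CAS lost a race: loop, re-read, and retry against the new value
        }
    }

    int get(int pos) {
        return registers.get(pos);
    }
}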
Thanks for the patch. We had a similar pull request in the past, https://github.com/clearspring/stream-lib/pull/17. After some discussion we agreed that the best option is to maintain a separate HLL per thread to avoid the overhead of the atomic objects. What are your thoughts on this tradeoff?
I agree with Matt. I don't see the point of introducing this kind of complexity since estimators can be merged efficiently.
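For comparison, the per-thread alternative being suggested looks roughly like this (a sketch with hypothetical class and method names, using the existing stream-lib API): each worker owns its own HyperLogLog and the results are merged once at the end.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import com.clearspring.analytics.stream.cardinality.ICardinality;

final class PerThreadHllExample {
    // One estimator per worker thread, merged once at the end (lossless union).
    static long countDistinct(List<List<Object>> partitions, int log2m) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        try {
            List<Future<HyperLogLog>> futures = new ArrayList<>();
            for (List<Object> partition : partitions) {
                futures.add(pool.submit(() -> {
                    HyperLogLog local = new HyperLogLog(log2m);  // private to this task, no sharing
                    for (Object item : partition) {
                        local.offer(item);
                    }
                    return local;
                }));
            }
            ICardinality merged = new HyperLogLog(log2m);
            for (Future<HyperLogLog> f : futures) {
                merged = merged.merge(f.get());                  // union of the per-thread HLLs
            }
            return merged.cardinality();
        } finally {
            pool.shutdown();
        }
    }
}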
I just ran my benchmark on your branch (single threaded). Most of the operations are not slowed down but merge is ~3.5x slower than in master.
Have you benchmarked the multi-threaded case? I am quite sure that separate HLLs are more efficient due to contention, CPU caches, and false sharing. I believe we could apply the same pattern as LongAdder if required.
Agree - remove complexity by calculating separate HLLs on separate threads and relying on the native lossless union.
While I'd prefer to stick to a single writer pattern, we simply can't avoid it in some of our use cases. My choices are either to threadproof things here or add a bunch of complexity elsewhere to maintain thread locals and then pull them back together at the appropriate instances.
In any event, I'd be happy to put in the work to speed up the merge scenario, I think that can be improved back up to where it was. But I'd prefer to only spend that time if the patch is going to be considered. Otherwise we'll simply maintain our own fork.
Can you just wrap it so that the base implementation doesn't slow down?
Why don't we make RegisterSet an interface and then provide a new constructor that allows for selecting between safe and unsafe versions of the RegisterSet?
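A rough sketch of the shape this could take (hypothetical names, not the current stream-lib signatures): the register operations move behind an interface with a plain and a CAS-based implementation, and a new constructor or builder flag selects between them.

// Hypothetical interface extracted from RegisterSet.
public interface IRegisterSet {
    int get(int position);
    void set(int position, int value);
    boolean updateIfGreater(int position, int value);
    int[] bits();   // backing words, used by merge and serialization
}

// e.g. new HyperLogLog(log2m, /* threadSafe = */ true) would wire in the
// CAS-based implementation, while the default keeps the existing int[] one.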
Before doing that, it would be good to benchmark the proposed implementation against plain old locks for a realistic multi-threaded use case. Wrapping with locks provides greater flexibility (you can wrap only the mutating operations if you don't care about stale reads, you can force happens-before to control the staleness of reads, you can have mutual exclusion if that is what you need, etc.) and is much simpler (it is easy to understand the semantics of a complex operation like merge). I don't believe that using locks will be much slower on offer with high contention.
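For illustration, the kind of wrapper meant here, as a sketch (not part of stream-lib; class name and the choice of guarded methods are assumptions):

import com.clearspring.analytics.stream.cardinality.HyperLogLog;

// Guard the mutating path with a monitor and decide per method how fresh reads need to be.
final class SynchronizedHll {
    private final HyperLogLog delegate;

    SynchronizedHll(int log2m) {
        this.delegate = new HyperLogLog(log2m);
    }

    boolean offer(Object o) {
        synchronized (delegate) {        // mutual exclusion between writers
            return delegate.offer(o);
        }
    }

    long cardinality() {
        synchronized (delegate) {        // lock here too if stale reads are not acceptable
            return delegate.cardinality();
        }
    }
}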
In addition, it would be good to keep the RegisterSet implementation simple and orthogonal to thread safety. Indeed, HotSpot is not yet able to produce SIMD code, and a native implementation of RegisterSet leads to 10-20x faster merges. It would be good to keep the door open for such an optimization.
I will try to run a simple benchmark to get some numbers.
Thanks for running the benchmarks, and it sounds like Cliff is willing to address any performance issues that are uncovered. So I think if we can get a performance-neutral (or even positive) solution then this works for me.
SIMD RegisterSets would be great.
Here are the preliminary results. Please remember these are micro-benchmarks, and it is very easy for them to be wrong or meaningless. I tried to check the overhead in single-threaded mode and the scalability up to 4 threads (I don't have access to a bigger machine right now). I also implemented a very basic locking strategy.
I ran the multi-threaded benchmark on the master branch too, even though it is obviously broken there. It sets a kind of upper bound.
I am using JMH, a great micro- and macro-benchmark framework from the OpenJDK guys.
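The exact benchmark sources are not included in this thread, but for reference a minimal JMH harness for the contended offer case could look roughly like this (a sketch using current JMH annotations; the class name, log2m value, and random workload are illustrative assumptions):

import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

// A single HyperLogLog shared by all benchmark threads (Scope.Benchmark),
// run with different thread counts to reproduce the 1-4 thread numbers below.
@State(Scope.Benchmark)
public class HllOfferBenchmark {

    private HyperLogLog hll;

    @Setup
    public void setUp() {
        hll = new HyperLogLog(14);   // precision parameter chosen for illustration
    }

    @Benchmark
    public boolean offer() {
        return hll.offer(ThreadLocalRandom.current().nextLong());
    }
}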
Offer
Master:
Benchmark Thr Cnt Sec Mean Mean error Var Units
b.g.t.LockHLL.upstream 1 3 5 48358.123 2645.948 213217.459 ops/msec
b.g.t.LockHLL.upstream 2 3 5 94327.262 485.268 7171.734 ops/msec
b.g.t.LockHLL.upstream 3 3 5 137082.131 1572.241 75283.261 ops/msec
b.g.t.LockHLL.upstream 4 3 5 170813.806 18004.916 9872846.937 ops/msec
With Locks:
Benchmark Thr Cnt Sec Mean Mean error Var Units
b.g.t.LockHLL.synchonized 1 3 5 34995.571 3507.605 374698.117 ops/msec
b.g.t.LockHLL.synchonized 2 3 5 11022.723 1441.353 63270.469 ops/msec
b.g.t.LockHLL.synchonized 3 3 5 13060.744 1349.222 55440.481 ops/msec
b.g.t.LockHLL.synchonized 4 3 5 6762.264 2861.360 249347.562 ops/msec
Atomic Integer:
Benchmark Thr Cnt Sec Mean Mean error Var Units
b.g.t.LockHLL.atomic 1 3 5 44816.308 3039.758 281409.213 ops/msec
b.g.t.LockHLL.atomic 2 3 5 87836.623 335.387 3425.729 ops/msec
b.g.t.LockHLL.atomic 3 3 5 128128.033 485.808 7187.686 ops/msec
b.g.t.LockHLL.atomic 4 3 5 159221.834 2617.682 208686.362 ops/msec
As you can see:
- Locks suck and do not scale (game over, I discarded locks after this single test)
- Atomic Integer performance is good and scales well
Cardinality
Master:
Benchmark Thr Cnt Sec Mean Mean error Var Units
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty 1 3 5 49261.078 1059.919 34214.156 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty 2 3 5 95714.322 1349.342 55450.317 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty 3 3 5 139771.044 2171.358 143589.665 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty 4 3 5 182426.222 304.713 2827.756 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge 1 3 5 51575.261 37.447 42.707 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge 2 3 5 100071.983 124.791 474.270 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge 3 3 5 145895.356 1852.239 104485.031 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge 4 3 5 189956.789 281.272 2409.418 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium 1 3 5 51046.894 827.130 20835.670 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium 2 3 5 99627.911 512.902 8011.779 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium 3 3 5 145333.322 608.684 11283.511 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium 4 3 5 188097.172 9235.576 2597695.224 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall 1 3 5 50542.467 485.736 7185.559 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall 2 3 5 98418.889 597.898 10887.155 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall 3 3 5 143860.939 466.348 6623.390 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall 4 3 5 187057.500 151.191 696.163 ops/sec
Atomic:
Benchmark Thr Cnt Sec Mean Mean error Var Units
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty 1 3 5 52361.422 112.263 383.826 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty 2 3 5 101250.678 857.797 22409.327 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty 3 3 5 147446.422 3290.138 329676.765 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty 4 3 5 192813.361 320.052 3119.610 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge 1 3 5 52980.911 149.175 677.723 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge 2 3 5 103238.939 734.321 16422.211 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge 3 3 5 150828.600 119.941 438.120 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge 4 3 5 196309.850 445.951 6056.688 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium 1 3 5 47801.022 1094.554 36486.721 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium 2 3 5 92897.294 2136.041 138956.628 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium 3 3 5 135471.806 1755.278 93832.231 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium 4 3 5 176407.472 528.488 8506.093 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall 1 3 5 49781.144 200.425 1223.388 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall 2 3 5 97032.644 697.386 14811.769 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall 3 3 5 141611.350 91.775 256.514 ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall 4 3 5 184142.089 1593.172 77301.121 ops/sec
Performance is good and scales well.
Merge
Master:
Benchmark Thr Cnt Sec Mean Mean error Var Units
b.g.t.HLLMergeBenchmark.hllMerge 1 3 5 237.547 12.459 4.727 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge 2 3 5 459.301 15.642 7.451 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge 3 3 5 680.184 20.341 12.601 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge 4 3 5 854.049 5.401 0.888 ops/msec
Atomic:
Benchmark Thr Cnt Sec Mean Mean error Var Units
b.g.t.HLLMergeBenchmark.hllMerge 1 3 5 70.912 7.946 1.923 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge 2 3 5 138.540 3.018 0.277 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge 3 3 5 202.193 3.205 0.313 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge 4 3 5 236.031 116.845 415.796 ops/msec
Awful performance. Scalability seems correct. I did not have time to perform a clean investigation. According to hprof/flamegraphs, it seems that the merge function is not inlined by the JIT in the atomic version, but that does not explain the whole story. I will try to find time to check the assembler output and CPU cache hits/misses to get a better overview of the issue.
BTW: I have not reviewed the code in detail, but I also noticed there is a bug in the implementation leading to wrong results when the update is contended (the word is not reset to zero if the CAS fails).
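To make the hazard concrete, here is a simplified sketch of the correct retry pattern for a packed register word (hypothetical bit layout, not the code in this PR): the updated word has to be recomputed from a fresh read after every failed CAS instead of carrying over the value built on the first attempt.

import java.util.concurrent.atomic.AtomicInteger;

final class PackedWordUpdate {
    private final AtomicInteger word = new AtomicInteger();

    void updateIfGreater(int shift, int mask, int value) {
        while (true) {
            int current = word.get();                        // fresh read each attempt
            int existing = (current & mask) >>> shift;
            if (existing >= value) {
                return;                                      // register already large enough
            }
            int updated = (current & ~mask) | (value << shift);
            if (word.compareAndSet(current, updated)) {
                return;
            }
            // CAS failed: loop and rebuild 'updated' from the new current word
        }
    }
}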
Conclusion
- Overall scalability is good, and so is performance for offer and cardinality. There is a high penalty on merge; we have to check that with a macro-benchmark. I will run Cliff's branch on one of our jobs to see if the same behaviour is observed. If someone else could do the same, it would be good to cross-check our results.
- I'm a bit worried about the global semantics of such a change. For example, HLL++ must be thread-safe in sparse mode, and so must the operations implemented in HLL(++). A clear semantic must be defined, documented and ensured for each operation.
- It could be confusing to have a single thread-safe estimator while the others are not.
- We should check that it will not prevent us from writing a native high-performance implementation of merge (read C++/JNI), since I'm not aware of any foreseeable change regarding SIMD vs HotSpot.
- We have to be sure we are not adding too much complexity for a very specific use case; perhaps we can find a better design that allows easier customization.
First, very nice work on the benchmarks.
Regarding the cost of merge, I don't see a huge cost since the merge cost will be amortized across many, many additions.
Hey, great work on the benches. Glad to see it performing well. I agree the HLL++ will need tests in order to get threadsafed. I would also be game to take a crack at threadsafing countmin as well. As for the merge performance, I'd be willing to bet it's due to the elision of System.arraycopy from bits(). Will think on how to improve this.
Also, would you mind sharing your benchmarking code? We've got test machines with 24-32 cores; it would be nice to produce a big spread of measurements across larger numbers of threads, as well as give the JVM a warmup period for the JIT to kick in.
Regarding the cost of merge, I don't see a huge cost since the merge cost will be amortized across many, many additions.
It depends on your use case. HLL(++) is often used in web analytics. You precompute things like unique visitors according to a bazillion criteria (time range, hierarchy, etc.) and store them on disk. This phase can be done in parallel on many machines, and scalability is easy. Then you merge the estimators to create the final report(s). That can be done on the fly or in a batch, so you want fast response times or good throughput. The performance goal is to be IO bound, not CPU bound.
On my machine I achieve 300MB/s with the master branch but "only" 100MB/s with Cliff's current code (LOG2M = 14). So it's a regression if you have fast disks or a fast network and want to keep your code simple and single-threaded. It is not a show-stopper, but we should try to avoid this slowdown if possible since it impacts real workloads.
// Benchmark loop: read fixed-size serialized HLLs from a file and merge them.
// (buf, len, file and byteCount come from the benchmark setup; len is the
// serialized size of one estimator for the chosen LOG2M.)
ICardinality agg = new HyperLogLog(LOG2M);
try (FileInputStream fis = new FileInputStream(file)) {
    while (true) {
        int s = fis.read(buf, 0, len);
        if (s == -1) break;
        byteCount += s;
        agg = agg.merge(HyperLogLog.Builder.build(buf));
    }
}
As for the merge performance I'd be willing to bet it's due to the eliding of System.arraycopy from bits(). Will think on how to improve this
I did not have the time to investigate further today but I don't think this is the issue. In my private repo I have added a mutable merge operation implemented like this:
public void merge(HyperLogLog other) throws HyperLogLogMergeException {
    if (other.sizeof() != sizeof()) {
        throw new HyperLogLogMergeException("Cannot merge estimators of different sizes");
    }
    registerSet.merge(other.registerSet);
}
As you can see, it bypasses the getBits call. I observe exactly the same performance as with the immutable merge method, both on master and on your branch.
BTW: Adding a mutable merge method would be great anyway.
Also would you mind sharing your benchmarking code?
I will upload it ASAP, most likely next week. It is rudimentary but could be a starting point for monitoring and improving stream-lib performance.
Ping on benchmarking code.
Keep in mind that the paint is wet. Comments, contributions & criticism are welcome. Let me know if you need help.
Any update on this one? We are considering merging another branch from #38 into master soon. That branch improves the serialization and memory footprint of the HLL Plus considerably but will add more thread safety concerns. I'm happy to wait to merge the results from this issue into the other branch in order to avoid breaking thread safety if you are going to be able to push your changes soon.
Any update on this pull request?
Thanks.