Threadsafing HLL

Open • moonpolysoft opened this issue on Jul 02 '13 • 17 comments

Replaced all of the mutating operations of RegisterSet with atomic ones so that HLLs can safely be updated by multiple threads. Tests pass across multiple runs, and the performance impact on the HLL appears to be negligible in the single-threaded case.

moonpolysoft avatar Jul 02 '13 18:07 moonpolysoft

Thanks for the patch. We had a similar pull request in the past, https://github.com/clearspring/stream-lib/pull/17. After some discussion we agreed that the best option is to maintain a separate HLL per thread to avoid the overhead of the atomic objects. What are your thoughts on this tradeoff?
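
For reference, a minimal sketch of that per-thread pattern (not code from this PR; it assumes the public stream-lib HyperLogLog / ICardinality API, and the thread count, log2m value and item keys are made up for illustration):

    import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
    import com.clearspring.analytics.stream.cardinality.HyperLogLog;
    import com.clearspring.analytics.stream.cardinality.ICardinality;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class PerThreadHllExample {
        public static void main(String[] args)
                throws InterruptedException, ExecutionException, CardinalityMergeException {
            final int log2m = 14;                       // illustrative precision
            final int threads = 4;
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<ICardinality>> partials = new ArrayList<>();

            for (int t = 0; t < threads; t++) {
                final int shard = t;
                // Each task owns its estimator, so offers need no synchronization at all.
                partials.add(pool.submit(() -> {
                    ICardinality local = new HyperLogLog(log2m);
                    for (int i = 0; i < 1_000_000; i++) {
                        local.offer("item-" + shard + "-" + i);
                    }
                    return local;
                }));
            }
            pool.shutdown();

            // Lossless union of the per-thread estimators once the writers are done.
            ICardinality merged = new HyperLogLog(log2m);
            for (Future<ICardinality> f : partials) {
                merged = merged.merge(f.get());
            }
            System.out.println("Estimated cardinality: " + merged.cardinality());
        }
    }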

abramsm avatar Jul 02 '13 18:07 abramsm

I agree with Matt. I don't see the point in introducing this kind of complexity, since estimators can be merged efficiently.

I just ran my benchmark on your branch (single-threaded). Most of the operations are not slowed down, but merge is ~3.5x slower than on master.

Have you benchmarked the multi-threaded case? I am quite sure that separate HLLs are more efficient due to contention, CPU caches and false sharing. I believe we could apply the same pattern as in LongAdder if required.

cykl avatar Jul 02 '13 18:07 cykl

Agree - remove complexity by calculating separate HLLs on separate threads and rely on the native lossless union.

terrancesnyder avatar Jul 02 '13 21:07 terrancesnyder

While I'd prefer to stick to a single-writer pattern, we simply can't avoid shared writers in some of our use cases. My choices are either to thread-proof things here or to add a bunch of complexity elsewhere to maintain thread locals and then pull them back together at the appropriate points.

In any event, I'd be happy to put in the work to speed up the merge scenario; I think it can be improved back to where it was. But I'd prefer to spend that time only if the patch is going to be considered. Otherwise we'll simply maintain our own fork.

moonpolysoft avatar Jul 02 '13 22:07 moonpolysoft

Can you just wrap it so that the base implementation doesn't slow down?

tdunning avatar Jul 03 '13 00:07 tdunning

Why don't we make RegisterSet an interface and then provide a new constructor that allows for selecting between safe and unsafe versions of the RegisterSet?
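
Roughly what that could look like, as a sketch only (the names and the flat one-int-per-register layout below are hypothetical simplifications; the real RegisterSet packs several registers into each word):

    import java.util.concurrent.atomic.AtomicIntegerArray;

    /** Simplified register storage: one register per slot, no bit packing. */
    interface Registers {
        void updateIfGreater(int position, int value);   // keep max(current, value)
        int get(int position);
    }

    /** "Unsafe" single-writer variant: plain array, zero synchronization overhead. */
    final class UnsafeRegisters implements Registers {
        private final int[] values;
        UnsafeRegisters(int count) { values = new int[count]; }
        public void updateIfGreater(int position, int value) {
            if (value > values[position]) values[position] = value;
        }
        public int get(int position) { return values[position]; }
    }

    /** "Safe" variant backed by an AtomicIntegerArray (atomic read-modify-write). */
    final class AtomicRegisters implements Registers {
        private final AtomicIntegerArray values;
        AtomicRegisters(int count) { values = new AtomicIntegerArray(count); }
        public void updateIfGreater(int position, int value) {
            values.accumulateAndGet(position, value, Math::max);
        }
        public int get(int position) { return values.get(position); }
    }

    /** The constructor flag mentioned above would simply pick one of the two. */
    final class RegistersFactory {
        static Registers create(int count, boolean threadSafe) {
            return threadSafe ? new AtomicRegisters(count) : new UnsafeRegisters(count);
        }
    }

The estimator constructor would take a boolean (or an injected implementation) and everything else would stay unchanged.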

abramsm avatar Jul 03 '13 15:07 abramsm

Before doing that, it would be good to benchmark the proposed implementation against plain old locks for a realistic multi-threaded use case. Wrapping with locks provides greater flexibility (you can wrap only the mutating operations if you don't care about stale reads, you can force happens-before edges to control the staleness of reads, you can have mutual exclusion if that is what you need, etc.) and is much simpler (it is easy to understand the semantics of a complex operation like merge). I don't believe that using locks will be much slower on offer under high contention.
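
For a sense of scale, the lock-wrapping option can live entirely outside RegisterSet; a minimal sketch, assuming only the public stream-lib API (the LockedHll class name is made up):

    import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
    import com.clearspring.analytics.stream.cardinality.HyperLogLog;
    import com.clearspring.analytics.stream.cardinality.ICardinality;

    /** Hypothetical wrapper: mutual exclusion through a single intrinsic lock. */
    public class LockedHll {
        private final ICardinality delegate;

        public LockedHll(int log2m) {
            this.delegate = new HyperLogLog(log2m);
        }

        public synchronized boolean offer(Object o) {
            return delegate.offer(o);
        }

        public synchronized long cardinality() {
            return delegate.cardinality();
        }

        public synchronized ICardinality mergeWith(ICardinality other) throws CardinalityMergeException {
            return delegate.merge(other);
        }
    }

If stale reads are acceptable, the synchronized keyword could be dropped from cardinality(), which is the kind of flexibility mentioned above.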

In addition, it would be good to keep the RegisterSet implementation simple and orthogonal to thread safety. Indeed, HotSpot is not yet able to produce SIMD code, and a native implementation of RegisterSet leads to 10-20x faster merges. It would be good to keep the door open for such an optimization.

I will try to run a simple benchmark to get some numbers.

cykl avatar Jul 03 '13 15:07 cykl

Thanks for running the benchmarks, and it sounds like Cliff is willing to address any performance issues that are uncovered. So I think if we can get a performance-neutral (or even positive) solution then this works for me.

SIMD RegisterSets would be great.

abramsm avatar Jul 03 '13 15:07 abramsm

Here are the preliminary results. Please remember these are micro-benchmarks and it is very easy to get them wrong or meaningless. I tried to check the overhead in single-threaded mode and the scalability up to 4 threads (I don't have access to a bigger machine right now). I also implemented a very basic locking strategy.

I ran the multi-threaded benchmark on the master branch too, even though it is obviously broken there. It sets a kind of upper bound.

I am using JMH; it is a great micro & macro benchmark framework from the OpenJDK guys.
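
For readers who have not used JMH, a throughput benchmark has roughly this shape (a hypothetical skeleton, not cykl's actual harness; recent JMH releases use @Benchmark where early ones used @GenerateMicroBenchmark, and the log2m value is arbitrary):

    import com.clearspring.analytics.stream.cardinality.HyperLogLog;
    import com.clearspring.analytics.stream.cardinality.ICardinality;

    import org.openjdk.jmh.annotations.*;

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    @State(Scope.Benchmark)                  // one estimator shared by all benchmark threads
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public class OfferBenchmark {

        ICardinality hll;

        @Setup(Level.Iteration)
        public void setUp() {
            hll = new HyperLogLog(14);       // fresh estimator per iteration
        }

        @Benchmark
        public boolean offer() {
            // Returning the result keeps the JIT from eliminating the call.
            return hll.offer(ThreadLocalRandom.current().nextLong());
        }
    }

The thread counts in the "Thr" column are set when launching the JMH runner.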

Offer

Master:

Benchmark                Thr    Cnt  Sec         Mean   Mean error          Var    Units
b.g.t.LockHLL.upstream     1      3    5    48358.123     2645.948   213217.459 ops/msec
b.g.t.LockHLL.upstream     2      3    5    94327.262      485.268     7171.734 ops/msec
b.g.t.LockHLL.upstream     3      3    5   137082.131     1572.241    75283.261 ops/msec
b.g.t.LockHLL.upstream     4      3    5   170813.806    18004.916  9872846.937 ops/msec

With Locks:

Benchmark                   Thr    Cnt  Sec         Mean   Mean error          Var    Units
b.g.t.LockHLL.synchonized     1      3    5    34995.571     3507.605   374698.117 ops/msec
b.g.t.LockHLL.synchonized     2      3    5    11022.723     1441.353    63270.469 ops/msec
b.g.t.LockHLL.synchonized     3      3    5    13060.744     1349.222    55440.481 ops/msec
b.g.t.LockHLL.synchonized     4      3    5     6762.264     2861.360   249347.562 ops/msec

Atomic Integer:

Benchmark              Thr    Cnt  Sec         Mean   Mean error          Var    Units
b.g.t.LockHLL.atomic     1      3    5    44816.308     3039.758   281409.213 ops/msec
b.g.t.LockHLL.atomic     2      3    5    87836.623      335.387     3425.729 ops/msec
b.g.t.LockHLL.atomic     3      3    5   128128.033      485.808     7187.686 ops/msec
b.g.t.LockHLL.atomic     4      3    5   159221.834     2617.682   208686.362 ops/msec

As you can see:

  • Locks suck and do not scale (game over, I discarded locks after this single test)
  • Atomic Integer performance is good and scales well

Cardinality

Master:

Benchmark                                            Thr    Cnt  Sec         Mean   Mean error          Var    Units
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty      1      3    5    49261.078     1059.919    34214.156  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty      2      3    5    95714.322     1349.342    55450.317  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty      3      3    5   139771.044     2171.358   143589.665  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty      4      3    5   182426.222      304.713     2827.756  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge      1      3    5    51575.261       37.447       42.707  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge      2      3    5   100071.983      124.791      474.270  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge      3      3    5   145895.356     1852.239   104485.031  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge      4      3    5   189956.789      281.272     2409.418  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium     1      3    5    51046.894      827.130    20835.670  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium     2      3    5    99627.911      512.902     8011.779  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium     3      3    5   145333.322      608.684    11283.511  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium     4      3    5   188097.172     9235.576  2597695.224  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall      1      3    5    50542.467      485.736     7185.559  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall      2      3    5    98418.889      597.898    10887.155  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall      3      3    5   143860.939      466.348     6623.390  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall      4      3    5   187057.500      151.191      696.163  ops/sec

Atomic:

Benchmark                                            Thr    Cnt  Sec         Mean   Mean error          Var    Units
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty      1      3    5    52361.422      112.263      383.826  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty      2      3    5   101250.678      857.797    22409.327  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty      3      3    5   147446.422     3290.138   329676.765  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityEmpty      4      3    5   192813.361      320.052     3119.610  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge      1      3    5    52980.911      149.175      677.723  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge      2      3    5   103238.939      734.321    16422.211  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge      3      3    5   150828.600      119.941      438.120  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityLarge      4      3    5   196309.850      445.951     6056.688  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium     1      3    5    47801.022     1094.554    36486.721  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium     2      3    5    92897.294     2136.041   138956.628  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium     3      3    5   135471.806     1755.278    93832.231  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalityMedium     4      3    5   176407.472      528.488     8506.093  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall      1      3    5    49781.144      200.425     1223.388  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall      2      3    5    97032.644      697.386    14811.769  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall      3      3    5   141611.350       91.775      256.514  ops/sec
b.g.t.HLLCardinalityBenchmark.hllCardinalitySmall      4      3    5   184142.089     1593.172    77301.121  ops/sec

Performance is good and scales well.

Merge

Master:

Benchmark                          Thr    Cnt  Sec         Mean   Mean error          Var    Units
b.g.t.HLLMergeBenchmark.hllMerge     1      3    5      237.547       12.459        4.727 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge     2      3    5      459.301       15.642        7.451 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge     3      3    5      680.184       20.341       12.601 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge     4      3    5      854.049        5.401        0.888 ops/msec

Atomic:

Benchmark                          Thr    Cnt  Sec         Mean   Mean error          Var    Units
b.g.t.HLLMergeBenchmark.hllMerge     1      3    5       70.912        7.946        1.923 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge     2      3    5      138.540        3.018        0.277 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge     3      3    5      202.193        3.205        0.313 ops/msec
b.g.t.HLLMergeBenchmark.hllMerge     4      3    5      236.031      116.845      415.796 ops/msec

Awful performance, though scalability seems correct. I did not have time to perform a clean investigation. According to hprof/flamegraphs it seems that the merge function is not inlined by the JIT with the atomic version, but that does not explain the whole story. I will try to find time to check the assembly output and CPU cache hits/misses to get a better overview of the issue.

BTW: I did not review the code, but I also noticed there was a bug in the implementation leading to wrong results when the update is contended (the word is not reset to zero if the CAS fails).
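
For reference, a correct contended update has to rebuild the word from a fresh read on every CAS retry, roughly like this sketch (the class name and the bit-packing constants are assumptions for illustration, not RegisterSet's actual layout):

    import java.util.concurrent.atomic.AtomicIntegerArray;

    class PackedAtomicRegisters {
        private static final int BITS = 5;                // bits per register (assumed)
        private static final int PER_WORD = 32 / BITS;    // registers packed into one int
        private static final int MASK = (1 << BITS) - 1;

        private final AtomicIntegerArray words;

        PackedAtomicRegisters(int registerCount) {
            words = new AtomicIntegerArray((registerCount + PER_WORD - 1) / PER_WORD);
        }

        /** Set register to max(current, value). */
        void updateIfGreater(int position, int value) {
            int wordIndex = position / PER_WORD;
            int shift = (position % PER_WORD) * BITS;
            while (true) {
                int oldWord = words.get(wordIndex);               // fresh read on every attempt
                int current = (oldWord >>> shift) & MASK;
                if (value <= current) {
                    return;                                       // another thread already won
                }
                int newWord = (oldWord & ~(MASK << shift)) | (value << shift);
                if (words.compareAndSet(wordIndex, oldWord, newWord)) {
                    return;
                }
                // CAS failed: loop and rebuild newWord from a re-read word,
                // rather than reusing the stale local copy.
            }
        }

        int get(int position) {
            int word = words.get(position / PER_WORD);
            return (word >>> (position % PER_WORD) * BITS) & MASK;
        }
    }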

Conclusion

  • Overall scalability is good, and so is performance, for offer and cardinality. There is a high penalty on merge; we have to check that with a macro-benchmark. I will run Cliff's branch on one of our jobs to see if the same behaviour is observed. If someone else could do the same, it would be good to cross-check our results.
  • I'm a bit worried about the global semantics of such a change. For example, HLL++ must be thread-safe in sparse mode, and so must the operations implemented in HLL(++). Clear semantics must be defined, documented and ensured for each operation.
  • It could be confusing to have a single thread-safe estimator while the others are not.
  • We should check that it will not prevent us from writing a native high-performance implementation of merge (read C++/JNI), since I'm not aware of any foreseeable change regarding SIMD vs HotSpot.
  • We have to be sure we are not adding too much complexity for a very specific use case; perhaps we can find a better design to allow easier customization.

cykl avatar Jul 03 '13 20:07 cykl

First, very nice work on the benchmarks.

Regarding the cost of merge, I don't see a huge cost since the merge cost will be amortized across many, many additions.

tdunning avatar Jul 03 '13 20:07 tdunning

Hey, great work on the benches. Glad to see it performing well. I agree that HLL++ will need tests to get threadsafed. I would also be game to take a crack at threadsafing countmin as well. As for the merge performance, I'd be willing to bet it's due to the eliding of System.arraycopy from bits(). Will think on how to improve this.

moonpolysoft avatar Jul 03 '13 22:07 moonpolysoft

Also, would you mind sharing your benchmarking code? We've got test machines with 24-32 cores; it would be nice to produce a big old spread of measurements across larger numbers of threads, as well as give the JVM a warmup period for the JIT to kick in.

moonpolysoft avatar Jul 03 '13 22:07 moonpolysoft

Regarding the cost of merge, I don't see a huge cost since the merge cost will be amortized across many, many additions.

It depends on your use case. HLL(++) is often used in web analytics. You precompute things like unique visitors according to a bazillion of criteria (time range, hierarchy, etc.) and store them on disk. This phase can be done in parallel on many machines, and scalability is easy. Then you merge the estimators to create the final report(s). This can be done on the fly or in a batch, so you want fast response times or good throughput. The performance goal is to be IO-bound, not CPU-bound.

On my machine I achieve 300MB/s with the master branch but "only" 100MB/s with Cliff's current code (LOG2M = 14). So it's a regression if you have fast disks or a fast network and want to keep your code simple and single-threaded. It is not a show-stopper, but we should try to avoid this slowdown if possible since it impacts real workloads.

        // buf, len, byteCount and file come from the surrounding benchmark harness
        // (not shown); each read is expected to fill buf with one serialized HyperLogLog.
        ICardinality agg = new HyperLogLog(LOG2M);

        try (FileInputStream fis = new FileInputStream(file)) {
            while (true) {
                int s = fis.read(buf, 0, len);
                if (s == -1) break;          // end of file

                byteCount += s;
                // Deserialize the estimator and fold it into the running aggregate.
                agg = agg.merge(HyperLogLog.Builder.build(buf));
            }
        }

As for the merge performance I'd be willing to bet it's due to the eliding of System.arraycopy from bits(). Will think on how to improve this

I did not have the time to investigate further today but I don't think this is the issue. In my private repo I have added a mutable merge operation implemented like this:

+    public void merge(HyperLogLog other) throws HyperLogLogMergeException {
+        if (other.sizeof() != sizeof())
+        {
+            throw new HyperLogLogMergeException("Cannot merge estimators of different sizes");
+        }
+        registerSet.merge(other.registerSet);
+    }

As you can see, it bypasses the getBits call. I observe exactly the same performance as with the immutable merge method, both on master and on your branch.

BTW: Adding a mutable merge method would be great anyway.

Also would you mind sharing your benchmarking code?

I will upload it ASAP, most likely next week. It is rudimentary but could be a starting point for monitoring and improving stream-lib performance.

cykl avatar Jul 04 '13 22:07 cykl

Ping on benchmarking code.

moonpolysoft avatar Jul 11 '13 05:07 moonpolysoft

Here

Keep in mind that the paint is wet. Comments, contributions & criticism are welcome. Let me know if you need help.

cykl avatar Jul 11 '13 17:07 cykl

Any update on this one? We are considering merging another branch from #38 into master soon. That branch improves the serialization and memory footprint of the HLL Plus considerably but will add more thread safety concerns. I'm happy to wait to merge the results from this issue into the other branch in order to avoid breaking thread safety if you are going to be able to push your changes soon.

abramsm avatar Jul 23 '13 21:07 abramsm

Any update on this pull request?

Thanks

nebrera avatar Dec 25 '13 20:12 nebrera