
(Further) reduce CPU usage of batch span processor

Open franz1981 opened this issue 3 years ago • 15 comments

That's a follow-up issue of https://github.com/open-telemetry/opentelemetry-java/issues/2968

Same purpose, but as mentioned on https://github.com/open-telemetry/opentelemetry-java/issues/2968#issuecomment-864370074

I see that the referenced commit is using a CLQ with a separate size tracker to overcome the size complexity of CLQ; I strongly suggest using MpscUnboundedXaddArrayQueue from JCTools, which offers:

- (order of magnitude) better behaviour than any compare-and-set queue alternative under contention
- GC-free if correctly configured, or otherwise a lower allocation rate than CLQ
- O(1)-ish size, although I suggest trying different strategies: size is linear-ish, but it would invalidate the producer and consumer sequences if queried before any offer, slowing down a streaming consumer

The latter suggests that using a LongAdder or one of the atomic counters in JCTools is a better choice (see the sketch below).
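To make the first two points concrete, a minimal construction sketch (assuming a JCTools dependency on the classpath; the capacity, chunk size and pooled-chunk count are illustrative values, not tuned recommendations):

```java
import org.jctools.queues.MpscArrayQueue;
import org.jctools.queues.MpscUnboundedXaddArrayQueue;

public final class QueueChoiceSketch {
  public static void main(String[] args) {
    // Bounded CAS-based MPSC queue (what the current BatchSpanProcessor uses):
    // producers contend by retrying a compare-and-set on the producer index.
    MpscArrayQueue<Object> casQueue = new MpscArrayQueue<>(2048);

    // Unbounded XADD-based MPSC queue: producers claim slots with a single
    // fetch-and-add (no retry loop under contention), and chunks are pooled
    // and reused so steady-state offers stay allocation-free.
    MpscUnboundedXaddArrayQueue<Object> xaddQueue =
        new MpscUnboundedXaddArrayQueue<>(1024 /* chunk size */, 2 /* pooled chunks */);

    casQueue.offer("a");
    xaddQueue.offer("b");
    System.out.println(casQueue.poll() + " " + xaddQueue.poll());
  }
}
```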

@sbandadd too

franz1981 avatar Jun 21 '21 16:06 franz1981

The referenced commit in the issue was used only for benchmarks. The merged PR used MpscArrayQueue: https://github.com/open-telemetry/opentelemetry-java/blob/50408d499f85d5761d0a5ed9bf9d77d5ff01fff5/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/export/BatchSpanProcessor.java#L75.

@franz1981 could you run the benchmark with the current queue and the new queue you are proposing?

sbandadd avatar Jun 21 '21 16:06 sbandadd

Not as good as I'd wished: image

Consider that I've run the benchmark with the MPSC array queue capacity set to spanCount * 5 (where 5 is the number of threads calling export).

Just FYI, a burst-offer scalability benchmark on a Xeon server comparing the MPSC array queue vs the XADD queue gets: image

That's why I'm a bit surprised about these results.

Is the MPSC array capacity correct? Or should it be less or more?

Probably worth playing a bit with perfnorm/perfasm/async-profiler to check what's going on; there are a couple of things in the implementation that don't look "right" related to the wake-up logic, but I probably need to study the implementation more.

This is the branch I've prepared: https://github.com/franz1981/opentelemetry-java/tree/xadd_batch_span

franz1981 avatar Jun 22 '21 13:06 franz1981

I've investigated with async-profiler what's going on here...

mpsc array q: 412 samples with 100 Hz profiler -> ~42% CPU usage image

mpsc xadd array q: 282 samples with 100 Hz profiler -> ~28% CPU usage image

Due to skidding, in both cases contention doesn't seem to be a factor (both XADD and CAS look innocent, which can't be right), and perfasm can probably tell a different story.

Re the improvement: I see that the CPU usage improvement is paired with a boost in throughput, hence it seems a nice win.

Some bad news: in both cases size is taking too much time, probably because it invalidates both the producer and consumer sequences. We need to get rid of that bottleneck first to see both queues delivering the best they can do: using size on both the offer side and the consumer side prevents the queues from scaling as expected (especially the XADD one), causing unnecessary cache coherency traffic.
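To show what removing the size() dependency could look like, here is a hypothetical sketch (not the actual code in my branch): an approximate counter lives next to the queue, the offer path bounds on the counter, and the drain path adjusts it once per batch, so neither side ever calls queue.size():

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

import org.jctools.queues.MpscUnboundedXaddArrayQueue;

// Hypothetical sketch: bound an unbounded queue with a soft counter instead of size().
final class SoftBoundedBuffer<T> {
  private final MpscUnboundedXaddArrayQueue<T> queue = new MpscUnboundedXaddArrayQueue<>(1024);
  private final AtomicInteger approximateSize = new AtomicInteger();
  private final int maxQueueSize;

  SoftBoundedBuffer(int maxQueueSize) {
    this.maxQueueSize = maxQueueSize;
  }

  // Producer side: check the counter, not queue.size(). Concurrent producers may
  // overshoot the bound slightly; that's the price of keeping it "soft".
  boolean offer(T item) {
    if (approximateSize.get() >= maxQueueSize) {
      return false; // drop, as a bounded queue would
    }
    queue.offer(item); // unbounded offer always succeeds
    approximateSize.incrementAndGet();
    return true;
  }

  // Consumer side: drain a batch and adjust the counter once per batch.
  int drain(Consumer<T> sink, int limit) {
    int drained = 0;
    T item;
    while (drained < limit && (item = queue.poll()) != null) {
      sink.accept(item);
      drained++;
    }
    if (drained > 0) {
      approximateSize.addAndGet(-drained);
    }
    return drained;
  }
}
```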

franz1981 avatar Jun 22 '21 15:06 franz1981

@franz1981 Is this something that can be improved in the jctools implementation itself? It's not intuitive to me why calling size() could have any dramatic effect on performance.

anuraaga avatar Jun 23 '21 01:06 anuraaga

@anuraaga Sadly not: it's something related to how caches work on the machine; we just need to reduce the dependency on the size operation to make progress...
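Roughly what I mean, as a conceptual sketch (this is not the actual JCTools code): size() has to read both indexes, and each index lives on a cache line that the other side keeps writing to, so whichever thread calls it keeps pulling a remote line:

```java
// Conceptual sketch only; JCTools' real implementation differs.
final class ConceptualIndexBasedMpsc {
  private volatile long producerIndex; // written by every producer on offer
  private volatile long consumerIndex; // written by the consumer on poll

  int size() {
    // A producer calling this must load consumerIndex (whose cache line the
    // consumer core owns), and the consumer calling it must load producerIndex
    // (owned by whichever producer wrote last): either way, every call forces
    // cache-coherency traffic on a line the other side wants to keep writing.
    return (int) (producerIndex - consumerIndex);
  }
}
```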

franz1981 avatar Jun 23 '21 04:06 franz1981

Feel free to review my branch and try it on a bigger server to see if you get the same or better improvement.

franz1981 avatar Jun 23 '21 04:06 franz1981

@franz1981 The benchmark I pasted earlier has flush overhead. Let's try this one; it's the benchmark I used earlier to surface contention issues.

sbandadd avatar Jun 23 '21 04:06 sbandadd

@anuraaga Thanks! I don't have a big machine to try it on, just a tuned i7 (without turbo boost, with frequency scaling disabled, etc.) with 6 hardware threads (12 with HT). I am going to try it, but feel free to try it on a big server if you can, or the producer contention won't be exercised for real.

franz1981 avatar Jun 23 '21 04:06 franz1981

@anuraaga I have given the mentioned benchmark a shot, but it doesn't seem to stress queue contention, because the XADD queue is unbounded and the test doesn't start at steady state (as the previous one does). The consumer just won't keep up with the producers, and the benchmark ends up spending most of its time in the garbage collector instead... I think that to make this work I should first implement some artificial limit on queue growth.

Just a semantic question: does the queue have to obey a total FIFO order, or must each producer's items be consumed in FIFO order while items from different producers can be consumed in a different interleaving?

franz1981 avatar Jun 24 '21 05:06 franz1981

Thanks for the investigation here @franz1981, happy to make changes if anything can speed things up even more. Order on the consumer side does not matter as long as all the items that have been produced are consumed.

anuraaga avatar Jun 24 '21 06:06 anuraaga

@anuraaga In this case, in JCTools we have an unbeatable queue for this: https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscCompoundQueue.java

It uses a mesh of bounded queues to spread the contention: it should be the right queue for the job, without re-implementing the logic to provide an unneeded total ordering.
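A hypothetical usage sketch (assuming a JCTools dependency; the capacity is illustrative): it exposes the usual offer/poll API, so it could slot in wherever the current MpscArrayQueue is created:

```java
import org.jctools.queues.MpscCompoundQueue;

public final class CompoundQueueSketch {
  public static void main(String[] args) {
    // The total capacity is spread over several bounded sub-queues (roughly one
    // per core), so producers on different cores mostly hit different sub-queues
    // instead of all contending on a single producer index.
    MpscCompoundQueue<Object> queue = new MpscCompoundQueue<>(2048);

    queue.offer("span");
    // Per-producer FIFO holds, but items from different producers may be
    // consumed in a different interleaving than they were offered.
    System.out.println(queue.poll());
  }
}
```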

franz1981 avatar Jun 24 '21 06:06 franz1981

@franz1981 IIRC, I tried the compound queue for that reason but couldn't find any noticeable difference in our benchmarks on my laptop, so I went with the simpler non-compound one. My main concern with the compound queue was that our friend the size() method would have less reliable performance, possibly scaling down with the number of CPUs. It could need a different benchmark though, but at least with our current ones I couldn't justify using the compound queue.

anuraaga avatar Jun 24 '21 06:06 anuraaga

@anuraaga

I tried the compound queue for that reason but couldn't find any noticable difference in our benchmarks on my laptop, so went with the simpler non-compound one

And I'm seeing the same here: so it just means that the benchmark isn't stressing contention against the queue (because the two queues behave very differently under contention, by design) and/or, as you rightly pointed out, size is just a major factor, although the profile data won't show it clearly TBH. If contention is a thing, i.e. a factor that you've found to matter under realistic load (using the right profiler tool), then I'm sure it's worth moving to the compound queue, unless there are other bottlenecks in place. I'm going to prepare a couple of flamegraphs to show what I mean.

franz1981 avatar Jun 24 '21 07:06 franz1981

I've used a separate soft/lazy counter on https://github.com/franz1981/opentelemetry-java/tree/xadd_batch_span and I'm getting:

image

Consider that I don't know yet if the logic is sound

franz1981 avatar Jun 25 '21 07:06 franz1981

And for 20 threads (which doesn't make sense on my poor 12-core laptop): image

Probably worth trying this one on an appropriately bigger machine IMO: consider that I didn't isolate the soft size tracker in its own cache line, hence it should suffer from noisy neighbors on the heap, i.e. false sharing.
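For what I mean by isolating the tracker, a hypothetical padding sketch (one of several ways to do it; a LongAdder already pads its cells, and @Contended is another option when the JVM flags allow it):

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Hypothetical sketch: keep the counter on its own cache line(s) so updates to it
// don't invalidate neighbouring hot fields on the heap (false sharing).
final class PaddedCounter {
  // ~56 bytes of padding on each side of the hot field, assuming 64-byte cache lines.
  long p01, p02, p03, p04, p05, p06, p07;
  private volatile long value;
  long p11, p12, p13, p14, p15, p16, p17;

  private static final AtomicLongFieldUpdater<PaddedCounter> VALUE =
      AtomicLongFieldUpdater.newUpdater(PaddedCounter.class, "value");

  long get() {
    return value;
  }

  void add(long delta) {
    VALUE.getAndAdd(this, delta);
  }
}
```

Note that the JVM may reorder fields within a single class, so real implementations (JCTools, Agrona) usually put the padding in superclass layers or rely on @Contended; the sketch above is only meant to illustrate the idea.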

And I would like to run the same tests as https://github.com/open-telemetry/opentelemetry-java/issues/3338#issuecomment-866006077 again: I suppose that anything precise enough to track a size will cause contention to become a major factor with the right benchmark, and looking at the results I believe the benchmark in https://github.com/open-telemetry/opentelemetry-java/issues/3338#issuecomment-866514933 is not that appropriate: it causes contention on the queue until it starts to drop items because of too much load... I don't know if that's a realistic use case either.

Bursty scenarios seem more realistic than streaming (with looooong streaming intervals), but I am not an expert on the framework, so I could be wrong.

I would invite @anuraaga to run my branch in their pre-prod/test env on real machines to check if there are some realistic improvements in a real-world use case; then we can think about how to capture this in a synthetic benchmark ;)

franz1981 avatar Jun 25 '21 07:06 franz1981