logging-log4j2
LOG4J2-3068 - JCToolsBlockingQueueFactory can use MpscBlockingConsumerArrayQueue
https://issues.apache.org/jira/browse/LOG4J2-3068
Bumping the JCTools version is necessary due to recent fixes around one of the queues we use.
It's a shame we're exposing a blocking queue API there, or we could've used https://github.com/JCTools/JCTools/issues/340 to batch-consume from the queue.
@franz1981, thanks so much for the contribution. Would you mind updating changes.xml too, please?
Thanks for the contribution @franz1981, your PRs are appreciated as always!
I normally handle the changes.xml pieces when I merge PRs since it can be non-obvious (I've certainly gotten it wrong a few times, and I don't think it makes sense to make contributors learn our system).
It's a shame we're exposing a blocking queue API there, or we could've used JCTools/JCTools#340 to batch-consume from the queue.
We could support a lighter-weight API with the ability to use both BlockingQueue (retaining backwards compatibility) and https://github.com/JCTools/JCTools/issues/340; what do you think? Alternatively, a separate JCToolsAsyncAppender is an option.
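A minimal sketch of what that lighter-weight API could look like, assuming a hypothetical `EventQueue` interface (not an existing log4j or JCTools type): any BlockingQueue can be adapted via a poll loop for backwards compatibility, while a JCTools-backed implementation could override `drain` with the batch-consume fast path from JCTools/JCTools#340.

```java
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical transfer API: keeps BlockingQueue compatibility while leaving
// room for a batch-drain fast path (as a JCTools MessagePassingQueue could provide).
interface EventQueue<E> {
    boolean offer(E e);

    // Drain up to 'limit' elements into 'consumer'; returns the number drained.
    int drain(Consumer<E> consumer, int limit);

    // Adapter: any BlockingQueue can back this API via a simple poll loop.
    static <E> EventQueue<E> wrap(BlockingQueue<E> q) {
        return new EventQueue<E>() {
            public boolean offer(E e) {
                return q.offer(e);
            }

            public int drain(Consumer<E> consumer, int limit) {
                int drained = 0;
                E e;
                while (drained < limit && (e = q.poll()) != null) {
                    consumer.accept(e);
                    drained++;
                }
                return drained;
            }
        };
    }
}
```

A JCTools implementation would replace the poll loop with `MessagePassingQueue.drain`, avoiding per-element volatile reads of the consumer index.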
@jvz
Can you add a manual entry for this as well?
Done!
@carterkozak
Alternatively a separate JCToolsAsyncAppender is an option.
I would go in this direction instead :) TBH the existing Disruptor solution is a better fit in terms of feature set:
- FIFO (I suppose that's mandatory, right? If we just need FIFO for the same thread, there are much faster options in JCTools, à la https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscCompoundQueue.java#L136, i.e. using a fixed set of queues to distribute contention on the tail)
- Object pool: this is unbeatable
- batch-friendly: no idea how much this feature is used in log4j, but it can be rather important when appending to files, to batch write syscalls (e.g. FileChannel pwrite).
Thinking about a JCToolsAsyncAppender, I still need to understand whether being bounded is such a must-have: I'm a great fan of backpressure, really, but with logging... it's different.
I see different strategies that could be employed to satisfy different application needs, and AsyncQueueFullPolicy already exposes some of them: I still see value in using an unbounded queue, especially compared with policies that would change the log order or block the application threads.
If the unbounded-queue path is feasible, in JCTools we have some options that perform great under heavy contention, e.g. the xadd-based queues.
The nice thing is that a "mild form" of backpressure is still possible with such alternatives, by carefully choosing how to replicate the Disruptor's "object pooling" feature and using the fixed capacity of that pool to limit the backlog on the unbounded queue.
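A toy sketch of that "mild backpressure" idea, using only JDK types (the names here are illustrative, not log4j or JCTools APIs): the transfer queue itself is unbounded, but events are recycled through a fixed-capacity pool, so at most POOL_SIZE events can ever be in flight and producers block once the pool is exhausted.

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch: a bounded object pool provides the backpressure,
// while the transfer queue stays unbounded.
final class PooledEventQueue {
    static final int POOL_SIZE = 4;

    private final BlockingQueue<StringBuilder> pool = new ArrayBlockingQueue<>(POOL_SIZE);
    private final Queue<StringBuilder> transfer = new ConcurrentLinkedQueue<>(); // unbounded

    PooledEventQueue() {
        // Pre-allocate the event objects, Disruptor-style.
        for (int i = 0; i < POOL_SIZE; i++) {
            pool.offer(new StringBuilder());
        }
    }

    // Producer side: acquiring from the pool blocks when POOL_SIZE events
    // are already enqueued -- that is the "mild" backpressure.
    void publish(String message) throws InterruptedException {
        StringBuilder event = pool.take();
        event.setLength(0);
        event.append(message);
        transfer.offer(event);
    }

    // Consumer side: after handling the event, return it to the pool,
    // which also releases one blocked producer (if any).
    String consume() {
        StringBuilder event = transfer.poll();
        if (event == null) {
            return null;
        }
        String message = event.toString();
        pool.offer(event); // recycle
        return message;
    }
}
```

The pool capacity plays the role the ring buffer size plays in the Disruptor: it bounds both allocation and backlog without making the transfer queue itself blocking.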
Our fully async Disruptor implementation is indeed very difficult to beat in benchmarks, largely for the reasons you've mentioned, and also due to the tighter control of the threading model we get by integrating it directly into the logger system (see AsyncLoggerContext, AsyncLogger) rather than applying it to individual appenders (which requires defensive copies of LogEvent objects). This makes it difficult to compare a JCTools implementation like-for-like against the Disruptor implementation.
- FIFO (I suppose that's mandatory, right? If we just need FIFO for the same thread, there are much faster options in JCTools, à la https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscCompoundQueue.java#L136, i.e. using a fixed set of queues to distribute contention on the tail)
Mandatory? That's difficult to say. In the general case I'd say so; however, many services push events into complex logging pipelines which handle event ordering based on timestamps or some other identifier, so it doesn't matter so much what order events are produced in (depending on the system, of course!).
- Object pool: this is unbeatable
Agreed, this is a big one for us.
- batch-friendly: no idea how much this feature is used in log4j, but it can be rather important when appending to files, to batch write syscalls (e.g. FileChannel pwrite).
The default configuration doesn't take advantage of this property; however, configuring most appenders' immediateFlush property allows them to wait for a LogEvent with endOfBatch=true before flushing. In my experience this increases logging throughput by an order of magnitude.
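The endOfBatch mechanism can be sketched roughly as follows, with a simplified stand-in for LogEvent (the real log4j types carry much more state): with immediateFlush disabled, only the last event of a drained batch triggers a flush, amortizing the expensive syscall over the whole batch.

```java
import java.io.Flushable;
import java.io.IOException;

// Illustrative sketch of endOfBatch-driven flushing; not the actual
// log4j appender implementation.
final class BatchingWriter {
    // Simplified stand-in for org.apache.logging.log4j.core.LogEvent.
    record LogEvent(String message, boolean endOfBatch) {}

    private final Appendable out;
    private final Flushable flusher;
    int flushCount; // exposed for illustration only

    BatchingWriter(Appendable out, Flushable flusher) {
        this.out = out;
        this.flusher = flusher;
    }

    void append(LogEvent event) throws IOException {
        out.append(event.message()).append('\n');
        // With immediateFlush=false, only the event that drained the queue
        // (endOfBatch=true) triggers the flush, so a burst of N events costs
        // one flush instead of N.
        if (event.endOfBatch()) {
            flusher.flush();
            flushCount++;
        }
    }
}
```

The consumer thread sets endOfBatch on the last event it pulls before the queue runs empty, which is why the batching benefit grows under load: the busier the queue, the larger the batches.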
This reminds me, I had meant to research the feasibility of something like libaio for logging file I/O; I recall that you may have some experience with the Java bindings? ;-) I imagine eventually io_uring may allow us to batch appends to multiple files into a single syscall. This could be huge, as it's common to have some subset of events recorded to multiple files, but I admittedly haven't read as much about how this works in practice as I should.
I still see value in using an unbounded queue, especially compared with policies that would change the log order or block the application threads.
Thar be dragons! It would be remarkably easy to OOM with an unbounded queue if used incorrectly. In production systems it's often safer to increase the ring buffer size and use a discarding policy when the queue fills up, but I don't mind allowing the option of an unbounded queue (on AsyncAppender one can already configure a LinkedBlockingQueue with Integer.MAX_VALUE capacity!). I do like the idea of a queue with a dynamic footprint: at work I use a large constant-size ring buffer which accounts for a large chunk of system memory utilization regardless of load or expectations for the service.
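The discarding alternative mentioned above can be sketched like this (illustrative names, in the spirit of log4j's DiscardingAsyncQueueFullPolicy rather than its actual implementation): the bounded queue never blocks the producing application thread; overflow events are counted and dropped.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a bounded queue with a discard-on-full overflow
// policy, the safer production alternative to an unbounded queue.
final class DiscardingPublisher<E> {
    private final BlockingQueue<E> queue;
    private final AtomicLong discarded = new AtomicLong();

    DiscardingPublisher(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Never blocks the application thread: when the queue is full,
    // the event is dropped and counted instead.
    void publish(E event) {
        if (!queue.offer(event)) {
            discarded.incrementAndGet();
        }
    }

    E poll() {
        return queue.poll();
    }

    long discardedCount() {
        return discarded.get();
    }
}
```

The real log4j policy is more selective (it discards only events below a configurable level when the queue passes a threshold), but the memory-safety property is the same: the backlog can never exceed the configured capacity.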
This reminds me, I had meant to research the feasibility of something like libaio for logging file i/o, I recall that you may have some experience with the java bindings? ;-) I imagine eventually io_uring may allow us to batch appends to multiple files into a single syscall, this could be huge as it's common to have some subset of events recorded to multiple files, but I admittedly haven't read as much about how this works in practice as I should.
libaio needs some JNI, as does io_uring, but IMO io_uring is preferable because it's more modern. If it's worth it, I've already implemented a version of libaio that minimizes the use of JNI (in favour of Unsafe now, and VarHandle/Foreign Access in the future), i.e. https://github.com/apache/activemq-artemis-native/pull/10, and that can be used (with the right changes) as a POC to see if it makes sense. Given its write-only nature, a logging library is probably a good fit; hence being able to NOT trigger the OS page cache can bring some benefit.
@carterkozak do we have some benchmarks to run this PR vs the baseline and see if it's worth it? Sorry to ask directly, but I don't have any decent HW at hand other than my laptop at the moment.
Of course, I’d be happy to run the benchmarks! Unfortunately I’m not sure I’ll be able to get to it this week with other things going on.
Sorry for the delay, @franz1981!
This PR, rebased onto master:
```
java -jar log4j-perf/target/benchmarks.jar ".*AsyncAppenderLog4j2Benchmark\.throughputSimple*"

Benchmark                                      (configFileName)                               Mode  Cnt        Score        Error  Units
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndNoLoc-noOpAppender.xml          thrpt    4  1454918.276 ± 233834.928  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndDsrptrNoLoc-noOpAppender.xml    thrpt    4  2386789.891 ±  26025.458  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscQNoLoc-noOpAppender.xml     thrpt    4  2754800.649 ±  27636.937  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscParkQNoLoc-noOpAppender.xml thrpt    4  2172734.114 ±  33748.888  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndXferQNoLoc-noOpAppender.xml     thrpt    4  2360362.123 ±  54748.437  ops/s
```
With more threads:
```
java -jar log4j-perf/target/benchmarks.jar ".*AsyncAppenderLog4j2Benchmark\.throughputSimple*" -t 28

Benchmark                                      (configFileName)                               Mode  Cnt        Score        Error  Units
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndNoLoc-noOpAppender.xml          thrpt    4    98325.087 ±  14675.006  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndDsrptrNoLoc-noOpAppender.xml    thrpt    4  1351814.397 ± 283184.236  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscQNoLoc-noOpAppender.xml     thrpt    4  1337458.477 ± 686440.041  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscParkQNoLoc-noOpAppender.xml thrpt    4  1934682.312 ± 157883.496  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndXferQNoLoc-noOpAppender.xml     thrpt    4  1037054.911 ± 317250.949  ops/s
```
Master, without this PR:
```
java -jar log4j-perf/target/benchmarks.jar ".*AsyncAppenderLog4j2Benchmark\.throughputSimple*"

Benchmark                                      (configFileName)                               Mode  Cnt        Score        Error  Units
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndNoLoc-noOpAppender.xml          thrpt    4  1464295.244 ± 196064.898  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndDsrptrNoLoc-noOpAppender.xml    thrpt    4  2196993.122 ± 732245.757  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscQNoLoc-noOpAppender.xml     thrpt    4  2542884.245 ± 233213.133  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscParkQNoLoc-noOpAppender.xml thrpt    4  3173527.805 ± 107428.527  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndXferQNoLoc-noOpAppender.xml     thrpt    4  2179666.490 ±  83951.441  ops/s
```
With more threads:
```
java -jar log4j-perf/target/benchmarks.jar ".*AsyncAppenderLog4j2Benchmark\.throughputSimple*" -t 28

Benchmark                                      (configFileName)                               Mode  Cnt        Score         Error  Units
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndNoLoc-noOpAppender.xml          thrpt    4   105100.502 ±   17273.746  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndDsrptrNoLoc-noOpAppender.xml    thrpt    4  1075800.634 ±  278110.477  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscQNoLoc-noOpAppender.xml     thrpt    4  1517297.622 ±  180276.994  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscParkQNoLoc-noOpAppender.xml thrpt    4  3613600.519 ±  102724.103  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndXferQNoLoc-noOpAppender.xml     thrpt    4  1100745.644 ± 2014574.543  ops/s
```
Benchmark Implementation
Draft PR with my benchmark changes used for comparison here: #556
This was closed automatically by GitHub because we renamed the release-2.x branch to 2.x. Feel free to resubmit to the 2.x branch.