logging-log4j2
LOG4J2-3068 - JCToolsBlockingQueueFactory can use MpscBlockingConsumerArrayQueue
https://issues.apache.org/jira/browse/LOG4J2-3068
Bumping the JCTools version is necessary due to recent fixes around one of the queues we use.
It's a shame we're exposing a blocking queue API there, or we could've used https://github.com/JCTools/JCTools/issues/340 to batch-consume from the queue.
@franz1981, thanks so much for the contribution. Would you mind updating changes.xml too, please?
Thanks for the contribution @franz1981, your PRs are appreciated as always!
I normally handle the changes.xml pieces when I merge PRs since it can be non-obvious (I've certainly gotten it wrong a few times, and I don't think it makes sense to make contributors learn our system).
It's a shame we're exposing a blocking queue API there, or we could've used JCTools/JCTools#340 to batch-consume from the queue.
We could support a lighter-weight API with the ability to use both BlockingQueue (retaining backwards compatibility) and https://github.com/JCTools/JCTools/issues/340; what do you think? Alternatively, a separate JCToolsAsyncAppender is an option.
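A minimal sketch of what that lighter-weight API could look like, assuming a hypothetical `EventQueue` interface (not an existing log4j or JCTools type): any BlockingQueue can be adapted via a poll loop for backwards compatibility, while a JCTools-backed implementation could override `drain` with the batch-consume fast path from JCTools/JCTools#340.

```java
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical transfer API: keeps BlockingQueue compatibility while leaving
// room for a batch-drain fast path (as a JCTools MessagePassingQueue could provide).
interface EventQueue<E> {
    boolean offer(E e);

    // Drain up to 'limit' elements into 'consumer'; returns the number drained.
    int drain(Consumer<E> consumer, int limit);

    // Adapter: any BlockingQueue can back this API via a simple poll loop.
    static <E> EventQueue<E> wrap(BlockingQueue<E> q) {
        return new EventQueue<E>() {
            public boolean offer(E e) {
                return q.offer(e);
            }

            public int drain(Consumer<E> consumer, int limit) {
                int drained = 0;
                E e;
                while (drained < limit && (e = q.poll()) != null) {
                    consumer.accept(e);
                    drained++;
                }
                return drained;
            }
        };
    }
}
```

A JCTools implementation would replace the poll loop with `MessagePassingQueue.drain`, avoiding per-element volatile reads of the consumer index.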
@jvz
Can you add a manual entry for this as well?
Done!
@carterkozak
Alternatively a separate JCToolsAsyncAppender is an option.
I would go in this direction instead :) TBH the existing Disruptor solution is a better fit in terms of feature set:
- FIFO (I suppose that's mandatory, right? If we just need FIFO for the same thread, there are much faster options in JCTools, à la https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscCompoundQueue.java#L136, i.e. using a fixed set of queues to distribute contention on the tail)
- Object pool: this is unbeatable
- batch-friendly: no idea how much this feature is used in log4j, but it can be rather important when appending to files, to batch write syscalls (e.g. FileChannel pwrite).
Thinking about a JCToolsAsyncAppender, I still need to understand whether being bounded is such a must-have: I'm a great fan of backpressure, really, but with logging... it's different.
I see different strategies that could be employed to satisfy different application needs, and AsyncQueueFullPolicy already exposes some of them: I still see value in using an unbounded queue, especially compared with policies that would change the log order or block the application threads.
If the unbounded-queue path is feasible, in JCTools we have some options that perform great under heavy contention, e.g. the xadd-based queues.
The nice thing is that a "mild form" of backpressure is still possible with such alternatives, by carefully choosing how to replicate the Disruptor's "object pooling" feature and using the fixed capacity of that pool to limit the backlog on the unbounded queue.
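A toy sketch of that "mild backpressure" idea, using only JDK types (the names here are illustrative, not log4j or JCTools APIs): the transfer queue itself is unbounded, but events are recycled through a fixed-capacity pool, so at most POOL_SIZE events can ever be in flight and producers block once the pool is exhausted.

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch: a bounded object pool provides the backpressure,
// while the transfer queue stays unbounded.
final class PooledEventQueue {
    static final int POOL_SIZE = 4;

    private final BlockingQueue<StringBuilder> pool = new ArrayBlockingQueue<>(POOL_SIZE);
    private final Queue<StringBuilder> transfer = new ConcurrentLinkedQueue<>(); // unbounded

    PooledEventQueue() {
        // Pre-allocate the event objects, Disruptor-style.
        for (int i = 0; i < POOL_SIZE; i++) {
            pool.offer(new StringBuilder());
        }
    }

    // Producer side: acquiring from the pool blocks when POOL_SIZE events
    // are already enqueued -- that is the "mild" backpressure.
    void publish(String message) throws InterruptedException {
        StringBuilder event = pool.take();
        event.setLength(0);
        event.append(message);
        transfer.offer(event);
    }

    // Consumer side: after handling the event, return it to the pool,
    // which also releases one blocked producer (if any).
    String consume() {
        StringBuilder event = transfer.poll();
        if (event == null) {
            return null;
        }
        String message = event.toString();
        pool.offer(event); // recycle
        return message;
    }
}
```

The pool capacity plays the role the ring buffer size plays in the Disruptor: it bounds both allocation and backlog without making the transfer queue itself blocking.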
Our fully async Disruptor implementation is indeed very difficult to beat in benchmarks, largely for the reasons you've mentioned, and also due to the tighter control of the threading model we get by integrating it directly into the logger system (see AsyncLoggerContext, AsyncLogger) rather than applying it to individual appenders (which requires defensive copies of LogEvent objects). This makes it difficult to compare a JCTools implementation like-for-like against the Disruptor implementation.
- FIFO (I suppose that's mandatory, right? If we just need FIFO for the same thread, there are much faster options in JCTools, à la https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscCompoundQueue.java#L136, i.e. using a fixed set of queues to distribute contention on the tail)
Mandatory? That's difficult to say. In the general case I'd say so; however, many services push events into complex logging pipelines which handle event ordering based on timestamps or some other identifier, so it doesn't matter so much what order events are produced in (depending on the system, of course!).
- Object pool: this is unbeatable
Agreed, this is a big one for us.
- batch-friendly: no idea how much this feature is used in log4j, but it can be rather important when appending to files, to batch write syscalls (e.g. FileChannel pwrite).
The default configuration doesn't take advantage of this property; however, configuring most appenders' immediateFlush property allows them to wait for a LogEvent with endOfBatch=true before flushing. In my experience this increases logging throughput by an order of magnitude.
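The endOfBatch mechanism can be sketched roughly as follows, with a simplified stand-in for LogEvent (the real log4j types carry much more state): with immediateFlush disabled, only the last event of a drained batch triggers a flush, amortizing the expensive syscall over the whole batch.

```java
import java.io.Flushable;
import java.io.IOException;

// Illustrative sketch of endOfBatch-driven flushing; not the actual
// log4j appender implementation.
final class BatchingWriter {
    // Simplified stand-in for org.apache.logging.log4j.core.LogEvent.
    record LogEvent(String message, boolean endOfBatch) {}

    private final Appendable out;
    private final Flushable flusher;
    int flushCount; // exposed for illustration only

    BatchingWriter(Appendable out, Flushable flusher) {
        this.out = out;
        this.flusher = flusher;
    }

    void append(LogEvent event) throws IOException {
        out.append(event.message()).append('\n');
        // With immediateFlush=false, only the event that drained the queue
        // (endOfBatch=true) triggers the flush, so a burst of N events costs
        // one flush instead of N.
        if (event.endOfBatch()) {
            flusher.flush();
            flushCount++;
        }
    }
}
```

The consumer thread sets endOfBatch on the last event it pulls before the queue runs empty, which is why the batching benefit grows under load: the busier the queue, the larger the batches.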
This reminds me, I had meant to research the feasibility of something like libaio for logging file I/O; I recall that you may have some experience with the Java bindings? ;-) I imagine eventually io_uring may allow us to batch appends to multiple files into a single syscall. This could be huge, as it's common to have some subset of events recorded to multiple files, but I admittedly haven't read as much about how this works in practice as I should.
I still see value in using an unbounded queue, especially compared with policies that would change the log order or block the application threads.
Thar be dragons! It would be remarkably easy to OOM with an unbounded queue if used incorrectly. In production systems it's often safer to increase the ring buffer size and use a discarding policy when the queue fills up, but I don't mind allowing the option of an unbounded queue (on AsyncAppender one can already configure a LinkedBlockingQueue with Integer.MAX_VALUE capacity!). I do like the idea of a queue with a dynamic footprint: at work I use a large constant-size ring buffer which accounts for a large chunk of system memory utilization regardless of load or expectations for the service.
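The discarding alternative mentioned above can be sketched like this (illustrative names, in the spirit of log4j's DiscardingAsyncQueueFullPolicy rather than its actual implementation): the bounded queue never blocks the producing application thread; overflow events are counted and dropped.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a bounded queue with a discard-on-full overflow
// policy, the safer production alternative to an unbounded queue.
final class DiscardingPublisher<E> {
    private final BlockingQueue<E> queue;
    private final AtomicLong discarded = new AtomicLong();

    DiscardingPublisher(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Never blocks the application thread: when the queue is full,
    // the event is dropped and counted instead.
    void publish(E event) {
        if (!queue.offer(event)) {
            discarded.incrementAndGet();
        }
    }

    E poll() {
        return queue.poll();
    }

    long discardedCount() {
        return discarded.get();
    }
}
```

The real log4j policy is more selective (it discards only events below a configurable level when the queue passes a threshold), but the memory-safety property is the same: the backlog can never exceed the configured capacity.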
This reminds me, I had meant to research the feasibility of something like libaio for logging file i/o, I recall that you may have some experience with the java bindings? ;-) I imagine eventually io_uring may allow us to batch appends to multiple files into a single syscall, this could be huge as it's common to have some subset of events recorded to multiple files, but I admittedly haven't read as much about how this works in practice as I should.
libaio needs some JNI, as does io_uring, but IMO io_uring is preferable because it's more modern. If it's worth it, I've already implemented a version of libaio that minimizes the use of JNI (in favour of Unsafe now, and VarHandle/Foreign Access in the future), i.e. https://github.com/apache/activemq-artemis-native/pull/10, and that can be used (with the right changes) as a POC to see if it makes sense. Given its write-only nature, a logging library is probably a good fit; hence being able to NOT trigger the OS page cache can bring some benefit.
@carterkozak do we have some benchmarks to run this PR vs the baseline and see if it's worth it? Sorry to ask directly, but I don't have any decent HW at hand other than my laptop at the moment.
Of course, I’d be happy to run the benchmarks! Unfortunately I’m not sure I’ll be able to get to it this week with other things going on.
Sorry for the delay, @franz1981!
This PR, rebased onto master:
```
java -jar log4j-perf/target/benchmarks.jar ".*AsyncAppenderLog4j2Benchmark\.throughputSimple*"

Benchmark                                      (configFileName)                               Mode  Cnt        Score        Error  Units
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndNoLoc-noOpAppender.xml          thrpt    4  1454918.276 ± 233834.928  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndDsrptrNoLoc-noOpAppender.xml    thrpt    4  2386789.891 ±  26025.458  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscQNoLoc-noOpAppender.xml     thrpt    4  2754800.649 ±  27636.937  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscParkQNoLoc-noOpAppender.xml thrpt    4  2172734.114 ±  33748.888  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndXferQNoLoc-noOpAppender.xml     thrpt    4  2360362.123 ±  54748.437  ops/s
```
With more threads:
```
java -jar log4j-perf/target/benchmarks.jar ".*AsyncAppenderLog4j2Benchmark\.throughputSimple*" -t 28

Benchmark                                      (configFileName)                               Mode  Cnt        Score        Error  Units
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndNoLoc-noOpAppender.xml          thrpt    4    98325.087 ±  14675.006  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndDsrptrNoLoc-noOpAppender.xml    thrpt    4  1351814.397 ± 283184.236  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscQNoLoc-noOpAppender.xml     thrpt    4  1337458.477 ± 686440.041  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscParkQNoLoc-noOpAppender.xml thrpt    4  1934682.312 ± 157883.496  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndXferQNoLoc-noOpAppender.xml     thrpt    4  1037054.911 ± 317250.949  ops/s
```
Master, without this PR:
```
java -jar log4j-perf/target/benchmarks.jar ".*AsyncAppenderLog4j2Benchmark\.throughputSimple*"

Benchmark                                      (configFileName)                               Mode  Cnt        Score        Error  Units
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndNoLoc-noOpAppender.xml          thrpt    4  1464295.244 ± 196064.898  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndDsrptrNoLoc-noOpAppender.xml    thrpt    4  2196993.122 ± 732245.757  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscQNoLoc-noOpAppender.xml     thrpt    4  2542884.245 ± 233213.133  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscParkQNoLoc-noOpAppender.xml thrpt    4  3173527.805 ± 107428.527  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndXferQNoLoc-noOpAppender.xml     thrpt    4  2179666.490 ±  83951.441  ops/s
```
With more threads:
```
java -jar log4j-perf/target/benchmarks.jar ".*AsyncAppenderLog4j2Benchmark\.throughputSimple*" -t 28

Benchmark                                      (configFileName)                               Mode  Cnt        Score         Error  Units
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndNoLoc-noOpAppender.xml          thrpt    4   105100.502 ±   17273.746  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndDsrptrNoLoc-noOpAppender.xml    thrpt    4  1075800.634 ±  278110.477  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscQNoLoc-noOpAppender.xml     thrpt    4  1517297.622 ±  180276.994  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscParkQNoLoc-noOpAppender.xml thrpt    4  3613600.519 ±  102724.103  ops/s
AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndXferQNoLoc-noOpAppender.xml     thrpt    4  1100745.644 ± 2014574.543  ops/s
```
Benchmark Implementation
Draft PR with my benchmark changes used for comparison here: #556
This was closed automatically by GitHub because we renamed the release-2.x branch to 2.x. Feel free to resubmit to the 2.x branch.