vert.x
Improve fairness of SimpleConnectionPool by yielding its combiner executor
Motivation:
SimpleConnectionPool is backed by a CombinerExecutor that can be caught in a tight (potentially endless) loop while consuming (and executing) actions, occupying its current running event loop for too long.
This PR adds both duration-based and max-action-count-based yield strategies, with some opportunistic action resuming, i.e. if a resume continuation is emitted, any "lucky" concurrent resume/submit can still consume the existing actions backlog.
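To make the two strategies concrete, here is a minimal sketch of what a duration- and count-bounded drain loop with a resume continuation could look like; all names (YieldingCombiner, maxActionsPerDrain, maxDrainTimeNs, resumeDrain) are illustrative and not the actual vert.x code:

```java
// Hedged sketch (not the actual CombinerExecutor): a combiner drain loop that
// yields either after a maximum number of actions or once a time budget expires.
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class YieldingCombiner {

  interface Action { Runnable execute(); }          // may return a follow-up task

  private final Queue<Action> actions = new ConcurrentLinkedQueue<>();
  private final int maxActionsPerDrain = 1024;      // count-based yield threshold (illustrative)
  private final long maxDrainTimeNs = 1_000_000L;   // 1 ms duration-based budget (illustrative)

  /** Called by the thread that currently owns the combiner. */
  Runnable drain() {
    final long start = System.nanoTime();
    int executed = 0;
    Action action;
    while ((action = actions.poll()) != null) {
      Runnable followUp = action.execute();
      if (followUp != null) {
        followUp.run();
      }
      executed++;
      // Yield: stop draining and hand back a "resume" continuation, so the event
      // loop is not occupied indefinitely; a concurrent submit may opportunistically
      // consume the remaining backlog before the resume runs.
      if (executed >= maxActionsPerDrain || System.nanoTime() - start >= maxDrainTimeNs) {
        return this::resumeDrain;
      }
    }
    return null;                                    // backlog fully consumed, nothing to resume
  }

  private void resumeDrain() {
    // re-acquire the combiner and call drain() again (omitted in this sketch)
  }
}
```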
The current logic does not prevent an issued resume from being emitted again, regardless of the state of any previously submitted one. That means a "lucky" concurrent submit can consume the actions backlog and then emit (or not emit) a continuation again, based on the current state of the backlog rather than on the state of the already scheduled resume(s).
The same can happen if a resume is executed but does not complete its job; since it knows nothing about pending submit(s), it risks competing with them to consume the backlog.
The main purpose of yielding is to guarantee that a single submit cannot occupy a single event loop for too long (how long is still to be defined); it does not mean that subsequent submits (on the same event loop or on other event loops) cannot keep executing actions and making progress.
Conversely, the "yield" state can be turned into a proper combiner state that only a resume operation is allowed to unblock, meaning that any submit issued before it is not allowed to pick up the action backlog (possibly letting it grow too large!). These are two different ways to solve the same problem, each with its own pros and cons.
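As a rough illustration of the two alternatives (opportunistic hand-off vs. a dedicated yielded state that only a resume may leave), assuming a simple atomic state machine; this is not the vert.x implementation:

```java
// Hedged sketch of the two hand-off strategies described above; all names are illustrative.
import java.util.concurrent.atomic.AtomicInteger;

final class YieldHandoff {
  private static final int IDLE = 0, RUNNING = 1, YIELDED = 2;
  private final AtomicInteger state = new AtomicInteger(IDLE);

  // Strategy A ("opportunistic"): on yield go back to IDLE, so either the scheduled
  // resume or any concurrent submit may become the next combiner and consume the
  // backlog, whichever wins the CAS first.
  boolean tryAcquireOpportunistic() {
    return state.compareAndSet(IDLE, RUNNING);
  }
  void yieldOpportunistic() {
    state.set(IDLE);              // submits racing with the resume can still make progress
  }

  // Strategy B ("strict"): on yield move to a dedicated YIELDED state that only the
  // resume continuation is allowed to leave; submits just enqueue and return, which
  // bounds event-loop occupancy but lets the backlog grow until the resume runs.
  void yieldStrict() {
    state.set(YIELDED);
  }
  boolean tryResumeStrict() {
    return state.compareAndSet(YIELDED, RUNNING);
  }
}
```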
@vietj I've changed the benchmark and something that requires investigation has happened:
Benchmark (actionWork) (type) Mode Cnt Score Error Units
CombinerExecutorBenchmark.submitFourThreads 0 sync thrpt 20 23.363 ± 1.548 ops/us
CombinerExecutorBenchmark.submitFourThreads:completed 0 sync thrpt 20 23.365 ± 1.548 ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted 0 sync thrpt 20 23.365 ± 1.548 ops/us
CombinerExecutorBenchmark.submitFourThreads 0 combiner thrpt 20 9.419 ± 0.455 ops/us
CombinerExecutorBenchmark.submitFourThreads:completed 0 combiner thrpt 20 9.423 ± 0.455 ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted 0 combiner thrpt 20 9.423 ± 0.455 ops/us
CombinerExecutorBenchmark.submitFourThreads 1 sync thrpt 20 23.940 ± 0.486 ops/us
CombinerExecutorBenchmark.submitFourThreads:completed 1 sync thrpt 20 23.944 ± 0.486 ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted 1 sync thrpt 20 23.944 ± 0.486 ops/us
CombinerExecutorBenchmark.submitFourThreads 1 combiner thrpt 20 8.640 ± 0.578 ops/us
CombinerExecutorBenchmark.submitFourThreads:completed 1 combiner thrpt 20 8.646 ± 0.578 ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted 1 combiner thrpt 20 8.646 ± 0.578 ops/us
CombinerExecutorBenchmark.submitFourThreads 10 sync thrpt 20 18.615 ± 0.569 ops/us
CombinerExecutorBenchmark.submitFourThreads:completed 10 sync thrpt 20 18.617 ± 0.569 ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted 10 sync thrpt 20 18.617 ± 0.569 ops/us
CombinerExecutorBenchmark.submitFourThreads 10 combiner thrpt 20 9.274 ± 0.644 ops/us
CombinerExecutorBenchmark.submitFourThreads:completed 10 combiner thrpt 20 9.279 ± 0.644 ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted 10 combiner thrpt 20 9.279 ± 0.644 ops/us
CombinerExecutorBenchmark.submitFourThreads 20 sync thrpt 20 12.842 ± 0.643 ops/us
CombinerExecutorBenchmark.submitFourThreads:completed 20 sync thrpt 20 12.843 ± 0.643 ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted 20 sync thrpt 20 12.843 ± 0.643 ops/us
CombinerExecutorBenchmark.submitFourThreads 20 combiner thrpt 20 8.704 ± 0.127 ops/us
CombinerExecutorBenchmark.submitFourThreads:completed 20 combiner thrpt 20 8.708 ± 0.125 ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted 20 combiner thrpt 20 8.708 ± 0.125 ops/us
CombinerExecutorBenchmark.submitOneThread 0 sync thrpt 20 32.403 ± 0.239 ops/us
CombinerExecutorBenchmark.submitOneThread:completed 0 sync thrpt 20 32.403 ± 0.239 ops/us
CombinerExecutorBenchmark.submitOneThread:submitted 0 sync thrpt 20 32.403 ± 0.239 ops/us
CombinerExecutorBenchmark.submitOneThread 0 combiner thrpt 20 23.212 ± 0.205 ops/us
CombinerExecutorBenchmark.submitOneThread:completed 0 combiner thrpt 20 23.212 ± 0.205 ops/us
CombinerExecutorBenchmark.submitOneThread:submitted 0 combiner thrpt 20 23.212 ± 0.205 ops/us
CombinerExecutorBenchmark.submitOneThread 1 sync thrpt 20 32.159 ± 0.906 ops/us
CombinerExecutorBenchmark.submitOneThread:completed 1 sync thrpt 20 32.159 ± 0.906 ops/us
CombinerExecutorBenchmark.submitOneThread:submitted 1 sync thrpt 20 32.159 ± 0.906 ops/us
CombinerExecutorBenchmark.submitOneThread 1 combiner thrpt 20 22.781 ± 0.127 ops/us
CombinerExecutorBenchmark.submitOneThread:completed 1 combiner thrpt 20 22.781 ± 0.127 ops/us
CombinerExecutorBenchmark.submitOneThread:submitted 1 combiner thrpt 20 22.781 ± 0.127 ops/us
CombinerExecutorBenchmark.submitOneThread 10 sync thrpt 20 25.222 ± 0.251 ops/us
CombinerExecutorBenchmark.submitOneThread:completed 10 sync thrpt 20 25.222 ± 0.251 ops/us
CombinerExecutorBenchmark.submitOneThread:submitted 10 sync thrpt 20 25.222 ± 0.251 ops/us
CombinerExecutorBenchmark.submitOneThread 10 combiner thrpt 20 18.778 ± 0.257 ops/us
CombinerExecutorBenchmark.submitOneThread:completed 10 combiner thrpt 20 18.778 ± 0.257 ops/us
CombinerExecutorBenchmark.submitOneThread:submitted 10 combiner thrpt 20 18.778 ± 0.257 ops/us
CombinerExecutorBenchmark.submitOneThread 20 sync thrpt 20 17.348 ± 0.622 ops/us
CombinerExecutorBenchmark.submitOneThread:completed 20 sync thrpt 20 17.348 ± 0.622 ops/us
CombinerExecutorBenchmark.submitOneThread:submitted 20 sync thrpt 20 17.348 ± 0.622 ops/us
CombinerExecutorBenchmark.submitOneThread 20 combiner thrpt 20 14.599 ± 0.064 ops/us
CombinerExecutorBenchmark.submitOneThread:completed 20 combiner thrpt 20 14.599 ± 0.064 ops/us
CombinerExecutorBenchmark.submitOneThread:submitted 20 combiner thrpt 20 14.599 ± 0.064 ops/us
CombinerExecutorBenchmark.submitThreeThreads 0 sync thrpt 20 23.608 ± 1.872 ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed 0 sync thrpt 20 23.609 ± 1.872 ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted 0 sync thrpt 20 23.609 ± 1.872 ops/us
CombinerExecutorBenchmark.submitThreeThreads 0 combiner thrpt 20 11.596 ± 0.978 ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed 0 combiner thrpt 20 11.599 ± 0.978 ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted 0 combiner thrpt 20 11.599 ± 0.978 ops/us
CombinerExecutorBenchmark.submitThreeThreads 1 sync thrpt 20 22.583 ± 1.232 ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed 1 sync thrpt 20 22.585 ± 1.232 ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted 1 sync thrpt 20 22.585 ± 1.232 ops/us
CombinerExecutorBenchmark.submitThreeThreads 1 combiner thrpt 20 11.525 ± 0.867 ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed 1 combiner thrpt 20 11.529 ± 0.867 ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted 1 combiner thrpt 20 11.529 ± 0.867 ops/us
CombinerExecutorBenchmark.submitThreeThreads 10 sync thrpt 20 20.801 ± 0.873 ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed 10 sync thrpt 20 20.802 ± 0.873 ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted 10 sync thrpt 20 20.802 ± 0.873 ops/us
CombinerExecutorBenchmark.submitThreeThreads 10 combiner thrpt 20 11.474 ± 0.111 ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed 10 combiner thrpt 20 11.477 ± 0.111 ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted 10 combiner thrpt 20 11.477 ± 0.111 ops/us
CombinerExecutorBenchmark.submitThreeThreads 20 sync thrpt 20 13.453 ± 0.592 ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed 20 sync thrpt 20 13.454 ± 0.592 ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted 20 sync thrpt 20 13.454 ± 0.592 ops/us
CombinerExecutorBenchmark.submitThreeThreads 20 combiner thrpt 20 10.800 ± 0.693 ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed 20 combiner thrpt 20 10.803 ± 0.693 ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted 20 combiner thrpt 20 10.803 ± 0.693 ops/us
CombinerExecutorBenchmark.submitTwoThreads 0 sync thrpt 20 13.214 ± 1.548 ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed 0 sync thrpt 20 13.215 ± 1.548 ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted 0 sync thrpt 20 13.215 ± 1.548 ops/us
CombinerExecutorBenchmark.submitTwoThreads 0 combiner thrpt 20 21.587 ± 17.779 ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed 0 combiner thrpt 20 21.717 ± 18.038 ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted 0 combiner thrpt 20 21.717 ± 18.038 ops/us
CombinerExecutorBenchmark.submitTwoThreads 1 sync thrpt 20 12.017 ± 1.040 ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed 1 sync thrpt 20 12.018 ± 1.040 ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted 1 sync thrpt 20 12.018 ± 1.040 ops/us
CombinerExecutorBenchmark.submitTwoThreads 1 combiner thrpt 20 8.986 ± 1.087 ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed 1 combiner thrpt 20 8.989 ± 1.088 ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted 1 combiner thrpt 20 8.989 ± 1.088 ops/us
CombinerExecutorBenchmark.submitTwoThreads 10 sync thrpt 20 9.651 ± 0.986 ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed 10 sync thrpt 20 9.652 ± 0.986 ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted 10 sync thrpt 20 9.652 ± 0.986 ops/us
CombinerExecutorBenchmark.submitTwoThreads 10 combiner thrpt 20 10.229 ± 0.317 ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed 10 combiner thrpt 20 10.231 ± 0.317 ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted 10 combiner thrpt 20 10.231 ± 0.317 ops/us
CombinerExecutorBenchmark.submitTwoThreads 20 sync thrpt 20 9.041 ± 0.612 ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed 20 sync thrpt 20 9.041 ± 0.613 ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted 20 sync thrpt 20 9.041 ± 0.613 ops/us
CombinerExecutorBenchmark.submitTwoThreads 20 combiner thrpt 20 10.482 ± 0.209 ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed 20 combiner thrpt 20 10.490 ± 0.217 ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted 20 combiner thrpt 20 10.490 ± 0.217 ops/us
TL;DR: the semaphore-based one seems to perform better than (or only slightly worse than) the combiner impl. Maybe there's something obvious I'm missing, but this requires some investigation.
Sadly, AuxCounters are collected and merged by JMH and cannot report per-thread statistics (that would be nice); I could send a PR to JMH for that, because it would be useful to know how the execution of tasks is distributed between threads. Here I can create a counter, but the reporting won't say anything useful.
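For context, this is roughly how such counters are declared with JMH's @AuxCounters (the exact benchmark above may differ); JMH sums the per-thread values into a single metric, which is the merged-reporting limitation mentioned:

```java
// Minimal sketch of exposing submitted/completed counters via JMH AuxCounters.
import org.openjdk.jmh.annotations.*;

public class CombinerAuxCountersSketch {

  @AuxCounters(AuxCounters.Type.OPERATIONS)
  @State(Scope.Thread)
  public static class Counters {
    public long submitted;   // bumped by the submitting thread
    public long completed;   // bumped when the action has actually been executed
  }

  @Benchmark
  @Threads(4)
  public void submit(Counters counters) {
    counters.submitted++;
    // executing the action (possibly on behalf of another thread) would do:
    counters.completed++;
  }
}
```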
I don't think we should compare this to the semaphore version, because we are not comparing apples to apples. The semaphore version will create contention and can potentially block all threads, while the lock-free version will not.
We should instead compare performance against the same implementation without a deadline, to check that the changes do not increase the cost of submission too much.
> I don't think we should compare this to the semaphore version, because we are not comparing apples to apples. The semaphore version will create contention and can potentially block all threads, while the lock-free version will not.
That's fair, and indeed the benchmark does not compare the same behaviour nor the kind of service we expect from the executor, but still... it has highlighted an important bottleneck.
The semaphore version won't allow too many contended requests on the semaphore (the winner enters and the losers park, without performing any CAS loop), while the chosen JCTools queue uses a CAS loop to submit the action; just switching it to the JCTools MPSC unbounded xadd queue increases performance a lot, given that its designer (myself) tried hard to make it perform better under high contention (on x86; Apple M1 and PowerPC users won't benefit from this optimization).
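Assuming the action queue is one of the JCTools MPSC queues, the swap being discussed would look roughly like this (whether CombinerExecutor uses exactly these classes is my assumption):

```java
import java.util.Queue;
import org.jctools.queues.MpscUnboundedArrayQueue;
import org.jctools.queues.MpscUnboundedXaddArrayQueue;

final class ActionQueues {
  // Original style: producers advance the producer index with a CAS loop,
  // so N competing event loops retry against each other under contention.
  static <T> Queue<T> casBased() {
    return new MpscUnboundedArrayQueue<>(1024);
  }

  // Proposed style: producers claim a slot with a single fetch-and-add (XADD),
  // which keeps scaling under contention on x86 (less so on ARM/PowerPC).
  static <T> Queue<T> xaddBased() {
    return new MpscUnboundedXaddArrayQueue<>(1024);
  }
}
```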
This same scenario can happen in a real-world case if the event loop acting as combiner has N other concurrent event loops competing to add to the action queue; they will end up slowing each other down because of the contention (!)
I can show some numbers if you think that's a valid use case.
These are the numbers with the master benchmark (using 2 threads):
Benchmark Mode Cnt Score Error Units
CombinerExecutorBenchmark.impl thrpt 10 11484.804 ± 5100.280 ops/ms
These are the numbers using the MPSC unbounded xadd queue:
Benchmark Mode Cnt Score Error Units
CombinerExecutorBenchmark.impl thrpt 10 17208.869 ± 1450.893 ops/ms
With 8 threads:
Benchmark Mode Cnt Score Error Units
CombinerExecutorBenchmark.impl thrpt 10 6423.568 ± 361.666 ops/ms
8 threads, using the MPSC unbounded xadd queue:
Benchmark Mode Cnt Score Error Units
CombinerExecutorBenchmark.impl thrpt 10 27310.143 ± 3168.112 ops/ms
The CAS loop of the original queue causes diminishing returns on a burst of requests due to contention, while the xadd one isn't affected; there are other effects in play (q::poll is much cheaper on the CAS queue) that can affect the benchmark, but the behaviour under contention is quite real.
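A stripped-down illustration of the difference (not the queue internals): claiming a producer slot with a CAS loop retries under contention, while an XADD-based claim always completes in one atomic step:

```java
import java.util.concurrent.atomic.AtomicLong;

final class ClaimSlot {
  private final AtomicLong producerIndex = new AtomicLong();

  // CAS-loop claim: under a burst from many threads, every failed CAS is wasted
  // work plus extra cache-line traffic, hence the diminishing return observed.
  long claimWithCasLoop() {
    long index;
    do {
      index = producerIndex.get();
    } while (!producerIndex.compareAndSet(index, index + 1));
    return index;
  }

  // XADD claim: a single atomic fetch-and-add, no retry loop regardless of how
  // many producers are competing.
  long claimWithXadd() {
    return producerIndex.getAndIncrement();
  }
}
```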
> This same scenario can happen in a real-world case if the event loop acting as combiner has N other concurrent event loops competing to add to the action queue; they will end up slowing each other down because of the contention (!)
In practice I think this is really a benchmark scenario: we are talking about event-loop threads which have plenty of tasks to perform, including I/O, and which access a database pool to perform interactions with the database.