vert.x Improve fairness of SimpleConnectionPool by yielding its combiner executor

Motivation:

SimpleConnectionPool is backed by a CombinerExecutor that can be caught in a tight (potentially endless) loop while consuming (and executing) actions, occupying its current running event loop for too long.

This PR add both duration-based and max action count based yield strategies with some opportunistic action resume ie if a resume continuation is emitted, any concurrent "lucky" resume/submit can still consume the existing actions backlog.

Jul 06 '22 13:07 franz1981

Current logic is not enforcing an issued resume to NOT be emitted again regardless the state of any previously submitted one - that means that a "lucky" concurrent submit can consume the actions backlog and emit/not emit any continuation "again" capturing the current state of backlong, instead of the current state of scheduled resume(s). The same can happen if a resume is executed and won't complete its job; given that it doesn't know anything about pending submit(s), it risks to compete with them to consume the backlog.

The main purpose of yield is to guarantee that a single submit cannot occupy a single event loop for too much time (to be defined); but it doesn't mean that subsequent ones (on the same event loops/other event loops) cannot keep on executing actions, making some progress.

Conversely, the "yield" state can be turned in a proper combiner state that only a resume operation is allowed to unblock - meaning that any submit before it won't be allowed to pick-up the action backlog (maybe causing it to become too large!): these are 2 different ways to solve the same problem, each one with their own pros/cons.

Jul 06 '22 13:07 franz1981

@vietj I've changed the benchmark and something that requires investigation has happened:

Benchmark                                               (actionWork)    (type)   Mode  Cnt   Score    Error   Units
CombinerExecutorBenchmark.submitFourThreads                        0      sync  thrpt   20  23.363 ±  1.548  ops/us
CombinerExecutorBenchmark.submitFourThreads:completed              0      sync  thrpt   20  23.365 ±  1.548  ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted              0      sync  thrpt   20  23.365 ±  1.548  ops/us
CombinerExecutorBenchmark.submitFourThreads                        0  combiner  thrpt   20   9.419 ±  0.455  ops/us
CombinerExecutorBenchmark.submitFourThreads:completed              0  combiner  thrpt   20   9.423 ±  0.455  ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted              0  combiner  thrpt   20   9.423 ±  0.455  ops/us
CombinerExecutorBenchmark.submitFourThreads                        1      sync  thrpt   20  23.940 ±  0.486  ops/us
CombinerExecutorBenchmark.submitFourThreads:completed              1      sync  thrpt   20  23.944 ±  0.486  ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted              1      sync  thrpt   20  23.944 ±  0.486  ops/us
CombinerExecutorBenchmark.submitFourThreads                        1  combiner  thrpt   20   8.640 ±  0.578  ops/us
CombinerExecutorBenchmark.submitFourThreads:completed              1  combiner  thrpt   20   8.646 ±  0.578  ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted              1  combiner  thrpt   20   8.646 ±  0.578  ops/us
CombinerExecutorBenchmark.submitFourThreads                       10      sync  thrpt   20  18.615 ±  0.569  ops/us
CombinerExecutorBenchmark.submitFourThreads:completed             10      sync  thrpt   20  18.617 ±  0.569  ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted             10      sync  thrpt   20  18.617 ±  0.569  ops/us
CombinerExecutorBenchmark.submitFourThreads                       10  combiner  thrpt   20   9.274 ±  0.644  ops/us
CombinerExecutorBenchmark.submitFourThreads:completed             10  combiner  thrpt   20   9.279 ±  0.644  ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted             10  combiner  thrpt   20   9.279 ±  0.644  ops/us
CombinerExecutorBenchmark.submitFourThreads                       20      sync  thrpt   20  12.842 ±  0.643  ops/us
CombinerExecutorBenchmark.submitFourThreads:completed             20      sync  thrpt   20  12.843 ±  0.643  ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted             20      sync  thrpt   20  12.843 ±  0.643  ops/us
CombinerExecutorBenchmark.submitFourThreads                       20  combiner  thrpt   20   8.704 ±  0.127  ops/us
CombinerExecutorBenchmark.submitFourThreads:completed             20  combiner  thrpt   20   8.708 ±  0.125  ops/us
CombinerExecutorBenchmark.submitFourThreads:submitted             20  combiner  thrpt   20   8.708 ±  0.125  ops/us
CombinerExecutorBenchmark.submitOneThread                          0      sync  thrpt   20  32.403 ±  0.239  ops/us
CombinerExecutorBenchmark.submitOneThread:completed                0      sync  thrpt   20  32.403 ±  0.239  ops/us
CombinerExecutorBenchmark.submitOneThread:submitted                0      sync  thrpt   20  32.403 ±  0.239  ops/us
CombinerExecutorBenchmark.submitOneThread                          0  combiner  thrpt   20  23.212 ±  0.205  ops/us
CombinerExecutorBenchmark.submitOneThread:completed                0  combiner  thrpt   20  23.212 ±  0.205  ops/us
CombinerExecutorBenchmark.submitOneThread:submitted                0  combiner  thrpt   20  23.212 ±  0.205  ops/us
CombinerExecutorBenchmark.submitOneThread                          1      sync  thrpt   20  32.159 ±  0.906  ops/us
CombinerExecutorBenchmark.submitOneThread:completed                1      sync  thrpt   20  32.159 ±  0.906  ops/us
CombinerExecutorBenchmark.submitOneThread:submitted                1      sync  thrpt   20  32.159 ±  0.906  ops/us
CombinerExecutorBenchmark.submitOneThread                          1  combiner  thrpt   20  22.781 ±  0.127  ops/us
CombinerExecutorBenchmark.submitOneThread:completed                1  combiner  thrpt   20  22.781 ±  0.127  ops/us
CombinerExecutorBenchmark.submitOneThread:submitted                1  combiner  thrpt   20  22.781 ±  0.127  ops/us
CombinerExecutorBenchmark.submitOneThread                         10      sync  thrpt   20  25.222 ±  0.251  ops/us
CombinerExecutorBenchmark.submitOneThread:completed               10      sync  thrpt   20  25.222 ±  0.251  ops/us
CombinerExecutorBenchmark.submitOneThread:submitted               10      sync  thrpt   20  25.222 ±  0.251  ops/us
CombinerExecutorBenchmark.submitOneThread                         10  combiner  thrpt   20  18.778 ±  0.257  ops/us
CombinerExecutorBenchmark.submitOneThread:completed               10  combiner  thrpt   20  18.778 ±  0.257  ops/us
CombinerExecutorBenchmark.submitOneThread:submitted               10  combiner  thrpt   20  18.778 ±  0.257  ops/us
CombinerExecutorBenchmark.submitOneThread                         20      sync  thrpt   20  17.348 ±  0.622  ops/us
CombinerExecutorBenchmark.submitOneThread:completed               20      sync  thrpt   20  17.348 ±  0.622  ops/us
CombinerExecutorBenchmark.submitOneThread:submitted               20      sync  thrpt   20  17.348 ±  0.622  ops/us
CombinerExecutorBenchmark.submitOneThread                         20  combiner  thrpt   20  14.599 ±  0.064  ops/us
CombinerExecutorBenchmark.submitOneThread:completed               20  combiner  thrpt   20  14.599 ±  0.064  ops/us
CombinerExecutorBenchmark.submitOneThread:submitted               20  combiner  thrpt   20  14.599 ±  0.064  ops/us
CombinerExecutorBenchmark.submitThreeThreads                       0      sync  thrpt   20  23.608 ±  1.872  ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed             0      sync  thrpt   20  23.609 ±  1.872  ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted             0      sync  thrpt   20  23.609 ±  1.872  ops/us
CombinerExecutorBenchmark.submitThreeThreads                       0  combiner  thrpt   20  11.596 ±  0.978  ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed             0  combiner  thrpt   20  11.599 ±  0.978  ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted             0  combiner  thrpt   20  11.599 ±  0.978  ops/us
CombinerExecutorBenchmark.submitThreeThreads                       1      sync  thrpt   20  22.583 ±  1.232  ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed             1      sync  thrpt   20  22.585 ±  1.232  ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted             1      sync  thrpt   20  22.585 ±  1.232  ops/us
CombinerExecutorBenchmark.submitThreeThreads                       1  combiner  thrpt   20  11.525 ±  0.867  ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed             1  combiner  thrpt   20  11.529 ±  0.867  ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted             1  combiner  thrpt   20  11.529 ±  0.867  ops/us
CombinerExecutorBenchmark.submitThreeThreads                      10      sync  thrpt   20  20.801 ±  0.873  ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed            10      sync  thrpt   20  20.802 ±  0.873  ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted            10      sync  thrpt   20  20.802 ±  0.873  ops/us
CombinerExecutorBenchmark.submitThreeThreads                      10  combiner  thrpt   20  11.474 ±  0.111  ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed            10  combiner  thrpt   20  11.477 ±  0.111  ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted            10  combiner  thrpt   20  11.477 ±  0.111  ops/us
CombinerExecutorBenchmark.submitThreeThreads                      20      sync  thrpt   20  13.453 ±  0.592  ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed            20      sync  thrpt   20  13.454 ±  0.592  ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted            20      sync  thrpt   20  13.454 ±  0.592  ops/us
CombinerExecutorBenchmark.submitThreeThreads                      20  combiner  thrpt   20  10.800 ±  0.693  ops/us
CombinerExecutorBenchmark.submitThreeThreads:completed            20  combiner  thrpt   20  10.803 ±  0.693  ops/us
CombinerExecutorBenchmark.submitThreeThreads:submitted            20  combiner  thrpt   20  10.803 ±  0.693  ops/us
CombinerExecutorBenchmark.submitTwoThreads                         0      sync  thrpt   20  13.214 ±  1.548  ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed               0      sync  thrpt   20  13.215 ±  1.548  ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted               0      sync  thrpt   20  13.215 ±  1.548  ops/us
CombinerExecutorBenchmark.submitTwoThreads                         0  combiner  thrpt   20  21.587 ± 17.779  ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed               0  combiner  thrpt   20  21.717 ± 18.038  ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted               0  combiner  thrpt   20  21.717 ± 18.038  ops/us
CombinerExecutorBenchmark.submitTwoThreads                         1      sync  thrpt   20  12.017 ±  1.040  ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed               1      sync  thrpt   20  12.018 ±  1.040  ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted               1      sync  thrpt   20  12.018 ±  1.040  ops/us
CombinerExecutorBenchmark.submitTwoThreads                         1  combiner  thrpt   20   8.986 ±  1.087  ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed               1  combiner  thrpt   20   8.989 ±  1.088  ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted               1  combiner  thrpt   20   8.989 ±  1.088  ops/us
CombinerExecutorBenchmark.submitTwoThreads                        10      sync  thrpt   20   9.651 ±  0.986  ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed              10      sync  thrpt   20   9.652 ±  0.986  ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted              10      sync  thrpt   20   9.652 ±  0.986  ops/us
CombinerExecutorBenchmark.submitTwoThreads                        10  combiner  thrpt   20  10.229 ±  0.317  ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed              10  combiner  thrpt   20  10.231 ±  0.317  ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted              10  combiner  thrpt   20  10.231 ±  0.317  ops/us
CombinerExecutorBenchmark.submitTwoThreads                        20      sync  thrpt   20   9.041 ±  0.612  ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed              20      sync  thrpt   20   9.041 ±  0.613  ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted              20      sync  thrpt   20   9.041 ±  0.613  ops/us
CombinerExecutorBenchmark.submitTwoThreads                        20  combiner  thrpt   20  10.482 ±  0.209  ops/us
CombinerExecutorBenchmark.submitTwoThreads:completed              20  combiner  thrpt   20  10.490 ±  0.217  ops/us
CombinerExecutorBenchmark.submitTwoThreads:submitted              20  combiner  thrpt   20  10.490 ±  0.217  ops/us

TLDR the semaphore based one seems to perform better (or slightly worse) then the combiner impl: maybe there's something obvious I'm missing but this requires some investigation

Sadly AuxCounters are collected and merged by JMH, and cannot report per-thread statistics (that would be nice); I can send a PR to JMH for that because it would be nice to know how the executions of tasks is distributed between threads - here I can create a counter but I the reporting won't say anything useful

Jul 11 '22 12:07 franz1981

I think we are not looking to compare this to the semaphore version because we are not comparing apples to apples. The semaphore version will create contention and potentially block all threads and the lock free version will not.

We should instead compare performance to the same implementation without a deadline to see that the changes will not increase too much the cost of the submission.

Jul 11 '22 13:07 vietj

I think we are not looking to compare this to the semaphore version because we are not comparing apples to apples. The semaphore version will create contention and potentially block all threads and the lock free version will not.

It's fair, and indeed the benchmark itself is not comparing the same behaviour nor the kind of service we expect by the executor but still...it has highlighted an important bottleneck.

The semaphore version won't allow too many contended requests on the semaphore (because the winner will enter and the looser will park, without performing any CAS loop), while the JCTools chosen queue, instead, is using a CAS loop to submit the action; just switching it with the JCTools mpsc unbounded xadd q, performance will increase a lot given that who designed it (myself) tried hard to make it a better performer under high contention (on x86, while M1 Apple users and PowerPC won't benefit by this optimization).

This same scenario can happen in a real world case if the event loop acting as combiner got other N concurrent event loops competing to add to the action queue; they will end up slowing each others because of the contention (!)

I can show some number if you think that's a valid use case

Jul 11 '22 19:07 franz1981

These are the numbers with the master benchmark (using 2 threads):

Benchmark                        Mode  Cnt      Score      Error   Units
CombinerExecutorBenchmark.impl  thrpt   10  11484.804 ± 5100.280  ops/ms

These are by using the mpsc unbounded xadd q:

Benchmark                        Mode  Cnt      Score      Error   Units
CombinerExecutorBenchmark.impl  thrpt   10  17208.869 ± 1450.893  ops/ms

With 8 threads:

Benchmark                        Mode  Cnt     Score     Error   Units
CombinerExecutorBenchmark.impl  thrpt   10  6423.568 ± 361.666  ops/ms

8 threads using the mpsc unbounded xadd q:

Benchmark                        Mode  Cnt      Score      Error   Units
CombinerExecutorBenchmark.impl  thrpt   10  27310.143 ± 3168.112  ops/ms

The CAS loop of the original queue is causing a diminishing return on a burst of requests due to the contention, while the xadd one isn't affected; there are other effects in place (q::poll is much cheaper on the cas q) that can affect the benchmark but the behaviour under contention is quite real.

Jul 11 '22 20:07 franz1981

This same scenario can happen in a real world case if the event loop acting as combiner got other N concurrent event loops competing to add to the action queue; they will end up slowing each others because of the contention (!)

In practice I think this is the case of a benchmark, we are talking of event-loop threads which have plenty of tasks to perform, including IO and that access a database pool to perform interactions with the database.

Jul 12 '22 06:07 vietj

vert.x vert.x copied to clipboard

Improve fairness of SimpleConnectionPool by yielding its combiner executor

vert.x
vert.x copied to clipboard