
Batch applying events to kqueue

panjf2000 opened this pull request 1 year ago • 32 comments

kqueue has the capability of batch applying events:

The kevent(), kevent64() and kevent_qos() system calls are used to register events with the queue, and return any pending events to the user. The changelist argument is a pointer to an array of kevent, kevent64_s or kevent_qos_s structures, as defined in <sys/event.h>. All changes contained in the changelist are applied before any pending events are read from the queue. The nchanges argument gives the size of changelist.

This PR implements this batching for kqueue, letting us eliminate a large number of kevent(2) system calls.

panjf2000 avatar May 07 '24 04:05 panjf2000
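
For readers skimming the thread, here is a minimal, hypothetical sketch of the batching pattern the man page describes, not the PR's actual code: changes accumulate in a changelist and are applied by the same kevent() call that polls for events.

#include <stdint.h>
#include <sys/event.h>
#include <sys/time.h>

/* Illustrative bound; a real loop must flush when the array fills. */
#define MAX_CHANGES 1024

static struct kevent changes[MAX_CHANGES];
static int nchanges = 0;

/* Queue a change instead of issuing a kevent() syscall immediately. */
static void queue_change(int fd, int16_t filter, uint16_t flags) {
    EV_SET(&changes[nchanges++], fd, filter, flags, 0, 0, NULL);
}

/* One syscall applies every queued change and then waits for events,
 * since the kernel processes the changelist before reading the queue. */
static int poll_events(int kqfd, struct kevent *out, int nout,
                       const struct timespec *timeout) {
    int n = kevent(kqfd, changes, nchanges, out, nout, timeout);
    nchanges = 0;
    return n;
}

With this shape, registering a read event becomes queue_change(fd, EVFILT_READ, EV_ADD), and the registration costs no syscall of its own; it piggybacks on the next poll.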

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.23%. Comparing base (5a51bf5) to head (92b6d80).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #449      +/-   ##
============================================
+ Coverage     70.17%   70.23%   +0.05%     
============================================
  Files           110      110              
  Lines         60077    60077              
============================================
+ Hits          42160    42195      +35     
+ Misses        17917    17882      -35     

see 13 files with indirect coverage changes

codecov[bot] avatar May 07 '24 04:05 codecov[bot]

Since we don't automatically test on FreeBSD: https://github.com/valkey-io/valkey/actions/runs/8994018097

madolson avatar May 08 '24 00:05 madolson

Since we don't automatically test on FreeBSD: valkey-io/valkey/actions/runs/8994018097

Thanks, I see it's succeeded. Just out of curiosity, why didn't we set up a macOS runner that would run by default?

panjf2000 avatar May 08 '24 03:05 panjf2000

Thanks, I see it's succeeded. Just out of curiosity, why didn't we set up a macOS runner that would run by default?

Runner execution time was the historical reason: we didn't want to burn resources on running all the tests, since we have a daily set of tests that is slightly more comprehensive.

madolson avatar May 08 '24 15:05 madolson

Hi @madolson, any new thoughts on this?

panjf2000 avatar May 09 '24 13:05 panjf2000

@panjf2000 It makes sense to me, I just haven't had a chance to walk through it and convince myself the code is correct. Hopefully tomorrow?

madolson avatar May 10 '24 02:05 madolson

@panjf2000 It makes sense to me, I just haven't had a chance to walk through it and convince myself the code is correct. Hopefully tomorrow?

Sure, take your time. I was asking only to make sure you'd like to continue reviewing this.

panjf2000 avatar May 10 '24 02:05 panjf2000

Do we expect this to make the code more performant?

This PR is expected to eliminate a large number of redundant kevent(2) system calls, which should benefit performance significantly.

panjf2000 avatar May 13 '24 01:05 panjf2000

which should benefit performance significantly.

Would you mind using valkey-benchmark -r 10000000 -d 100 -n 10000000 with and without the change to compare the performance? I'm not very happy with the odd failure modes, so I'm trying to understand how much performance we're really going to gain.

madolson avatar May 13 '24 17:05 madolson

which should benefit performance significantly.

Would you mind using valkey-benchmark -r 10000000 -d 100 -n 10000000 with and without the change to compare the performance? I'm not very happy with the odd failure modes, so I'm trying to understand how much performance we're really going to gain.

Environment

 Model : Mac Studio (2022)
    OS : macOS 14.1.2
   CPU : Apple M1 Max, 10-core CPU with 8 performance cores and 2 efficiency cores
Memory : 32GB unified memory

Benchmark command

# With RDB and AOF disabled on the server.
valkey-benchmark -r 10000000 -d 100 -n 10000000 -q

Benchmark results

valkey-io:unstable

PING_INLINE: 141031.80 requests per second, p50=0.175 msec
PING_MBULK: 164306.14 requests per second, p50=0.159 msec
SET: 160627.09 requests per second, p50=0.191 msec
GET: 156184.11 requests per second, p50=0.175 msec
INCR: 161022.81 requests per second, p50=0.167 msec
LPUSH: 164573.83 requests per second, p50=0.167 msec
RPUSH: 164098.53 requests per second, p50=0.167 msec
LPOP: 163655.41 requests per second, p50=0.167 msec
RPOP: 162211.27 requests per second, p50=0.167 msec
SADD: 160130.66 requests per second, p50=0.167 msec
HSET: 151906.42 requests per second, p50=0.191 msec
SPOP: 158737.72 requests per second, p50=0.175 msec
ZADD: 127665.01 requests per second, p50=0.319 msec
ZPOPMIN: 158866.33 requests per second, p50=0.175 msec
LPUSH (needed to benchmark LRANGE): 164546.77 requests per second, p50=0.167 msec
LRANGE_100 (first 100 elements): 86594.33 requests per second, p50=0.311 msec
LRANGE_300 (first 300 elements): 36351.34 requests per second, p50=0.663 msec
LRANGE_500 (first 500 elements): 22212.60 requests per second, p50=0.751 msec
LRANGE_600 (first 600 elements): 19722.66 requests per second, p50=0.839 msec
MSET (10 keys): 54174.70 requests per second, p50=0.839 msec
XADD: 157696.38 requests per second, p50=0.175 msec

panjf2000:kqueue-batch

PING_INLINE: 159212.86 requests per second, p50=0.159 msec
PING_MBULK: 184396.38 requests per second, p50=0.143 msec
SET: 173722.70 requests per second, p50=0.207 msec
GET: 179742.97 requests per second, p50=0.159 msec
INCR: 182832.06 requests per second, p50=0.159 msec
LPUSH: 174146.25 requests per second, p50=0.151 msec
RPUSH: 174794.62 requests per second, p50=0.151 msec
LPOP: 173250.17 requests per second, p50=0.151 msec
RPOP: 175260.27 requests per second, p50=0.151 msec
SADD: 181957.12 requests per second, p50=0.159 msec
HSET: 174380.08 requests per second, p50=0.175 msec
SPOP: 186985.80 requests per second, p50=0.159 msec
ZADD: 139279.66 requests per second, p50=0.303 msec
ZPOPMIN: 182648.41 requests per second, p50=0.151 msec
LPUSH (needed to benchmark LRANGE): 180554.31 requests per second, p50=0.151 msec
LRANGE_100 (first 100 elements): 93919.64 requests per second, p50=0.287 msec
LRANGE_300 (first 300 elements): 37977.62 requests per second, p50=0.639 msec
LRANGE_500 (first 500 elements): 23096.98 requests per second, p50=0.719 msec
LRANGE_600 (first 600 elements): 20357.48 requests per second, p50=0.815 msec
MSET (10 keys): 54659.74 requests per second, p50=0.823 msec
XADD: 170558.23 requests per second, p50=0.159 msec

@madolson

panjf2000 avatar May 14 '24 13:05 panjf2000

Ok, those performance numbers are compelling, although I'm a little surprised at the result. During the typical command flow, we shouldn't call any of these functions, so I'm surprised to see so much improvement.

madolson avatar May 14 '24 16:05 madolson

A pipelined benchmark of data-retrieval commands can trigger aeApiAddEvent for AE_WRITABLE repeatedly. Posting these numbers as an additional reference.

./valkey-benchmark -r 10000000 -d 1024 -n 10000000 -P 100 -t get,lpop,rpop,spop,zpopmin -q

valkey-io:unstable:

GET: 3898635.50 requests per second, p50=1.151 msec
LPOP: 3837298.50 requests per second, p50=1.183 msec
RPOP: 3752345.25 requests per second, p50=1.215 msec
SPOP: 3979307.50 requests per second, p50=1.135 msec
ZPOPMIN: 3790750.50 requests per second, p50=1.199 msec

panjf2000:kqueue-batch:

GET: 4093327.75 requests per second, p50=1.095 msec
LPOP: 3971406.00 requests per second, p50=1.143 msec
RPOP: 4127115.00 requests per second, p50=1.095 msec
SPOP: 4152824.00 requests per second, p50=1.095 msec
ZPOPMIN: 3977724.75 requests per second, p50=1.143 msec

@madolson

panjf2000 avatar May 17 '24 03:05 panjf2000

Ok, so I'm still worried about the panic and how it might impact production systems for edge cases related to kqueue failures. I would like to get the performance improvement, but don't want that to come at the cost of potential failures. Would you be happy with this being some type of build flag so that folks have to opt-in to it at build time?

madolson avatar May 19 '24 19:05 madolson

Ok, so I'm still worried about the panic and how it might impact production systems for edge cases related to kqueue failures. I would like to get the performance improvement, but don't want that to come at the cost of potential failures. Would you be happy with this being some type of build flag so that folks have to opt-in to it at build time?

I'm OK with that. But I'm not sure how you plan on doing that. The first way I can think of is to add a new configuration in valkey.conf. Is this what you had in mind? @madolson

panjf2000 avatar May 19 '24 22:05 panjf2000

Another approach could be Makefile flags, something like BUILD_TLS or USE_SYSTEMD. This looks more feasible.

panjf2000 avatar May 19 '24 22:05 panjf2000
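
To make the opt-in idea concrete, here is a hypothetical sketch of how a compile-time macro could gate the two paths. USE_KQUEUE_BATCH is the flag name used in this PR, while the build wiring (e.g. make CFLAGS=-DUSE_KQUEUE_BATCH) and queue_change, the helper from the earlier sketch, are assumptions.

#include <stdint.h>
#include <sys/event.h>

/* Hypothetical helper from the earlier sketch: defers the change into
 * a changelist that is flushed by the next polling kevent() call. */
void queue_change(int fd, int16_t filter, uint16_t flags);

static int add_read_event(int kqfd, int fd) {
#ifdef USE_KQUEUE_BATCH
    (void)kqfd;                            /* flushed later, not here */
    queue_change(fd, EVFILT_READ, EV_ADD); /* deferred registration */
    return 0;                              /* failures surface at the flush */
#else
    /* Default path: one kevent() syscall per change, error reported now. */
    struct kevent ke;
    EV_SET(&ke, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
    return kevent(kqfd, &ke, 1, NULL, 0, NULL);
#endif
}

The trade-off the thread is circling is visible in the return values: the batched branch cannot report a registration failure to its caller, which is why the error-handling contract matters so much.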

@madolson Please check out the latest commit.

panjf2000 avatar May 20 '24 01:05 panjf2000

@panjf2000, do you mind counting the number of kevent calls before and after this change for a given workload (./valkey-benchmark -r 10000000 -d 1024 -n 10000000 -P 100 -t get,lpop,rpop,spop,zpopmin -q would be fine too)? My understanding is that we only register/de-register events on connection establishment and tear-down. It would be really helpful if you could show direct evidence that the number of kevent calls was indeed significant and is greatly reduced.

panic() on failure makes sense to me. Not being able to register/de-register events is a critical error IMO. The existing error handling of either ignoring the failure or trying to revert the change doesn't make much sense to me. That said, this is indeed a behavior/contract change for the event loop so introducing it with a build macro at the moment is safer. We should also update the README.md to make the behavior change more explicit. I think it is a bit too early to have it controlled by a server config.

PingXie avatar May 20 '24 03:05 PingXie
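
For context, a hedged sketch of the panic-on-failure contract PingXie describes, not the PR's code: a batched flush that treats kevent() failure as fatal. serverPanic is valkey's existing fatal-error macro (declared in server.h); the stub below and the function shape are assumptions so the sketch stands alone.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/event.h>
#include <sys/time.h>

/* Stand-in for valkey's serverPanic macro from server.h. */
#define serverPanic(...) (fprintf(stderr, __VA_ARGS__), abort())

static int flush_and_poll(int kqfd, struct kevent *changes, int *nchanges,
                          struct kevent *events, int nevents,
                          const struct timespec *timeout) {
    /* The kernel applies the whole changelist before reading events. */
    int n = kevent(kqfd, changes, *nchanges, events, nevents, timeout);
    *nchanges = 0;
    if (n == -1 && errno != EINTR)
        serverPanic("kevent() failed to apply batched changes: %s",
                    strerror(errno));
    return n;
}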

@panjf2000, do you mind counting the number of kevent calls before and after this change for a given workload (./valkey-benchmark -r 10000000 -d 1024 -n 10000000 -P 100 -t get,lpop,rpop,spop,zpopmin -q would be fine too)? My understanding is that we only register/de-register events on connection establishment and tear-down. It would be really helpful if you could show direct evidence that the number of kevent calls was indeed significant and is greatly reduced.

panic() on failure makes sense to me. Not being able to register/de-register events is a critical error IMO. The existing error handling of either ignoring the failure or trying to revert the change doesn't make much sense to me. That said, this is indeed a behavior/contract change for the event loop so introducing it with a build macro at the moment is safer. We should also update the README.md to make the behavior change more explicit. I think it is a bit too early to have it controlled by a server config.

We don't only register and unregister events on connection and disconnection; we also toggle AE_WRITABLE whenever there is (or is no longer) pending data to write.

As for counting the system calls to kevent, I've tried before, but dtruss (the macOS equivalent of strace) is hard to use with SIP enabled, and the other option, ktrace, seems to be broken and deprecated. I guess I can try again later...

panjf2000 avatar May 20 '24 03:05 panjf2000
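
A sketch of the churn panjf2000 describes, using valkey's real ae API (aeCreateFileEvent and aeDeleteFileEvent); pending_output, write_handler, and the client pointer are illustrative stand-ins.

#include "ae.h" /* valkey's event-loop API */

/* Hypothetical stand-ins for this sketch. */
int pending_output(void *client);
void write_handler(struct aeEventLoop *el, int fd, void *clientData, int mask);

/* AE_WRITABLE is installed only while output is pending and removed
 * once it drains, so a pipelined workload toggles kqueue registrations
 * far more often than connect/disconnect alone would suggest. */
static void update_write_event(aeEventLoop *el, int fd, void *client) {
    if (pending_output(client))
        aeCreateFileEvent(el, fd, AE_WRITABLE, write_handler, client);
    else
        aeDeleteFileEvent(el, fd, AE_WRITABLE);
}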

Benchmark command

./valkey-benchmark -r 10000000 -d 1024 -n 10000000 -P 100 -t get,lpop,rpop,spop,zpopmin -q

Branch: valkey-io:unstable

CALL                                        COUNT
…
accept                                         64
fcntl                                         174
setsockopt                                    263
kevent                                       1227
read                                        22255
write                                       22257
GET: 3496503.50 requests per second, p50=1.223 msec
LPOP: 3617945.00 requests per second, p50=1.191 msec
RPOP: 3617945.00 requests per second, p50=1.191 msec
SPOP: 3553660.50 requests per second, p50=1.207 msec
ZPOPMIN: 3673769.50 requests per second, p50=1.175 msec

Branch: panjf2000:kqueue-batch

CALL                                        COUNT
…
accept                                         67
fcntl                                         174
setsockopt                                    263
kevent                                        903
read                                        21085
write                                       21114
GET: 3834355.75 requests per second, p50=1.135 msec
LPOP: 3815337.50 requests per second, p50=1.151 msec
RPOP: 3675119.50 requests per second, p50=1.207 msec
SPOP: 3707823.50 requests per second, p50=1.191 msec
ZPOPMIN: 3732736.25 requests per second, p50=1.159 msec

panjf2000 avatar May 20 '24 13:05 panjf2000

Ping @madolson @PingXie

panjf2000 avatar May 21 '24 03:05 panjf2000

@PingXie Related to your comment about event handling, I think we should look more seriously into making sure that connSet(Read|Write)Handler always succeed. Once that is done, we can update the kqueue batch handler. I documented my thoughts here: https://github.com/valkey-io/valkey/issues/528.

madolson avatar May 22 '24 00:05 madolson

Once that is done, we can update the kqueue batch handler.

To make it clearer, by "update the kqueue batch handler", you meant removing the USE_KQUEUE_BATCH macro in this PR and making it the only code path, right? If so, is this PR good to go? @madolson

panjf2000 avatar May 22 '24 03:05 panjf2000

Or did you mean that we need to fix #528 by changing the return type of connSet(Read/Write)Handler from int to void (option 2 in #528, which seems more pragmatic) before we move forward with this PR? If that's it, I can take that on and open a new PR for #528. Or we could do it in this PR directly, which would avoid conflicts between the PRs.

panjf2000 avatar May 22 '24 04:05 panjf2000
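
For reference, a sketch of what option 2's signature change might look like. The suffixed names below are placeholders for side-by-side comparison only; the real setters in connection.h carry no suffix, and the void variant is a proposal from #528, not the merged API.

#include "connection.h" /* declares connection and ConnectionCallbackFunc */

/* Today's contract (simplified): registration can report failure. */
int connSetReadHandlerCurrent(connection *conn, ConnectionCallbackFunc func);

/* Option 2's contract: the setter cannot fail, so callers stop checking;
 * the event backend must then handle fatal errors itself, e.g. by panicking. */
void connSetReadHandlerProposed(connection *conn, ConnectionCallbackFunc func);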

To make it clearer, by "update the kqueue batch handler", you meant removing the USE_KQUEUE_BATCH macro in this PR and making it the only code path, right?

Yeah, this one. I think we can start with the USE_KQUEUE_BATCH macro and remove it once we solve some of the issues with aeCreateFileEvent sometimes returning an error. If it can never return an error, it should be safe to always batch the updates.

madolson avatar May 22 '24 16:05 madolson

To make it clearer, by "update the kqueue batch handler", you meant removing the USE_KQUEUE_BATCH macro in this PR and making it the only code path, right?

Yeah, this one. I think we can start with the USE_KQUEUE_BATCH macro and remove it once we solve some of the issues with aeCreateFileEvent sometimes returning an error. If it can never return an error, it should be safe to always batch the updates.

Great! Then I guess we can continue the code review here? @madolson

panjf2000 avatar May 22 '24 16:05 panjf2000

Ping @madolson

panjf2000 avatar May 26 '24 11:05 panjf2000

@panjf2000 Thanks for your patience, I've just been chasing other stuff. Will finish up and review now!

madolson avatar May 26 '24 17:05 madolson

@panjf2000 Ok. I just can't convince myself this is a good change to take right now. We know we don't handle edge cases well, so I'm not confident this configuration would be stable enough for someone to enable in production. I think we should at least address https://github.com/valkey-io/valkey/issues/528, and then maybe think more about the API contract for the event loop.

We can leave this open and merge at a later date. I just don't really feel comfortable merging it now.

madolson avatar May 26 '24 19:05 madolson

Sometimes I've wondered why Salvatore wrote his own event lib instead of using an existing well-tested one (like libev) and the answer is in this ancient thread: https://groups.google.com/g/redis-db/c/tSgU6e8VuNA

Perhaps it's time to re-evaluate this? The situation may have changed after 15 years.

zuiderkwast avatar May 26 '24 22:05 zuiderkwast

@panjf2000 Ok. I just can't convince myself this is a good change to take right now. We know we don't handle edge cases well, so I'm not confident this configuration would be stable enough for someone to enable in production. I think we should at least address #528, and then maybe think more about the API contract for the event loop.

We can leave this open and merge at a later date. I just don't really feel comfortable merging it now.

I thought the main reason we introduced the USE_KQUEUE_BATCH macro was to alleviate exactly this kind of concern? IMO, it's more of an experimental optimization if we control it via a build macro rather than an official server configuration, and it's disabled by default. Besides, those edge cases were already possible well before this PR, so we aren't introducing new issues here; if anything, enabling this feature may actually expose latent bugs that should be fixed ASAP. @madolson

panjf2000 avatar May 27 '24 01:05 panjf2000