[BUG] Using more sentinels than io-threads causes high idle CPU usage on leader
Describe the bug
Running a higher number of sentinels than io-threads causes significant CPU usage on a leader with no application load: in some cases most of a core.
To reproduce
I can trigger this with:
- 6 sentinels and any number of other nodes
- 7 sentinels and 1 leader without any replicas.
It's not a subtle difference: in the above scenarios if I stop one of the sentinels the leader CPU usage drops to near 0 as expected.
How much CPU is being used by the leader? It depends on the number of IO threads. Rough numbers (they are pretty jittery) on an average virtualized machine, as a percentage of 1 core, for the 6 sentinel case:
- io-threads 1: 0%
- io-threads 2: 20%
- io-threads 3: 35%
- io-threads 4: 55%
- io-threads 5: 75%
- io-threads 6: 85%
- io-threads 7: 0%
- io-threads 8: 0%
In some of my tests the drop-off back to idle CPU usage happened at io-threads >= 5 instead of >= 7, which I haven't quite nailed down yet. However, there is always some number of io-threads above which idle usage drops to 0 as expected.
What is the leader doing? Perf shows that the busyness is attributed entirely to (io-threads - 1) threads doing this:
Percent│ nop
│ 80:┌─→sub $0x1,%eax
0.56 │ │↓ je 8e
│ │getIOPendingCount():
│ 85:│ mov 0x0(%rbp),%rdx
│ │IOThreadMain():
2.44 │ ├──test %rdx,%rdx
97.00 │ └──je 80
│ getIOPendingCount():
Another odd data point: counterintuitively, increasing the value of 'hz' to 50 or above makes the CPU usage go down significantly, but not to 0 where it should be.
Expected behavior
A leader being followed by any sane number of sentinels and 0 application load should have near-0 CPU usage.
Additional information
MONITOR shows me normal PING and PUBLISH traffic that I would expect from sentinels. INFO shows io_threads_active:0 while the unexpected CPU usage is happening.
Valkey 7.2.6, kernel 6.1.99-1.
Happy to collect anything else or to do further debugging with some guidance.
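For anyone who wants to check the same INFO field on their own leader, here is a minimal sketch that parses INFO-style text and reports whether io_threads_active is 0. The function names and the sample text are hypothetical, made up for illustration; only the io_threads_active:0 value comes from this report. It works on text you have already captured, so it does not assume any particular client library.

```python
def parse_info(text):
    """Parse INFO-style 'key:value' lines into a dict, skipping '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that have no 'key:value' shape
            fields[key] = value
    return fields


def io_threads_reported_inactive(text):
    """True when the INFO text reports io_threads_active:0."""
    return parse_info(text).get("io_threads_active") == "0"


if __name__ == "__main__":
    # Illustrative sample only; real INFO output has many more fields.
    sample = "# Server\r\nio_threads_active:0\r\n"
    print(io_threads_reported_inactive(sample))
```

Note that, per this report, the field can read 0 while the IO threads are still burning CPU, so it is not a reliable signal of idleness on its own.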
A variation of this reproduces with 8.0.0-rc2: Unexpected CPU usage is observed with any io-threads setting other than '1', and does not go away if you set io-threads to a large value.
I think I've convinced myself that this is just the io-threads polling in a busy loop under light but non-zero load and is to be expected. I'll plan to make a documentation contribution for that unless someone thinks this is a real problem.
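To illustrate what I mean by polling in a busy loop, here is a rough Python model of the pattern the perf output suggests: a bounded countdown that re-reads a pending counter on every iteration. This is a sketch under assumptions, not the actual Valkey code. The name spin_for_work, the dict-based counter, and the spin budget are all made up for illustration.

```python
import time


def spin_for_work(pending, max_spins=1_000_000):
    """Poll pending['count'] up to max_spins times and return the spins used.

    A return value equal to max_spins means the whole budget was burned
    without ever seeing work, which is pure CPU overhead.
    """
    for spins in range(max_spins):
        if pending["count"] != 0:
            return spins
    return max_spins


if __name__ == "__main__":
    # Hypothetical idle case: nothing ever becomes pending.
    idle = {"count": 0}
    start = time.process_time()
    used = spin_for_work(idle, max_spins=200_000)
    cpu = time.process_time() - start
    print(f"idle: {used} spins, {cpu:.4f}s of CPU time with no work done")

    # Hypothetical busy case: work is already pending, so no spinning happens.
    busy = {"count": 1}
    print(f"busy: {spin_for_work(busy)} spins")
```

If something keeps waking such a loop at a low rate, each wakeup can cost a full spin budget per thread, which would be consistent with usage scaling with the number of IO threads. That is my interpretation rather than something I have confirmed in the source.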
We have observed similar high CPU usage with 8.0.0 in a single-leader setup (no replicas). After upgrading to 8.1.0, the issue was resolved.