
[BUG] Using more sentinels than io-threads causes high idle CPU usage on leader

Open lukepalmer opened this issue 1 year ago • 3 comments

Describe the bug

Running a higher number of sentinels than io-threads causes significant CPU usage on a leader with no application load: in some cases most of a core.

To reproduce

I can trigger this with:

  • 6 sentinels and any number of other nodes
  • 7 sentinels and 1 leader without any replicas.

It's not a subtle difference: in the above scenarios if I stop one of the sentinels the leader CPU usage drops to near 0 as expected.

How much CPU is being used by the leader? It depends on the number of IO threads. Rough numbers (it's pretty jittery) on an average virtualized machine as percentage of 1 core, for the 6 sentinel case:

  • io-threads 1: 0%
  • io-threads 2: 20%
  • io-threads 3: 35%
  • io-threads 4: 55%
  • io-threads 5: 75%
  • io-threads 6: 85%
  • io-threads 7: 0%
  • io-threads 8: 0%
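
As a rough cross-check of the numbers above: if the load is spread evenly over the (io-threads - 1) worker threads that perf attributes it to further down, each busy worker accounts for roughly 17-20% of a core (20, 17.5, about 18.3, 18.75, and 17 for io-threads 2 through 6). A minimal sketch of that arithmetic in C follows; `per_worker_share` is a hypothetical helper, and the even split is an assumption rather than something measured here.

```c
/* Hypothetical helper: CPU share (percent of one core) per spinning worker,
 * assuming the (io_threads - 1) worker threads split the total evenly. */
double per_worker_share(double total_percent, int io_threads) {
    int workers = io_threads - 1;   /* the main thread is not counted as a worker */
    if (workers <= 0) {
        return 0.0;                 /* io-threads 1: no worker threads at all */
    }
    return total_percent / workers;
}
```

For example, `per_worker_share(35.0, 3)` gives 17.5 for the io-threads 3 row. The measurements are jittery, so the per-worker figure should be read as approximate.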

In some of my tests the dropoff back to idle CPU usage happened at io-threads >= 5 instead of >=7 which I haven't quite nailed down yet. However, there is some number of io-threads above which idle usage drops to 0 as expected.

What is the leader doing? Perf shows that the busyness is attributed entirely to (io-threads - 1) threads doing this:

Percent│       nop
       │ 80:┌─→sub    $0x1,%eax
  0.56 │    │↓ je     8e
       │    │getIOPendingCount():
       │ 85:│  mov    0x0(%rbp),%rdx
       │    │IOThreadMain():
  2.44 │    ├──test   %rdx,%rdx
 97.00 │    └──je     80
       │     getIOPendingCount():
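
The annotated loop is consistent with a bounded spin-wait: decrement a counter, load a pending-work value, and loop again while it is zero. The C sketch below illustrates that pattern only. It is not Valkey's source; `io_worker`, `spin_for_work`, and the spin limit passed in are hypothetical names and values chosen for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-thread "pending work" counter. */
typedef struct {
    atomic_ulong pending;
} io_worker;

/* Poll the pending counter until work shows up or the spin budget runs out.
 * Returns true if work was seen. If spins_out is non-NULL it receives the
 * number of polls performed. A real worker would block (for example on a
 * mutex or condition variable) after a false return instead of spinning on. */
bool spin_for_work(io_worker *w, unsigned long limit, unsigned long *spins_out) {
    unsigned long spins = 0;
    bool found = false;
    while (spins < limit) {
        spins++;
        if (atomic_load_explicit(&w->pending, memory_order_acquire) != 0) {
            found = true;
            break;
        }
    }
    if (spins_out != NULL) {
        *spins_out = spins;
    }
    return found;
}
```

Under this assumption, a worker with nothing to do burns its whole spin budget before it can block, so a thread that is woken often but rarely given work spends most of its time in exactly this kind of compare-and-jump loop.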

Another odd data point: counterintuitively, increasing the value of 'hz' to 50 or above makes the CPU usage go down significantly, but not to 0 where it should be.

Expected behavior

A leader being followed by any sane number of sentinels and 0 application load should have near-0 CPU usage.

Additional information

MONITOR shows me normal PING and PUBLISH traffic that I would expect from sentinels. INFO shows io_threads_active:0 while unexpected CPU usage is happening.

Valkey 7.2.6, kernel 6.1.99-1.

Happy to collect anything else or to do further debugging with some guidance.
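
For anyone who wants to script the same check, INFO output is a series of `key:value` lines, so a field such as io_threads_active can be pulled out with a small parser. The C sketch below is one way to do it; `info_field_long` is a hypothetical helper, and it assumes the INFO text has already been fetched into a string by whatever client is in use.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the integer value of `key` from INFO-style
 * "key:value" lines, or `fallback` if no line starts with "key:". */
long info_field_long(const char *info, const char *key, long fallback) {
    size_t klen = strlen(key);
    const char *line = info;
    while (line != NULL && *line != '\0') {
        /* Match only at the start of a line, and require the ':' separator. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            return strtol(line + klen + 1, NULL, 10);
        }
        line = strchr(line, '\n');   /* advance to the next line, if any */
        if (line != NULL) {
            line++;
        }
    }
    return fallback;
}
```

Calling `info_field_long(text, "io_threads_active", -1)` on the output described above would return 0, and -1 if the field is missing.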

lukepalmer · Sep 11 '24 20:09

A variation of this reproduces with 8.0.0-rc2: Unexpected CPU usage is observed with any io-threads setting other than '1', and does not go away if you set io-threads to a large value.

lukepalmer · Sep 12 '24 14:09

I think I've convinced myself that this is just the io-threads polling in a busy loop under light but non-zero load, and that it is to be expected. I plan to make a documentation contribution for that unless someone thinks this is a real problem.

lukepalmer · Sep 12 '24 16:09

We have observed similar high CPU usage with 8.0.0 in a single-leader setup (no replicas). After upgrading to 8.1.0 the issue was resolved.

ramasauskas · Nov 07 '25 06:11