Valkey CPU usage is not a comprehensive indicator of server busyness / load
Hi, I measured how the Valkey main-thread and all-threads CPU usage varies with the QPS served by the server. An 8-vCPU (4-core) C4 VM on GCP is used to run the Valkey server, and a memtier client running on another VM is used for benchmarking. Valkey `io-threads` is set to 4. Tests are conducted on Valkey 8.0.
Observations
The main thread cpu usage grows almost linearly with QPS until the cpu usage approaches ~80%. Beyond that, it transitions to an asymptotic curve.
Similarly, the all threads cpu usage grows almost linearly until it approaches ~400%. Beyond that, the cpu usage remains at 400% while the server is able to drive more QPS.
Here is the client-observed latency:
Here are 2 data points from the graphs for comparison:
| connected_clients | QPS | Main thread cpu usage (s/s) | All threads cpu usage (s/s) | P99 latency (ms) |
|---|---|---|---|---|
| 32 | 217767.63 | 0.98 | 3.98 | 0.2 |
| 128 | 452501.27 | 1 | 4 | 0.38 |
I also profiled the server for these 2 workloads. The table below shows a high-level summary. The flame graphs are attached at the end of the issue.
Main thread profile
| Function | 32 clients | 128 clients |
|---|---|---|
| processIOThreadsReadDone | 42.56% | 76.56% |
| processIOThreadsWriteDone | 4.16% | 7.44% |
| readQueryFromClient | 7.2% | 2.8% |
| handleClientsWithPendingWrites | 3.92% | 0% |
| clock_gettime | 15.68% | 3.84% |
IO-threads profile (combined usage across 3 threads)
| Function | 32 clients | 128 clients |
|---|---|---|
| IOThreadWriteToClient | 57.12% | 141.68% |
| IOThreadReadQueryFromClient | 41.6% | 95.36% |
| IOThreadPoll | 9.44% | 16.8% |
| IOThreadFreeArgv | 3.6% | 8.24% |
Problem
It is difficult to tell whether the server is overloaded based on CPU utilization. For example, the main thread CPU utilization is close to 100% and the all-threads CPU utilization is at 400% both at 217K QPS and at 452K QPS.
Also, users typically want to run the server with CPU usage below a certain threshold (say 80%), so that there is enough headroom for tasks like full sync and snapshots, and to absorb intermittent traffic spikes. Because the CPU usage does not grow linearly with the QPS, the QPS that can be driven in practice under such constraints is only a small fraction of the maximum QPS the server can actually support.
Analysis
For the all-threads CPU usage, the behavior seems to be as expected, since the IO threads run in a busy loop while they are active. So the CPU usage is ~100% per thread as long as there is sufficient load to keep them active (more than 2 events per IO thread with the default configuration).
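For reference, here is a minimal sketch of what such an event-count activation heuristic looks like conceptually (an illustrative approximation, not the actual Valkey source; the function and parameter names are hypothetical):

```c
/* Illustrative approximation of an event-count based IO-thread activation
 * heuristic. Not the actual Valkey implementation; names are hypothetical. */
int target_io_threads(int numevents, int events_per_io_thread,
                      int io_threads_num) {
    /* Aim for one active IO thread per `events_per_io_thread` pending
     * events (default 2) ... */
    int target = numevents / events_per_io_thread;
    /* ... clamped between 1 and the configured io-threads count. */
    if (target > io_threads_num) target = io_threads_num;
    if (target < 1) target = 1;
    return target;
}
```

With a default of 2 events per IO thread, even a moderate event rate is enough to keep all configured IO threads active and busy-looping, which matches the ~400% all-threads usage observed above.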
I am not sure how the asymptotic behavior of the main thread CPU usage can be explained (the server is able to go from serving 217K QPS to serving 452K QPS with the main thread CPU usage staying at ~100%). Is it mainly that the command processing cost is amortized when processing a bigger batch of commands (e.g., memory access amortization with prefetching)? What else could explain this behavior?
Also, the high share of CPU cycles attributed to clock_gettime, especially under moderate load, is odd.
Potential improvements
Some initial ideas around how io-threads can be administered more effectively:
- Improve the heuristics around when IO threads are activated. Currently, the number of active io threads is determined based on the event load; the default is 2 events per io thread. This seems a bit aggressive for the above setup and workload, since under a moderate workload the IO threads spend most of their time busy-looping without useful work.
- It might not be trivial to tune events-per-io-thread config since the optimal value could vary with workload, cpu type etc.
- Are there any other indicators of load (other than the event count) that might be more effective here? For example, the CPU duration spent by IO threads doing actual work?
- Alternatively, we could consider not running the io-threads in a busy loop, for example using condition variable signaling to activate them when work is actually queued by the main thread. This would minimize the cpu usage for the io-threads. However, I guess this could impact the latency. Not sure if any test data is available that measures the impact.
- Add observability around io-threads busyness. For example, the CPU duration spent by IO threads doing actual work, IO queue depths etc. (see the sketch after this list).
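To make the observability idea concrete, here is a minimal sketch of how an IO thread could account for the time it spends doing actual work, so that a busy fraction can be exposed (for example via INFO) or used as feedback for thread activation. This is a hedged sketch under assumed names; `io_thread_stats` and `account_iteration` are hypothetical, not existing Valkey code:

```c
#include <time.h>

/* Hypothetical per-IO-thread counters, in nanoseconds. */
typedef struct {
    unsigned long long busy_ns;  /* time spent processing read/write jobs */
    unsigned long long total_ns; /* total wall-clock time spent in the loop */
} io_thread_stats;

unsigned long long monotonic_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Called once per busy-loop iteration. `jobs_pending` and the job-processing
 * step stand in for the real read/write handling. Timestamping every
 * iteration has its own cost (see the clock_gettime share in the profiles),
 * so in practice this could be sampled rather than done unconditionally. */
void account_iteration(io_thread_stats *st, int jobs_pending) {
    unsigned long long iter_start = monotonic_ns();

    if (jobs_pending) {
        unsigned long long work_start = monotonic_ns();
        /* ... IOThreadReadQueryFromClient / IOThreadWriteToClient work ... */
        st->busy_ns += monotonic_ns() - work_start;
    }

    st->total_ns += monotonic_ns() - iter_start;
    /* Busy fraction = busy_ns / total_ns. This could be reported as a new
     * INFO field and/or used instead of the raw event count when deciding
     * how many IO threads to keep active. */
}
```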
As for the main thread cpu usage, it is unclear how the server is able to go from serving 217K QPS to serving 452K QPS, with the main thread cpu usage staying at 100%.
Any other ideas on how we can enhance server load monitoring?
Appendix
Test setup
The following command is used to run the Valkey 8.0 server on a c4-highmem-8 VM on GCP (8 vCPUs, 62 GB RAM):
src/valkey-server --io-threads 4 --save --protected-mode no
We first pre-populate 12 million keys (2KB values each) into the instance from a memtier client running on another GCP VM.
Then, we run the following memtier command from the client VM, with varying CLIENTS and THREADS count.
memtier_benchmark --server ${IP:?} -p ${PORT:?} -d 2048 --pipeline 1 --key-minimum 1 --key-maximum=12000000 --ratio 1:4 --key-pattern R:R --test-time 60 --print-percentiles 50,90,95,99,99.9 --hide-histogram --clients ${CLIENTS:?} --threads ${THREADS:?}
We also run a script that captures the CPU stats from the Valkey instance every 15 seconds (sampling_interval). The stats are later processed to compute the main-thread and all-threads CPU usage for the instance:
- Main thread CPU usage = Delta(used_cpu_sys_main_thread + used_cpu_user_main_thread) / sampling_interval
- All threads CPU usage = Delta(used_cpu_sys + used_cpu_user) / sampling_interval
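For example, if the main thread accumulates roughly 14.7 CPU-seconds (user + sys) over a 15-second window, its usage is 14.7 / 15 ≈ 0.98 s/s, which corresponds to the 32-client data point in the table above.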
Flame graph with 32 clients load
Overall
Main thread
Flame graph with 128 clients load
Overall
Main thread
@uriyage
Very interesting. We know for sure that the CPU consumption is not linear with the QPS growth (as you correctly stated, due to improvements such as batch prefetching). It is also non-linear because the mechanism that puts the engine to sleep will sometimes consume CPU cycles even when the amount of work is low, and will use the same cycles when the work coming from the io-threads increases.
I think the main point here (as you also stated) is to provide a good way to evaluate the engine load in order to make maintenance decisions (scaling etc.). We will have to put some thought into it.
Great observations!
> Add observability around io-threads busyness. For example, the cpu duration spent by IO threads doing actual work, IO queue depths etc.
Yes, CPU utilization is not a useful metric for a thread that does some amount of busy-looping. We could provide some actual-work metric. We just need to keep track of the fraction of wall clock time that the thread is doing useful work compared to how long it is executing.
> As for the main thread cpu usage, it is unclear how the server is able to go from serving 217K QPS to serving 452K QPS, with the main thread cpu usage staying at 100%.
Yeah, this is likely due to memory prefetching. The main thread effectively executes commands in batches. With more load, the IO threads queue up more commands, and the main thread can prefetch keys in larger batches, thus spending less time waiting for slow memory accesses.
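As a rough illustration of that batching effect (a generic sketch, not Valkey's actual prefetch code; `struct entry` and `handle_entry` are hypothetical):

```c
#include <stddef.h>

struct entry;                              /* opaque, stands in for a key/value */
extern void handle_entry(struct entry *);  /* hypothetical per-command work */

/* Generic batching-with-prefetch pattern: issue prefetches for a whole batch
 * of entries up front, then process them, so memory latency overlaps with
 * useful work. Larger batches hide more memory latency, which is one reason
 * higher load can raise QPS without raising main-thread CPU time. */
void process_batch(struct entry **entries, size_t n) {
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(entries[i], 0, 1);  /* GCC/Clang builtin: read hint */
    for (size_t i = 0; i < n; i++)
        handle_entry(entries[i]);
}
```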
@PingXie
> Alternatively, we could consider not running the io-threads in a busy loop, for example using condition variable signaling to activate them when work is actually queued by the main thread. This would minimize the cpu usage for the io-threads. However, I guess this could impact the latency. Not sure if any test data is available that measures the impact.
For reference, I experimented with using condition variables to replace the existing IO threads busy loop to improve CPU efficiency. However, this resulted in increased latency and lower QPS. I also tried using umwait[1] and umonitor[2], but abandoned these due to performance issues.
[1] https://www.felixcloutier.com/x86/umwait
[2] https://www.felixcloutier.com/x86/umonitor
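For context, a minimal sketch of what such a condition-variable wakeup path can look like (illustrative only, not the actual experiment; the queue type and job handling are hypothetical):

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical job queue shared between the main thread and an IO thread.
 * Bounds checking and error handling omitted for brevity. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    void *jobs[1024];
    size_t count;
    int shutdown;
} io_job_queue;

/* Main thread: enqueue a job and wake the sleeping IO thread. */
void io_queue_push(io_job_queue *q, void *job) {
    pthread_mutex_lock(&q->lock);
    q->jobs[q->count++] = job;
    pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/* IO thread: sleep until work is queued instead of busy-looping. The wakeup
 * on this path is what adds the extra latency discussed in this thread. */
void *io_thread_main(void *arg) {
    io_job_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0 && !q->shutdown)
            pthread_cond_wait(&q->cond, &q->lock);
        if (q->shutdown) {
            pthread_mutex_unlock(&q->lock);
            break;
        }
        void *job = q->jobs[--q->count];
        pthread_mutex_unlock(&q->lock);
        /* ... perform the read/write work for `job` ... */
        (void)job;
    }
    return NULL;
}
```

The tradeoff is the wakeup cost on every handoff (a futex syscall on Linux) versus the idle CPU cycles the busy loop burns, which is exactly the latency-versus-CPU tradeoff described above.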
Thanks @ranshid and @zuiderkwast for shedding light on why the main thread CPU consumption is not linear with the QPS growth. It makes sense.
Thinking about practical implications of this behavior. Since most applications would not want to run Valkey at 100% main thread cpu utilization (to leave some room for other operations like full sync etc.), in practice, they would be able to achieve only a small fraction of the max QPS that the Valkey server can offer.
For example, for this workload, if operating at 90% main thread cpu usage, the QPS would be ~170K, whereas the max QPS the server can offer can be up to ~465K.
Does that match your understanding?
> Yes, CPU utilization is not a useful metric for a thread that does some amount of busy-looping. We could provide some actual-work metric. We just need to keep track of the fraction of wall clock time that the thread is doing useful work compared to how long it is executing.
Yeah, adding this metric makes sense. Btw, would it also make sense to consider using this io-thread busyness as the feedback mechanism for adjusting the number of active io threads (instead of event count)?
> For reference, I experimented with using condition variables to replace the existing IO threads busy loop to improve CPU efficiency. However, this resulted in increased latency and lower QPS. I also tried using umwait[1] and umonitor[2], but abandoned these due to performance issues.
That's very interesting! Do you happen to have the results handy?
> As for the main thread cpu usage, it is unclear how the server is able to go from serving 217K QPS to serving 452K QPS, with the main thread cpu usage staying at 100%.
@zuiderkwast, @sumish163 This is because the main thread is busy-waiting, similar to the IO threads, while it waits for jobs to return from the IO threads. The main thread does not call epoll_wait until all jobs return. Instead, it offloads the epoll call to one of the IO threads and continues running the main event loop.
As you mentioned, putting the threads/main thread to sleep on a condition variable is very expensive performance-wise, as the sleep can take up to 50 microseconds even if we wake it up immediately. The design was meant to maximize throughput and minimize latency, not to minimize CPU usage.
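To illustrate the described behavior, here is a hedged sketch (not the actual Valkey event-loop code; `pending_io_jobs` is hypothetical and the signatures are simplified, while the `processIOThreads*Done` names come from the flame graphs above):

```c
/* Declarations simplified for illustration; not real Valkey signatures. */
extern int  pending_io_jobs(void);
extern void processIOThreadsReadDone(void);
extern void processIOThreadsWriteDone(void);

/* While IO jobs are outstanding, the main thread does not block in
 * epoll_wait (that call is handed off to one of the IO threads). Instead it
 * spins here collecting completed jobs, which is why main-thread CPU usage
 * reads ~100% even well below the maximum QPS. */
void wait_for_io_jobs(void) {
    while (pending_io_jobs() > 0) {
        processIOThreadsReadDone();   /* drain completed read jobs */
        processIOThreadsWriteDone();  /* drain completed write jobs */
    }
}
```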
I agree we can implement a better mechanism than the event count to determine the required number of IO threads. The event-count mechanism has existed since 2019 and could be improved.