Maxmemory key evictions can cause CPU starvation under heavy workload
Problem description
KeyDB uses Redis' maxmemory eviction mechanism to evict keys, according to a user-defined policy, when it reaches its maximum memory capacity. We observed "0 QPS" throughput when storage integration is enabled through the iStorage plugin, or under heavy client workloads (see the memtier command in the "To reproduce" section for the workload we ran during our tests).
Our investigation indicates that when a heavy workload is still running at the moment maxmemory is reached, key "insert" requests are held until the maxmemory eviction has completed, and only then are the keys inserted into KeyDB (both in the in-memory kv cache use case and in the storage-backed kv store use case). The CPU is consumed by large volumes of evictions while other threads are starved of CPU cycles, resulting in "0 QPS" throughput.
For the kv-store use case, this happens repeatedly once maxmemory is reached, because a large number of maxmemory eviction calls remain to be processed.
Recommendation of a fix
A fix for the issue is to improve the current maxmemory eviction mechanism as described below. The goal is to spread the eviction load over time rather than leaving all of it as a bottleneck at the moment the maximum memory is reached (this is especially important for the kv-store use case):
- We could introduce a new Redis setting called "maxmemory-eviction-threshold-percent". It would let the user decide at what percentage of maxmemory KeyDB should start running small-scale, slow-paced key evictions in the background. Because memory is not yet exhausted at that point, KeyDB would not block (lock) key inserts; instead, background jobs would evict small sets of keys and increase the eviction rate over time. The heavy eviction load is then not left until the end but split into smaller jobs that complete sooner. For example, with "maxmemory-eviction-threshold-percent" set to 80%, evictions would start slowly at 80% of maxmemory and speed up gradually until the maximum memory capacity is reached.
- Because keys must also be written to storage (e.g. an SSD use case), this maxmemory eviction enhancement is critical; without it, performance suffers. The same impact occurs when keys are read from storage and warmed up in memory while memory is at its maximum capacity (a typical SSD use case).
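The ramp-up proposed above can be sketched as follows. This is a minimal illustration in Python, not KeyDB code: "maxmemory-eviction-threshold-percent" is the setting proposed in this issue and does not exist today, and the function name eviction_batch_size, the max_batch parameter, and the linear ramp are all assumptions made for the example.

```python
GB = 1024 ** 3


def eviction_batch_size(used_bytes, maxmemory_bytes, threshold_percent, max_batch=1000):
    """Hypothetical helper: how many keys one background pass should try to evict.

    Below the threshold nothing is evicted. Between the threshold and
    maxmemory the batch grows linearly. At or above maxmemory the full
    batch is used.
    """
    if maxmemory_bytes <= 0:
        return 0  # maxmemory 0 means "no limit", so there is nothing to evict
    threshold_bytes = maxmemory_bytes * threshold_percent / 100.0
    if used_bytes >= maxmemory_bytes:
        return max_batch
    if used_bytes <= threshold_bytes:
        return 0
    # Fraction of the way from the threshold to maxmemory, in (0, 1).
    fraction = (used_bytes - threshold_bytes) / (maxmemory_bytes - threshold_bytes)
    return max(1, int(fraction * max_batch))


if __name__ == "__main__":
    # Walk memory usage from 3.5 GB to 5 GB with maxmemory 5 GB, threshold 80%.
    for tenths in range(35, 51, 3):
        used = tenths * GB // 10
        print(f"used={tenths / 10:.1f} GB -> evict up to "
              f"{eviction_batch_size(used, 5 * GB, 80)} keys per pass")
```

A linear ramp is only one possible choice. The point is that eviction work starts before inserts have to wait for it, so no single pass has to free a large backlog at once.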
To reproduce
To reproduce the problem, set a small maxmemory (for example, 5 GB) and a larger maxstorage (for example, 20 GB). Then run the following memtier command to generate a heavy workload:
memtier_benchmark -s 192.168.0.65 -p 7001 -t 5 -c 80 -n 500000 -d 32 --key-minimum=1 --key-maximum=200000000 --ratio 1:0 --key-pattern=P:P --hide-histogram
Note: The memtier client tool reports many "0 QPS" throughput samples right after maxmemory is reached. I used a master/slave configuration to reproduce this issue, with the following settings:
[Master]:
bind 192.168.0.65
port 7001
tcp-keepalive 30
timeout 0
maxmemory 5gb
maxclients 10010
save ""
client-output-buffer-limit normal 0 0 0
databases 1
maxmemory-policy allkeys-lru
repl-backlog-size 256mb
server-threads 4
maxstorage 20gb
storage-provider flash /root/data
repl-backlog-disk-reserve 1gb
force-backlog-disk-reserve no
client-output-buffer-limit replica 2000000kb 2000000kb 60
protected-mode no
[Note]: The slave configuration is the same, but on a different VM.
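To see why this workload is guaranteed to hit maxmemory, the size of the memtier run above can be worked out from its flags. The short Python calculation below is a back-of-the-envelope sketch: it assumes every request writes a distinct key (which is what --ratio 1:0 with --key-pattern=P:P over this key range should give) and counts only the raw value payload, ignoring key names and per-key overhead.

```python
# Flags taken from the memtier_benchmark command above.
threads = 5                    # -t 5
clients_per_thread = 80        # -c 80
requests_per_client = 500_000  # -n 500000
value_bytes = 32               # -d 32

# Total SET requests issued by the run (--ratio 1:0 means writes only).
total_requests = threads * clients_per_thread * requests_per_client

# Raw value payload only; keys and per-key overhead come on top of this.
raw_value_bytes = total_requests * value_bytes

# "maxmemory 5gb" in the configuration is 5 * 1024^3 bytes.
maxmemory_bytes = 5 * 1024 ** 3

print(f"total requests:    {total_requests:,}")
print(f"raw value payload: {raw_value_bytes:,} bytes")
print(f"maxmemory:         {maxmemory_bytes:,} bytes")
print(f"payload exceeds maxmemory: {raw_value_bytes > maxmemory_bytes}")
```

The total request count equals --key-maximum (200000000), and the value payload alone is already larger than maxmemory, so evictions must start well before the run finishes.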
Expected behavior
With the fix in place, "0 QPS" throughput should no longer appear when running the same load with the memtier command shown above.
Additional information
I can provide additional information on request. Because of the same issue, I also saw "0 QPS" throughput when reading keys from storage (e.g. SSD) during the maxmemory eviction process.
@JohnSully
maxmemory eviction cpu starvation.docx
Attached is a screenshot of the "0 QPS" reported by the memtier client while the maxmemory eviction process is running.
@paulmchen @msotheeswaran Hi. I have encountered a similar problem. Have you resolved this issue yet? https://github.com/Snapchat/KeyDB/issues/645