vmstorage 100% CPU usage under mixed load (v1.108.1-cluster)
Summary
We are observing sustained 100% CPU usage in vmstorage under mixed write and read load. We’ve conducted detailed pprof analysis and Grafana monitoring to identify potential bottlenecks. Below are the findings and attached profiles.
Environment
- VictoriaMetrics version: v1.108.1-cluster
- Cluster mode: Yes (vmstorage, vminsert, vmselect)
- Platform: GKE (Kubernetes)
- Persistent storage: HDD-backed PVC

Node Allocation:
- vmstorage: 1 node n2d-highmem-2 (2 vCPU, 16 GB RAM)
- vminsert: 2 nodes e2-medium (2 vCPU, 4 GB RAM)
- vmselect: 4 nodes e2-medium (2 vCPU, 4 GB RAM)
Observations
🔸 CPU usage
- vmstorage CPU usage remains at ~100% for prolonged periods.
- Spikes correlate with both query load and ingestion peaks.
What we've tried
- Captured 180s /debug/pprof/profile dumps, following https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#profiling (see the sketch after this list).
- Scaled vertically by increasing CPU/memory limits — no improvement.
- Scaled horizontally by adding vmstorage replicas — no improvement.
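For reference, a minimal sketch of how such a capture can be made (host name is a placeholder; 8482 is vmstorage's default HTTP port, and the `seconds` parameter is handled by Go's standard net/http/pprof handler):

```sh
# Capture a 180-second CPU profile from vmstorage.
curl -s 'http://<vmstorage-host>:8482/debug/pprof/profile?seconds=180' > vmstorage-cpu-180s.pprof
```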
Workaround
Temporarily mitigating CPU saturation by:
- Cordoning the vmstorage node in Kubernetes.
- Deleting the pod to force re-scheduling.
This temporarily relieves pressure but is not sustainable.
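For completeness, the workaround boils down to something like the following (node, pod and namespace names are placeholders):

```sh
# Mark the saturated node unschedulable so no new pods land on it.
kubectl cordon <node-running-vmstorage>

# Delete the vmstorage pod so its controller re-schedules it onto another node.
kubectl delete pod vmstorage-0 -n <namespace>
```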
Attachments
- vmstorage-cpu-180s.pprof (vmstorage-cpu-180s.pprof.zip)
- Grafana graphs
Questions
- Is there any known issue related to 100% CPU usage in vmstorage?
- Can you please analyze the attached /debug/pprof/profile outputs and advise what might be causing the high CPU usage?
Thank you!
Hello, I also recommend checking the persistent disk metrics for your installation. It may be hitting a GCE limit: https://cloud.google.com/compute/docs/disks/performance#performance_limits
VictoriaMetrics exposes the following disk throttling metrics itself:
rate(vm_filestream_read_duration_seconds_total)
rate(vm_filestream_write_duration_seconds_total)
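For example, these can be checked with instant queries against vmselect's Prometheus-compatible API (host, port 8481 and accountID 0 are cluster defaults and may differ in your setup; an explicit 5m lookbehind window is added):

```sh
# Seconds of wall-clock time spent reading/writing files per second of real time;
# values approaching 1 mean the disk is busy nearly all the time.
curl -s 'http://<vmselect-host>:8481/select/0/prometheus/api/v1/query' \
  --data-urlencode 'query=rate(vm_filestream_read_duration_seconds_total[5m])'
curl -s 'http://<vmselect-host>:8481/select/0/prometheus/api/v1/query' \
  --data-urlencode 'query=rate(vm_filestream_write_duration_seconds_total[5m])'
```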
Hello Nikolay,
Thanks for the suggestion.
We reviewed rate(vm_filestream_write_duration_seconds_total[5m]) and rate(vm_filestream_read_duration_seconds_total[5m]) and found a strong correlation with periods of sustained 100% CPU usage on vmstorage.
Y-axis interpretation: These metrics show how much time is spent on I/O per second (unit: seconds/second). Example: 0.03 = ~30 ms/sec spent writing or reading.
Key findings: Write latency reached 0.03–0.06 → ~30–60 ms/sec. Read latency spiked to 0.5 → ~500 ms/sec.
These spikes align with CPU saturation periods.
It's unclear whether CPU saturation causes the I/O latency spikes or if I/O delays are driving CPU load.
Hi guys, I have the same issue.
Node Allocation:
- vmstorage: 3 nodes n2d-highmem-4
- vminsert: 6 nodes e2-medium
- vmselect: 3 nodes e2-medium
Attachment: vmstorage-cpu-120s.pprof (vmstorage-cpu-120s.pprof.zip)
Key findings: Write latency reached 0.03–0.06 → ~30–60 ms/sec. Read latency spiked to 0.5 → ~500 ms/sec.
These spikes align with CPU saturation periods.
It's unclear whether CPU saturation causes the I/O latency spikes or if I/O delays are driving CPU load.
Thanks for the graphs and metrics. I think switching from HDD to SSD storage should resolve your performance issue.
Disk read spikes above 0.5 indicate high disk I/O saturation. It is most likely related to churn rate spikes (it's hard to say without churn rate graphs, but this is the most common cause). During new series creation VictoriaMetrics performs on-disk seek requests against indexDB, which can cause high CPU usage when the disk is saturated.
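If churn rate graphs aren't at hand, a quick way to check it is to query the new-series counter directly (same placeholder host/port as above; vm_new_timeseries_created_total is the metric used for the churn rate panel in the official dashboards, to the best of my knowledge):

```sh
# New time series registered per second across vmstorage nodes ("churn rate").
curl -s 'http://<vmselect-host>:8481/select/0/prometheus/api/v1/query' \
  --data-urlencode 'query=sum(rate(vm_new_timeseries_created_total[5m]))'
```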
Hi Nikolay,
Thanks for the advice.
We observed a significant churn rate spike of 153 new time series/sec at 2025-05-10 21:27:00 UTC+3. However, it's highly unlikely that this spike is directly responsible for the sustained 100% CPU usage on vmstorage that began approximately 16 hours later, around 2025-05-11 13:30–14:00 UTC+3.
To help us further understand the root cause, could you please take a look at the vmstorage-cpu-180s.pprof capture recorded during the 100% CPU window? We would appreciate your insight into what might be consuming CPU resources so intensively. vmstorage-cpu-180s.pprof.zip
Hi @f41gh7 , Did you have a chance to take a look at the Churn rate?
Hello, sorry for the delay. The churn rate looks good to me; it doesn't have any huge spikes. According to the provided profile, vmstorage spends CPU on data ingestion and background merges.
I'd definitely recommend giving the SSD migration a try. It's the most probable cause of the performance issue.
For testing purposes only, it's possible to restore the current cluster from backup into a new one with the same compute resources but on SSD disks, and replicate the current ingestion into the new cluster. That should provide performance insights; a rough sketch of such a flow is below.
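A rough sketch of that test flow with the stock vmbackup/vmrestore tools, assuming a GCS bucket and vmstorage's default HTTP port (all names and paths are placeholders):

```sh
# On each existing (HDD-backed) vmstorage node: create a snapshot and upload it.
vmbackup -storageDataPath=/vmstorage-data \
  -snapshot.createURL=http://localhost:8482/snapshot/create \
  -dst=gs://<bucket>/vmstorage-backup

# On each new (SSD-backed) vmstorage node: restore the data before starting vmstorage.
vmrestore -src=gs://<bucket>/vmstorage-backup \
  -storageDataPath=/vmstorage-data
```

Current ingestion could then be duplicated into the new cluster, e.g. via a second -remoteWrite.url on vmagent, if vmagent fronts the ingestion path (an assumption about your setup).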
Hi @f41gh7 ,
Not sure we can proceed with the HDD-to-SSD migration, as it's too costly for us.
Can you please review the ChatGPT analysis of the good ~36% CPU usage capture (vmstorage-cpu-180s-good.pprof.zip) versus the 100% CPU usage capture (vmstorage-cpu-180s.pprof.zip) and share some ideas for a short-/long-term solution?
- 100% CPU usage: Analysis of CPU Profile (vmstorage_cpu_180s).pdf
- 100% CPU usage vs ~36% CPU usage: CPU Profile Analysis – Improved 36% CPU Usage vs Previous 100% CPU Usage.pdf
- Summary: vmstorage_cpu_analysis.pdf
High CPU Usage (~100%) - Root Cause:
The profiling revealed that over 80% of CPU time was consumed by storage part merging and sorting:
- mergePartsInternal
- mergeBlockStreams
- mergeBlockStreamsInternal
- blockStreamWriter.WriteExternalBlock
- sort.pdqsort (recursive sorting)
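This breakdown can be reproduced from the attached capture with the standard Go pprof tooling, e.g.:

```sh
# Print the top CPU consumers from the attached profile (requires the Go toolchain).
go tool pprof -top vmstorage-cpu-180s.pprof

# Or explore the call graph / flame graph interactively in a browser.
go tool pprof -http=:8080 vmstorage-cpu-180s.pprof
```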
Suggestions:
1. Increase merge thresholds:
   - -storage.minMergeMultiplier=2.0-3.0
   - -storage.maxPartsInMerge=10-20
2. Reduce creation of small parts:
   - -insert.maxQueueDuration=1-2s
   - -storage.maxInmemoryPartSize=64MB
3. Control background merger:
   - Try: -storage.disableBackgroundMerge
4. Reduce sort pressure:
   - Batch ingestion data
   - Deduplicate upstream if possible
@vskovpan-harmonicinc , sorry for the delay. I think the following feature should help in your case: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6014
Once it's implemented, the background merge speed can be limited. That should reduce disk I/O pressure (and the CPU spent in I/O wait) at the cost of slower merges and potentially higher disk space usage.
Thanks @f41gh7 for the update. If there are any upcoming optimizations or suggestions that could help mitigate the 100% CPU usage in vmstorage, we’d appreciate it if you could keep us informed.