
vmstorage 100% CPU usage under mixed load (v1.108.1-cluster)

Open vskovpan-harmonicinc opened this issue 7 months ago • 10 comments

Summary

We are observing sustained 100% CPU usage in vmstorage under mixed write and read load. We’ve conducted detailed pprof analysis and Grafana monitoring to identify potential bottlenecks. Below are the findings and attached profiles.


Environment

  • VictoriaMetrics version: v1.108.1-cluster
  • Cluster mode: Yes (vmstorage, vminsert, vmselect)
  • Platform: GKE (Kubernetes)
  • Persistent storage: HDD-backed PVC

Node Allocation:

  • vmstorage: 1 node n2d-highmem-2 (2 vCPU, 16 GB RAM)
  • vminsert: 2 nodes e2-medium (2 vCPU, 4 GB RAM)
  • vmselect: 4 nodes e2-medium (2 vCPU, 4 GB RAM)

Observations

🔸 CPU usage

  • vmstorage CPU usage remains at ~100% for prolonged periods.
  • Spikes correlate with both query load and ingestion peaks.

What we've tried

  • Captured 180-second /debug/pprof/profile snapshots, following https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#profiling (example command below).
  • Scaled vertically by increasing CPU/memory limits — no improvement.
  • Scaled horizontally by adding vmstorage replicas — no improvement.
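
For reference, the profiles can be captured roughly like this (the pod name and namespace are illustrative; 8482 is vmstorage's default HTTP port):

  # Port-forward to the vmstorage pod
  kubectl port-forward pod/vmstorage-0 8482:8482

  # Capture a 180-second CPU profile from the built-in Go pprof endpoint
  curl -s 'http://localhost:8482/debug/pprof/profile?seconds=180' > vmstorage-cpu-180s.pprof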

Workaround

Temporarily mitigating CPU saturation by:

  • Cordoning the vmstorage node in Kubernetes.
  • Deleting the pod to force rescheduling (example commands below).
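
Roughly (the node name, pod name and namespace are illustrative):

  # Stop new pods from being scheduled onto the saturated node
  kubectl cordon gke-vmstorage-node-1

  # Delete the vmstorage pod so it gets rescheduled elsewhere
  # (assumes vmstorage is managed by a StatefulSet/Deployment that recreates it)
  kubectl delete pod vmstorage-0 -n monitoring

  # Re-enable scheduling once CPU pressure subsides
  kubectl uncordon gke-vmstorage-node-1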

This temporarily relieves pressure but is not sustainable.


Attachments

  (pprof CPU profiles and Grafana screenshots attached.)

Questions

  • Is there any known issue related to 100% CPU usage in vmstorage?
  • Can you please analyze the attached /debug/pprof/profile outputs and advise what might be causing the high CPU usage?

Thank you!

vskovpan-harmonicinc · May 11 '25 20:05

Hello, I also recommend checking the persistent disk metrics for your installation. It may be hitting a GCE limit: https://cloud.google.com/compute/docs/disks/performance#performance_limits

VictoriaMetrics itself exposes the following disk throttling metrics:

rate(vm_filestream_read_duration_seconds_total)
rate(vm_filestream_write_duration_seconds_total)
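
For example, these can be graphed directly in Grafana, or queried through vmselect's Prometheus-compatible API roughly like this (the vmselect hostname, the default port 8481 and tenant 0 are assumptions for this sketch):

  # Fraction of each second spent reading/writing files, per vmstorage instance
  curl -sG 'http://vmselect:8481/select/0/prometheus/api/v1/query' \
    --data-urlencode 'query=sum(rate(vm_filestream_read_duration_seconds_total[5m])) by (instance)'
  curl -sG 'http://vmselect:8481/select/0/prometheus/api/v1/query' \
    --data-urlencode 'query=sum(rate(vm_filestream_write_duration_seconds_total[5m])) by (instance)'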

f41gh7 · May 12 '25 12:05

Hello Nikolay,

Thanks for the suggestion.

We reviewed rate(vm_filestream_write_duration_seconds_total[5m]) and rate(vm_filestream_read_duration_seconds_total[5m]) and found a strong correlation with periods of sustained 100% CPU usage on vmstorage.

Y-axis interpretation: These metrics show how much time is spent on I/O per second (unit: seconds/second). Example: 0.03 = ~30 ms/sec spent writing or reading.

Key findings: write I/O time reached 0.03–0.06 (≈30–60 ms of writes per second); read I/O time spiked to 0.5 (≈500 ms of reads per second).

These spikes align with CPU saturation periods.

It's unclear whether CPU saturation causes the I/O latency spikes or if I/O delays are driving CPU load.

(Grafana screenshots attached.)

vskovpan-harmonicinc · May 12 '25 16:05

Hi guys, I have the same issue.

Node Allocation:

  • vmstorage: 3 nodes n2d-highmem-4
  • vminsert: 6 nodes e2-medium
  • vmselect: 3 nodes e2-medium

Attachment: vmstorage-cpu-120s.pprof.zip

igorrudyk · May 13 '25 17:05

Key findings: write I/O time reached 0.03–0.06 (≈30–60 ms of writes per second); read I/O time spiked to 0.5 (≈500 ms of reads per second).

These spikes align with CPU saturation periods.

It's unclear whether CPU saturation causes the I/O latency spikes or if I/O delays are driving CPU load.

Thanks for the graphs and metrics. I think switching from HDD to SSD storage should resolve your performance issue.

Disk read spikes above 0.5 indicate high disk I/O saturation. This is most likely related to churn rate spikes (it's hard to say without churn rate graphs, but it's the most common cause). During new series creation, VictoriaMetrics performs on-disk seek requests against indexDB, which can cause high CPU usage when the disk is saturated.
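
For reference, churn rate can be graphed with a query along these lines (a sketch; the vmselect hostname, default port 8481 and tenant 0 are assumptions):

  # Churn rate: new time series registered per second across vmstorage nodes
  curl -sG 'http://vmselect:8481/select/0/prometheus/api/v1/query' \
    --data-urlencode 'query=sum(rate(vm_new_timeseries_created_total[5m]))'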

f41gh7 · May 13 '25 18:05

Hi Nikolay,

Thanks for the advice.

We observed a significant churn rate spike of 153 new time series/sec at 2025-05-10 21:27:00 UTC+3. However, it's highly unlikely that this spike is directly responsible for the sustained 100% CPU usage on vmstorage that began approximately 16 hours later, around 2025-05-11 13:30–14:00 UTC+3.

(Churn rate graph attached.)

To help us further understand the root cause, could you please take a look at the vmstorage-cpu-180s.pprof capture recorded during the 100% CPU window? We would appreciate your insight into what might be consuming CPU resources so intensively. vmstorage-cpu-180s.pprof.zip
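
For anyone looking at the capture, it opens with the standard Go pprof tooling, for example (any free local port works for the web UI):

  # Interactive web UI with flame graph and top views
  go tool pprof -http=:8081 vmstorage-cpu-180s.pprof

  # Or print the hottest functions in the terminal
  go tool pprof -top vmstorage-cpu-180s.pprof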

vskovpan-harmonicinc · May 15 '25 09:05

Hi @f41gh7, did you have a chance to take a look at the churn rate?

igorrudyk · May 19 '25 15:05

Hi @f41gh7, did you have a chance to take a look at the churn rate?

Hello, sorry for the delay. The churn rate looks good to me; it doesn't have any huge spikes. According to the provided profile, vmstorage spends its CPU on data ingestion and background merges.

I'd definitely recommend giving the SSD migration a try. HDD storage is the most probable cause of the performance issue.

For testing purposes only, it's possible to restore the current cluster from a backup into a new one with the same compute resources but on SSD disks, and then replicate the current ingestion into the new cluster. That should provide performance insights (rough sketch below).
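
A very rough sketch of such a test setup (the bucket path, storage class and the vmagent approach are placeholders/assumptions, not a prescribed procedure):

  # 1. Create the new vmstorage PVCs on an SSD-backed StorageClass
  #    (e.g. premium-rwo / pd-ssd on GKE) instead of the HDD-backed one.

  # 2. Restore the latest backup into each new vmstorage data directory
  #    (the gs:// path is a placeholder).
  vmrestore -src=gs://<backup-bucket>/vmstorage-0 -storageDataPath=/vmstorage-data

  # 3. Duplicate live ingestion into the new cluster, e.g. by adding a second
  #    -remoteWrite.url pointing at the new cluster's vminsert in vmagent.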

f41gh7 · May 20 '25 12:05

Hi @f41gh7 ,

We're not sure we can proceed with the HDD-to-SSD migration, as it's too costly for us.

Could you please review the ChatGPT analysis of a good run (~36% CPU usage, vmstorage-cpu-180s-good.pprof.zip) versus the 100% CPU usage run (vmstorage-cpu-180s.pprof.zip) and share some ideas for a short-/long-term solution?

100% CPU usage: Analysis of CPU Profile (vmstorage_cpu_180s).pdf

100% CPU usage vs ~36% CPU usage: CPU Profile Analysis – Improved 36% CPU Usage vs Previous 100% CPU Usage.pdf

Summary: vmstorage_cpu_analysis.pdf

High CPU Usage (~100%) - Root Cause:

The profiling revealed that over 80% of CPU time was consumed by storage part merging and sorting:

  • mergePartsInternal
  • mergeBlockStreams
  • mergeBlockStreamsInternal
  • blockStreamWriter.WriteExternalBlock
  • sort.pdqsort (recursive sorting)

Suggestions:

  1. Increase merge thresholds: -storage.minMergeMultiplier=2.0-3.0, -storage.maxPartsInMerge=10-20
  2. Reduce creation of small parts: -insert.maxQueueDuration=1-2s, -storage.maxInmemoryPartSize=64MB
  3. Control the background merger: try -storage.disableBackgroundMerge
  4. Reduce sort pressure: batch ingestion data; deduplicate upstream if possible

vskovpan-harmonicinc · May 21 '25 10:05

@vskovpan-harmonicinc, sorry for the delay. I think the following feature should help in your case: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6014

Once it's implemented, the background merge speed can be limited. That should reduce disk I/O pressure (and the CPU spent waiting on I/O) at the cost of slower merges and potentially higher disk space usage.

f41gh7 · Jun 11 '25 13:06

Thanks @f41gh7 for the update. If there are any upcoming optimizations or suggestions that could help mitigate the 100% CPU usage in vmstorage, we’d appreciate it if you could keep us informed.

vskovpan-harmonicinc · Jun 12 '25 16:06