Usage plugin reporting negative CPU usage values when using VictoriaMetrics (vmselect) as metrics source
Description
When using the usage plugin with the metrics endpoint configured against VictoriaMetrics vmselect, the scheduler periodically reports negative CPU usage in logs such as:
I1017 12:15:47.433614 1 usage.go:121] node:io-16, cpu usage:map[10m:-5035.937500138729], mem usage:map[10m:55.25365070508814]
I1017 12:15:47.433631 1 usage.go:121] node:io-35, cpu usage:map[10m:-24565.178572592726], mem usage:map[10m:2.2509139279284374]
This occurs even though the nodes are healthy and utilization metrics inside Prometheus/Grafana show valid positive CPU values.
This leads to the scheduler misjudging available resources, affecting job placement decisions.
Steps to reproduce the issue
- Deploy Volcano with the following scheduler configuration:
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
  - name: usage
    enablePredicate: false
    arguments:
      usage.weight: 5
      cpu.weight: 1
      memory.weight: 1
      thresholds:
        cpu: 80
        mem: 70
- plugins:
  - name: predicates
    arguments:
      predicate.CacheEnable: true
metrics:
  type: prometheus
  address: http://vmselect-vmcluster.monitoring-system.svc.cluster.local:8481/select/0/prometheus
  interval: 30s
- Observe the scheduler logs after 5–10 minutes of operation.
- Notice negative CPU usage values for specific nodes despite valid metrics in VictoriaMetrics queries.
Expected Behavior:
CPU usage metrics should always be positive and correspond proportionally with node CPU load observed in Prometheus/VictoriaMetrics.
Describe the results you received and expected
Received: negative CPU usage values for specific nodes (e.g. cpu usage:map[10m:-5035.937500138729] for node io-16), even though the nodes are healthy.
Expected: CPU usage metrics that are always positive and correspond proportionally to the node CPU load observed in Prometheus/VictoriaMetrics.
What version of Volcano are you using?
vc-scheduler:v1.12.2
Any other relevant information
No response
Hi team, can anyone help here?
Is this because the result from this query is negative?
https://github.com/volcano-sh/volcano/blob/8e72e97ce9dfdc9cf77c860a6058be45581a7e85/pkg/scheduler/metrics/source/metrics_client_prometheus.go#L82
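(Decoded from the URL-encoded reproduction below, that query appears to take the form
avg_over_time((100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle", instance="<node>"}[5m])) * 100))[10m])
with the node's instance label substituted for <node>.)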
Hi @JesseStutler - I debugged it. Using rate() instead of irate() inside avg_over_time() fixed the negative CPU values.
Before:
wget -qO- 'http://0.0.0.0:8481/select/0/prometheus/api/v1/query?query=avg_over_time%28%28100%20-%20%28avg%20by%20%28instance%29%20%28irate%28node_cpu_seconds_total%7Bmode%3D"idle"%2Cinstance%3D"io-23"%7D%5B5m%5D%29%29%20*%20100%29%29%5B10m%5D%29'
{"status":"success","isPartial":false,"data":{"resultType":"vector","result":[{"metric":{"instance":"io-23"},"value":[1761890760,"-2465.27777787091"]}]}}
After:
wget -qO- 'http://0.0.0.0:8481/select/0/prometheus/api/v1/query?query=avg_over_time%28%28100%20-%20%28avg%20by%20%28instance%29%20%28rate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%2Cinstance%3D%22io-23%22%7D%5B5m%5D%29%29%20*%20100%29%29%5B10m%5D%29'
{"status":"success","isPartial":false,"data":{"resultType":"vector","result":[{"metric":{"instance":"io-23"},"value":[1761890818,"7.674041666657047"]}]}}
/close
The PR associated with this is merged.
@hajnalmt: Closing this issue.