Usage plugin reporting negative CPU usage values when using VictoriaMetrics (vmselect) as metrics source
Description
When using the usage plugin with the metrics endpoint configured against VictoriaMetrics vmselect, the scheduler periodically reports negative CPU usage in logs such as:
I1017 12:15:47.433614 1 usage.go:121] node:io-16, cpu usage:map[10m:-5035.937500138729], mem usage:map[10m:55.25365070508814]
I1017 12:15:47.433631 1 usage.go:121] node:io-35, cpu usage:map[10m:-24565.178572592726], mem usage:map[10m:2.2509139279284374]
This occurs even though the nodes are healthy and utilization metrics inside Prometheus/Grafana show valid positive CPU values.
This leads to the scheduler misjudging available resources, affecting job placement decisions.
Steps to reproduce the issue
- Deploy Volcano with the following scheduler configuration:
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
  - name: usage
    enablePredicate: false
    arguments:
      usage.weight: 5
      cpu.weight: 1
      memory.weight: 1
      thresholds:
        cpu: 80
        mem: 70
- plugins:
  - name: predicates
    arguments:
      predicate.CacheEnable: true
metrics:
  type: prometheus
  address: http://vmselect-vmcluster.monitoring-system.svc.cluster.local:8481/select/0/prometheus
  interval: 30s
- Observe the scheduler logs after 5–10 minutes of operation.
- Notice negative CPU usage values for specific nodes despite valid metrics in VictoriaMetrics queries.
Expected Behavior:
CPU usage metrics should always be positive and correspond proportionally with node CPU load observed in Prometheus/VictoriaMetrics.
Describe the results you received and expected
Received: negative CPU usage values for specific nodes (e.g. cpu usage:map[10m:-5035.937500138729] for node io-16), even though the nodes are healthy.
Expected: CPU usage metrics that are always positive and correspond proportionally to the node CPU load observed in Prometheus/VictoriaMetrics.
What version of Volcano are you using?
vc-scheduler:v1.12.2
Any other relevant information
No response
Hi team, can anyone help here?
Is this because the result from this query is negative?
https://github.com/volcano-sh/volcano/blob/8e72e97ce9dfdc9cf77c860a6058be45581a7e85/pkg/scheduler/metrics/source/metrics_client_prometheus.go#L82
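(Decoded from the URL-encoded reproduction below, that query appears to take the form
avg_over_time((100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle", instance="<node>"}[5m])) * 100))[10m])
with the node's instance label substituted for <node>.)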
Hi @JesseStutler - I debugged it. Using rate() instead of irate() inside avg_over_time() fixed the negative CPU values.
Before:
wget -qO- 'http://0.0.0.0:8481/select/0/prometheus/api/v1/query?query=avg_over_time%28%28100%20-%20%28avg%20by%20%28instance%29%20%28irate%28node_cpu_seconds_total%7Bmode%3D"idle"%2Cinstance%3D"io-23"%7D%5B5m%5D%29%29%20*%20100%29%29%5B10m%5D%29'
{"status":"success","isPartial":false,"data":{"resultType":"vector","result":[{"metric":{"instance":"io-23"},"value":[1761890760,"-2465.27777787091"]}]}}
After:
wget -qO- 'http://0.0.0.0:8481/select/0/prometheus/api/v1/query?query=avg_over_time%28%28100%20-%20%28avg%20by%20%28instance%29%20%28rate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%2Cinstance%3D%22io-23%22%7D%5B5m%5D%29%29%20*%20100%29%29%5B10m%5D%29'
{"status":"success","isPartial":false,"data":{"resultType":"vector","result":[{"metric":{"instance":"io-23"},"value":[1761890818,"7.674041666657047"]}]}}
/close
The PR associated with this is merged.
@hajnalmt: Closing this issue.