
Custom Pricing ignored on GPU nodes

Open sia-mfierro opened this issue 7 months ago • 4 comments

Describe the bug
GPU nodes do not use the specified custom pricing model. As a test, we set all prices to 1, but the GPU nodes are all over the place, with prices varying even among nodes with the same specs. None of the prices (GPU, RAM, CPU, storage, etc.) are respected.

To Reproduce

  1. Follow the On-prem installation instructions (deploy Prometheus with Helm, deploy OpenCost with Helm), but set all prices in the custom pricing model to 1
  2. Verify (either through Prometheus or directly via the /metrics endpoint) that the prices are wrong for GPU-equipped nodes. It appears that the total node hourly cost is taken to be 1, rather than each component's cost being 1.

The issue happens whether DCGM exporter is scraped or not.
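
The metrics can be checked directly with something like the following (a sketch; adjust the namespace, service name, and port to the deployment, which here exposes opencost.opencost on 9003):

kubectl -n opencost port-forward svc/opencost 9003:9003 &
curl -s http://localhost:9003/metrics | grep -E '^node_(cpu|ram|gpu|total)_hourly_cost'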

Expected behavior
Would expect all node_*_hourly_cost metrics to be 1 (except where not applicable, e.g. no GPUs). It seems, instead, that the total node hourly cost is assumed to be 1.

Logs
Excerpt of the /metrics endpoint: a40-102 has NVIDIA A40 GPUs, a6000-10* nodes have NVIDIA A6000 GPUs, and cpu-* nodes have no GPUs.

node_cpu_hourly_cost{arch="amd64",instance="a40-102",instance_type="",node="a40-102",provider_id="",region=""} 0.000828
node_cpu_hourly_cost{arch="amd64",instance="a6000-101",instance_type="",node="a6000-101",provider_id="",region=""} 0.000926
node_cpu_hourly_cost{arch="amd64",instance="a6000-102",instance_type="",node="a6000-102",provider_id="",region=""} 0.010364
node_cpu_hourly_cost{arch="amd64",instance="a6000-103",instance_type="",node="a6000-103",provider_id="",region=""} 0.010371
node_cpu_hourly_cost{arch="amd64",instance="cpu-101",instance_type="",node="cpu-101",provider_id="",region=""} 1
node_cpu_hourly_cost{arch="amd64",instance="cpu-102",instance_type="",node="cpu-102",provider_id="",region=""} 1
node_cpu_hourly_cost{arch="amd64",instance="cpu-201",instance_type="",node="cpu-201",provider_id="",region=""} 1
# ...
node_gpu_hourly_cost{arch="amd64",instance="a40-102",instance_type="",node="a40-102",provider_id="",region=""} 0.000828
node_gpu_hourly_cost{arch="amd64",instance="a6000-101",instance_type="",node="a6000-101",provider_id="",region=""} 0.000926
node_gpu_hourly_cost{arch="amd64",instance="a6000-102",instance_type="",node="a6000-102",provider_id="",region=""} 0.010364
node_gpu_hourly_cost{arch="amd64",instance="a6000-103",instance_type="",node="a6000-103",provider_id="",region=""} 0.010371
node_gpu_hourly_cost{arch="amd64",instance="cpu-101",instance_type="",node="cpu-101",provider_id="",region=""} 0
node_gpu_hourly_cost{arch="amd64",instance="cpu-102",instance_type="",node="cpu-102",provider_id="",region=""} 0
node_gpu_hourly_cost{arch="amd64",instance="cpu-201",instance_type="",node="cpu-201",provider_id="",region=""} 0
# ...
node_ram_hourly_cost{arch="amd64",instance="a40-102",instance_type="",node="a40-102",provider_id="",region=""} 0.000828
node_ram_hourly_cost{arch="amd64",instance="a6000-101",instance_type="",node="a6000-101",provider_id="",region=""} 0.000926
node_ram_hourly_cost{arch="amd64",instance="a6000-102",instance_type="",node="a6000-102",provider_id="",region=""} 0.010364
node_ram_hourly_cost{arch="amd64",instance="a6000-103",instance_type="",node="a6000-103",provider_id="",region=""} 0.010371
node_ram_hourly_cost{arch="amd64",instance="cpu-101",instance_type="",node="cpu-101",provider_id="",region=""} 1
node_ram_hourly_cost{arch="amd64",instance="cpu-102",instance_type="",node="cpu-102",provider_id="",region=""} 1
node_ram_hourly_cost{arch="amd64",instance="cpu-201",instance_type="",node="cpu-201",provider_id="",region=""} 1
# ...
node_total_hourly_cost{arch="amd64",instance="a40-102",instance_type="",node="a40-102",provider_id="",region=""} 0.9999761281127929
node_total_hourly_cost{arch="amd64",instance="a6000-101",instance_type="",node="a6000-101",provider_id="",region=""} 0.9998497504730224
node_total_hourly_cost{arch="amd64",instance="a6000-102",instance_type="",node="a6000-102",provider_id="",region=""} 0.9999910652618408
node_total_hourly_cost{arch="amd64",instance="a6000-103",instance_type="",node="a6000-103",provider_id="",region=""} 1.000016268951416
node_total_hourly_cost{arch="amd64",instance="cpu-101",instance_type="",node="cpu-101",provider_id="",region=""} 149.28325271606445
node_total_hourly_cost{arch="amd64",instance="cpu-102",instance_type="",node="cpu-102",provider_id="",region=""} 149.28325271606445
node_total_hourly_cost{arch="amd64",instance="cpu-201",instance_type="",node="cpu-201",provider_id="",region=""} 312.07775115966797
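
To make the mismatch easier to see in Prometheus, a rough PromQL comparison (assuming the opencost scrape job below) can be run side by side: with every price set to 1, each component metric itself would be expected to read 1, yet the components neither read 1 on the GPU nodes nor sum to the reported node_total_hourly_cost.

# sum of the per-component hourly prices, per node
node_cpu_hourly_cost + node_ram_hourly_cost + node_gpu_hourly_cost
# reported total, for comparison
node_total_hourly_cost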

Prometheus extra scrape configs:

- job_name: opencost
  honor_labels: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  dns_sd_configs:
  - names:
    - opencost.opencost
    type: 'A'
    port: 9003
- job_name: node-exporter
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http

  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - monitoring  # change to the namespace where node-exporter is deployed

  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
    action: keep
    regex: prometheus-node-exporter  # adjust to match node-exporter's labels

  - source_labels: [__meta_kubernetes_pod_ip]
    target_label: __address__
    replacement: $1:9100  # the port node-exporter exposes

  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)

Custom pricing model ConfigMap:

Name:         custom-pricing-model
Namespace:    opencost
Labels:       app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: opencost
              meta.helm.sh/release-namespace: opencost

Data
====
default.json:
----
{
  "CPU": "1",
  "GPU": "1",
  "RAM": "1",
  "description": "Modified pricing configuration.",
  "internetNetworkEgress": "1",
  "regionNetworkEgress": "1",
  "spotCPU": "1",
  "spotRAM": "1",
  "storage": "1",
  "zoneNetworkEgress": "1",
  "provider" : "custom"
}


BinaryData
====

Events:  <none>
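
For completeness, this ConfigMap is the one generated by the Helm release (see the managed-by label above) from the chart's customPricing settings, roughly along these lines. Treat this as a sketch, since the exact value keys depend on the chart version:

opencost:
  customPricing:
    enabled: true
    configmapName: custom-pricing-model
    createConfigmap: true
    costModel:
      description: Modified pricing configuration.
      CPU: 1
      spotCPU: 1
      RAM: 1
      spotRAM: 1
      GPU: 1
      storage: 1
      zoneNetworkEgress: 1
      regionNetworkEgress: 1
      internetNetworkEgress: 1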

Which version of OpenCost are you using? 1.114.0, 1.115.0 (issue persists after upgrade)

sia-mfierro avatar May 27 '25 04:05 sia-mfierro

I wonder if it might have anything to do with the changes in pkg/cloud/provider/customprovider.go.

I don't understand why an instance of CustomProvider (assuming that is the class used for a custom pricing model) should return an empty string for GPU prices, unless it is an unrelated class.

sia-mfierro avatar May 28 '25 22:05 sia-mfierro

~~When setting gpuLabel and gpuLabelValue and labeling a node accordingly, the GPU price is respected (0.95), but the CPU and RAM prices are still computed according to ratios instead of the values set in the custom pricing model.~~

There was an error in the configuration. Disregard.

However, without explicitly labeling the GPU nodes, the issue persists.
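
For anyone trying the same, gpuLabel and gpuLabelValue can be set in the same custom pricing config, roughly like this (nvidia.com/gpu.present=true is only an example label; substitute whatever label your GPU nodes actually carry):

{
  "CPU": "1",
  "GPU": "1",
  "RAM": "1",
  "description": "Modified pricing configuration.",
  "gpuLabel": "nvidia.com/gpu.present",
  "gpuLabelValue": "true",
  "internetNetworkEgress": "1",
  "regionNetworkEgress": "1",
  "spotCPU": "1",
  "spotRAM": "1",
  "storage": "1",
  "zoneNetworkEgress": "1",
  "provider": "custom"
}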

sia-mfierro avatar Jun 03 '25 01:06 sia-mfierro

This behavior is also present in 1.118.0

romanpanov993 avatar Nov 18 '25 09:11 romanpanov993

Hey folks, the issue still persists and blocks the use of custom pricing with GPU nodes. Any estimate on when this behaviour will be fixed?

zorek187 avatar Nov 18 '25 13:11 zorek187