Custom Pricing ignored on GPU nodes
Describe the bug
GPU nodes do not use the specified custom pricing model. As a test, we set all prices to 1, but the prices reported for GPU nodes vary widely, even among nodes with the same specs. None of the configured prices (GPU, RAM, CPU, storage, etc.) are respected.
To Reproduce
- Follow the On-prem installation instructions (deploy Prometheus with Helm, deploy OpenCost with Helm), but set all prices in the custom pricing model to 1.
- Verify (either through Prometheus or directly via the /metrics endpoint) that the prices are wrong for GPU-equipped nodes. It seems, instead, that the total node hourly cost is assumed to be 1, rather than the cost of each component.
The issue happens whether DCGM exporter is scraped or not.
Expected behavior
I would expect the node_*_hourly_cost metrics to all be 1 (except where not applicable, e.g. no GPUs). It seems, instead, that the total node hourly cost is assumed to be 1.
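To make the expectation checkable, here is a minimal sketch that scans the text from the /metrics endpoint and lists every node_cpu_hourly_cost, node_gpu_hourly_cost, or node_ram_hourly_cost sample that differs from the configured price. The function name `find_unexpected_prices` and the tolerance are my own choices for illustration, not part of OpenCost; a GPU cost of 0 is accepted because nodes without GPUs report 0.

```python
import re

# Matches lines such as: node_cpu_hourly_cost{...,node="a40-102",...} 0.000828
LINE_RE = re.compile(r'^(node_(?:cpu|gpu|ram)_hourly_cost)\{([^}]*)\}\s+(\S+)$')
# Extracts the value of the `node` label (and not, e.g., a label ending in "node").
NODE_RE = re.compile(r'(?:^|,)node="([^"]*)"')


def find_unexpected_prices(metrics_text, expected=1.0, tol=1e-9):
    """Return (metric, node, value) tuples whose value differs from `expected`.

    Hypothetical helper: a GPU cost of exactly 0 is treated as "no GPU" and skipped.
    """
    unexpected = []
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        metric, labels, raw_value = match.groups()
        node_match = NODE_RE.search(labels)
        node = node_match.group(1) if node_match else "<unknown>"
        value = float(raw_value)
        if metric == "node_gpu_hourly_cost" and value == 0.0:
            continue
        if abs(value - expected) > tol:
            unexpected.append((metric, node, value))
    return unexpected


# Four lines copied from the excerpt in the Logs section.
SAMPLE = '''\
node_cpu_hourly_cost{arch="amd64",instance="a40-102",instance_type="",node="a40-102",provider_id="",region=""} 0.000828
node_cpu_hourly_cost{arch="amd64",instance="cpu-101",instance_type="",node="cpu-101",provider_id="",region=""} 1
node_gpu_hourly_cost{arch="amd64",instance="a40-102",instance_type="",node="a40-102",provider_id="",region=""} 0.000828
node_gpu_hourly_cost{arch="amd64",instance="cpu-101",instance_type="",node="cpu-101",provider_id="",region=""} 0
'''

if __name__ == "__main__":
    for metric, node, value in find_unexpected_prices(SAMPLE):
        print(f"{node}: {metric} = {value} (expected 1)")
```

On the sample lines this flags only the two a40-102 samples; cpu-101 passes.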
Logs
Excerpt of the /metrics endpoint: a40-102 has NVIDIA A40 GPUs, a6000-10* nodes have NVIDIA A6000 GPUs, and cpu-* nodes have no GPUs.
node_cpu_hourly_cost{arch="amd64",instance="a40-102",instance_type="",node="a40-102",provider_id="",region=""} 0.000828
node_cpu_hourly_cost{arch="amd64",instance="a6000-101",instance_type="",node="a6000-101",provider_id="",region=""} 0.000926
node_cpu_hourly_cost{arch="amd64",instance="a6000-102",instance_type="",node="a6000-102",provider_id="",region=""} 0.010364
node_cpu_hourly_cost{arch="amd64",instance="a6000-103",instance_type="",node="a6000-103",provider_id="",region=""} 0.010371
node_cpu_hourly_cost{arch="amd64",instance="cpu-101",instance_type="",node="cpu-101",provider_id="",region=""} 1
node_cpu_hourly_cost{arch="amd64",instance="cpu-102",instance_type="",node="cpu-102",provider_id="",region=""} 1
node_cpu_hourly_cost{arch="amd64",instance="cpu-201",instance_type="",node="cpu-201",provider_id="",region=""} 1
# ...
node_gpu_hourly_cost{arch="amd64",instance="a40-102",instance_type="",node="a40-102",provider_id="",region=""} 0.000828
node_gpu_hourly_cost{arch="amd64",instance="a6000-101",instance_type="",node="a6000-101",provider_id="",region=""} 0.000926
node_gpu_hourly_cost{arch="amd64",instance="a6000-102",instance_type="",node="a6000-102",provider_id="",region=""} 0.010364
node_gpu_hourly_cost{arch="amd64",instance="a6000-103",instance_type="",node="a6000-103",provider_id="",region=""} 0.010371
node_gpu_hourly_cost{arch="amd64",instance="cpu-101",instance_type="",node="cpu-101",provider_id="",region=""} 0
node_gpu_hourly_cost{arch="amd64",instance="cpu-102",instance_type="",node="cpu-102",provider_id="",region=""} 0
node_gpu_hourly_cost{arch="amd64",instance="cpu-201",instance_type="",node="cpu-201",provider_id="",region=""} 0
# ...
node_ram_hourly_cost{arch="amd64",instance="a40-102",instance_type="",node="a40-102",provider_id="",region=""} 0.000828
node_ram_hourly_cost{arch="amd64",instance="a6000-101",instance_type="",node="a6000-101",provider_id="",region=""} 0.000926
node_ram_hourly_cost{arch="amd64",instance="a6000-102",instance_type="",node="a6000-102",provider_id="",region=""} 0.010364
node_ram_hourly_cost{arch="amd64",instance="a6000-103",instance_type="",node="a6000-103",provider_id="",region=""} 0.010371
node_ram_hourly_cost{arch="amd64",instance="cpu-101",instance_type="",node="cpu-101",provider_id="",region=""} 1
node_ram_hourly_cost{arch="amd64",instance="cpu-102",instance_type="",node="cpu-102",provider_id="",region=""} 1
node_ram_hourly_cost{arch="amd64",instance="cpu-201",instance_type="",node="cpu-201",provider_id="",region=""} 1
# ...
node_total_hourly_cost{arch="amd64",instance="a40-102",instance_type="",node="a40-102",provider_id="",region=""} 0.9999761281127929
node_total_hourly_cost{arch="amd64",instance="a6000-101",instance_type="",node="a6000-101",provider_id="",region=""} 0.9998497504730224
node_total_hourly_cost{arch="amd64",instance="a6000-102",instance_type="",node="a6000-102",provider_id="",region=""} 0.9999910652618408
node_total_hourly_cost{arch="amd64",instance="a6000-103",instance_type="",node="a6000-103",provider_id="",region=""} 1.000016268951416
node_total_hourly_cost{arch="amd64",instance="cpu-101",instance_type="",node="cpu-101",provider_id="",region=""} 149.28325271606445
node_total_hourly_cost{arch="amd64",instance="cpu-102",instance_type="",node="cpu-102",provider_id="",region=""} 149.28325271606445
node_total_hourly_cost{arch="amd64",instance="cpu-201",instance_type="",node="cpu-201",provider_id="",region=""} 312.07775115966797
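One way to read these numbers: on every GPU node the CPU, GPU, and RAM unit prices are identical and the total is almost exactly 1, while on CPU-only nodes the unit prices are 1 and the total scales with node size. The sketch below divides each total by its unit price to get the number of "billable units" the total would correspond to. This rests on my assumption that the total is a sum of quantity × unit price over CPU, RAM, and GPU; I have not confirmed that against the OpenCost source, and `implied_units` is a made-up helper name.

```python
# (unit price, node_total_hourly_cost) pairs copied from the excerpt above.
# On GPU nodes the CPU, GPU and RAM unit prices are identical, so one value is listed.
OBSERVED = {
    "a40-102": (0.000828, 0.9999761281127929),
    "a6000-101": (0.000926, 0.9998497504730224),
    "a6000-102": (0.010364, 0.9999910652618408),
    "a6000-103": (0.010371, 1.000016268951416),
    "cpu-101": (1.0, 149.28325271606445),
}


def implied_units(unit_price, total):
    """Units implied by `total`, ASSUMING total = unit_price * units."""
    return total / unit_price


def total_is_pinned_to_one(total, tol=1e-3):
    """True if the node total is (almost) exactly the configured price of 1."""
    return abs(total - 1.0) < tol


if __name__ == "__main__":
    for node, (price, total) in OBSERVED.items():
        units = implied_units(price, total)
        print(f"{node}: total={total:.6f} pinned_to_1={total_is_pinned_to_one(total)} "
              f"implied_units={units:.2f}")
```

If that assumption holds, the GPU nodes look like the configured price of 1 was applied to the whole node and then split across its resources, which matches the behavior described in the report.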
Prometheus extra scrape configs:
- job_name: opencost
  honor_labels: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  dns_sd_configs:
    - names:
        - opencost.opencost
      type: 'A'
      port: 9003
- job_name: node-exporter
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names:
          - monitoring  # change to the namespace where node-exporter is deployed
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
      action: keep
      regex: prometheus-node-exporter  # adjust to match the node-exporter label
    - source_labels: [__meta_kubernetes_pod_ip]
      target_label: __address__
      replacement: $1:9100  # port exposed by node-exporter
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
Custom pricing model ConfigMap:
Name: custom-pricing-model
Namespace: opencost
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: opencost
meta.helm.sh/release-namespace: opencost
Data
====
default.json:
----
{
"CPU": "1",
"GPU": "1",
"RAM": "1",
"description": "Modified pricing configuration.",
"internetNetworkEgress": "1",
"regionNetworkEgress": "1",
"spotCPU": "1",
"spotRAM": "1",
"storage": "1",
"zoneNetworkEgress": "1",
"provider" : "custom"
}
BinaryData
====
Events: <none>
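To rule out a typo in the pricing model itself, the following sketch parses the default.json payload and reports any price field that is not 1. The key list is taken from the ConfigMap above; `non_matching_prices` is a hypothetical helper, and nothing here talks to OpenCost or the cluster.

```python
import json

# Price fields as they appear in the ConfigMap above.
PRICE_KEYS = (
    "CPU", "GPU", "RAM", "spotCPU", "spotRAM", "storage",
    "internetNetworkEgress", "regionNetworkEgress", "zoneNetworkEgress",
)

# The default.json payload from the ConfigMap above.
DEFAULT_JSON = '''
{
  "CPU": "1",
  "GPU": "1",
  "RAM": "1",
  "description": "Modified pricing configuration.",
  "internetNetworkEgress": "1",
  "regionNetworkEgress": "1",
  "spotCPU": "1",
  "spotRAM": "1",
  "storage": "1",
  "zoneNetworkEgress": "1",
  "provider" : "custom"
}
'''


def non_matching_prices(default_json, expected=1.0):
    """Return {key: raw value} for price fields that are missing, non-numeric, or != expected."""
    model = json.loads(default_json)
    bad = {}
    for key in PRICE_KEYS:
        raw = model.get(key)
        try:
            value = float(raw)
        except (TypeError, ValueError):
            bad[key] = raw  # missing or not a number
            continue
        if value != expected:
            bad[key] = raw
    return bad


if __name__ == "__main__":
    print(non_matching_prices(DEFAULT_JSON))
```

For the model shown above this returns an empty dict, so the configuration itself sets every price to 1.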
Which version of OpenCost are you using?
1.114.0, 1.115.0 (the issue persists after the upgrade)
I wonder if it might have anything to do with the changes in pkg/cloud/provider/customprovider.go.
I don't understand why an instance of CustomProvider (assuming that is the class used for a Custom Pricing Model) should return an empty string for GPU prices, unless it is an unrelated class.
~~When setting gpuLabel and gpuLabelValue and so labeling a node, the GPU price is respected (0.95) but the CPU and RAM prices are still computed according to ratios, instead of being the values set in the custom pricing model.~~
There was an error in the configuration. Disregard.
However, without explicitly labeling the GPU nodes, the issue persists.
This behavior is also present in 1.118.0
Hey folks, the issue still persists and blocks the use of custom pricing with GPU nodes. Is there any estimate for when this behaviour will be fixed?