
How to query the validation result using an API?

Open chenditc opened this issue 1 year ago • 1 comments

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Mariner 2.0
  • Kernel Version: 6.5.1
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: v24.3.0

2. Issue or feature description

By checking the code and the pod status, I can see that the nvidia-operator-validator finishes its init container and completes validation. But the result does not seem to be reported anywhere, such as in a node label.

I have an operator that deploys machine learning models. I want to check the GPU validation result before I deploy a model, so that I can ensure there is sufficient hardware instead of hanging in the "FailedToSchedule" state or failing at runtime.

Is there a way to query the validation result? I am thinking of using a node label or some API like /metrics.

chenditc avatar Jun 26 '24 04:06 chenditc

@chenditc one option is to enable the nvidia-node-status-exporter pod by setting nodeStatusExporter.enabled=true in the ClusterPolicy. This component exposes the following metrics: https://github.com/NVIDIA/gpu-operator/blob/main/validator/metrics.go#L50-L68
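As a rough sketch of how a deploying operator might consume those metrics: the snippet below parses Prometheus text-format output and checks that a set of readiness gauges all report 1. It assumes the exporter serves the standard Prometheus text format. The metric names in `SAMPLE` are illustrative placeholders, not confirmed names, so check the linked metrics.go for the real ones. How the text is fetched (pod address, port, path) is left to the caller, since none of that is stated in this thread.

```python
import re

# One sample line in the Prometheus text format: name, optional {labels}, value.
# Simplification: label values containing "}" are not handled.
_SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_gauges(text):
    """Return {metric_name: [values, ...]} parsed from Prometheus text output."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group(3))
        except ValueError:
            continue  # ignore lines whose value is not numeric
        result.setdefault(match.group(1), []).append(value)
    return result


def validation_passed(text, required):
    """True only if every required metric is present and all its samples equal 1."""
    gauges = parse_gauges(text)
    return all(
        bool(gauges.get(name)) and all(v == 1.0 for v in gauges[name])
        for name in required
    )


# Hypothetical exporter output; metric names here are assumptions for illustration.
SAMPLE = """\
# TYPE gpu_operator_node_driver_ready gauge
gpu_operator_node_driver_ready 1
# TYPE gpu_operator_node_toolkit_ready gauge
gpu_operator_node_toolkit_ready 0
"""

if __name__ == "__main__":
    print(validation_passed(SAMPLE, ["gpu_operator_node_driver_ready"]))
    print(validation_passed(SAMPLE, ["gpu_operator_node_toolkit_ready"]))
```

A missing metric is treated as a failure here, on the assumption that an operator gating deployments would rather wait than schedule onto a node whose status is unknown.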

cdesiniotis avatar Jul 12 '24 21:07 cdesiniotis

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 04 '25 22:11 github-actions[bot]

I'm closing this one as some information was provided. Please create a new issue

rajathagasthya avatar Nov 12 '25 23:11 rajathagasthya