
RuntimeClass apiVersion not right; I am using the latest gpu-operator from the master branch

Open 13567436138 opened this issue 1 year ago • 4 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

(screenshot attached)

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04):
  • Kernel Version:
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker):
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS):
  • GPU Operator Version:

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

4. Information to attach (optional if deemed irrelevant)

  • [ ] Kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • [ ] Kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • [ ] If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • [ ] If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • [ ] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • [ ] containerd logs: journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

13567436138 avatar Jul 21 '24 09:07 13567436138

Please provide the K8s environment info, e.g.:

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04):
  • Kernel Version:
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker):
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS):
  • GPU Operator Version:

lengrongfu avatar Jul 28 '24 02:07 lengrongfu

I am also seeing this as of the current state of the main branch.

  • OS/Version: Amazon Linux 2 GPU enabled
  • Kernel Version: 5.10.223-211.872.amzn2.x86_64
  • Container Runtime Type/Version: containerd 1.7.20
  • K8s Flavor/Version: EKS v1.29.6-eks-1552ad0
  • GPU Operator Version: main as of 22 August 2024

Log snippet from the gpu-operator container with the error shown below.

1.7243301748958206e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
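
For reference, RuntimeClass was promoted to node.k8s.io/v1 in Kubernetes 1.20 and the v1beta1 API is no longer served as of 1.25, so a 1.29 cluster will reject the v1beta1 create. A quick way to confirm which versions the cluster actually serves:

kubectl api-versions | grep node.k8s.io
kubectl explain runtimeclass | head -n 5   # shows the served group/version for RuntimeClass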

chipzoller avatar Aug 22 '24 12:08 chipzoller

@13567436138 what helm chart are you deploying?

@chipzoller the helm chart on main has appVersion set to devel-ubi8, which causes a very old (and unmaintained) tag of gpu-operator to be pulled. Can you use a released helm chart from https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator?
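
Roughly, installing from the released chart repo looks like this (the release name and namespace here are just examples; adjust to your setup):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace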

cdesiniotis avatar Aug 22 '24 20:08 cdesiniotis

I ended up just changing the version info in the chart metadata file to reflect the latest chart release.
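
For anyone hitting the same thing, a rough sketch of that edit (assuming the chart lives at deployments/gpu-operator in the repo checkout and picking a released tag such as v24.6.0; substitute whatever the latest release is):

# replace the devel-ubi8 appVersion in the chart metadata with a released operator tag
sed -i 's/^appVersion:.*/appVersion: v24.6.0/' deployments/gpu-operator/Chart.yaml
# then install/upgrade from the local chart
helm upgrade --install gpu-operator ./deployments/gpu-operator -n gpu-operator --create-namespace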

chipzoller avatar Aug 23 '24 15:08 chipzoller

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 04 '25 22:11 github-actions[bot]

Closing this based on the response provided here.

shivamerla avatar Nov 14 '25 06:11 shivamerla