bug: operator anti-pattern, validator pod deployments cause `CrashLoopBackOff` behaviour

Open justinthelaw opened this issue 1 year ago • 2 comments

HOST INFORMATION

  1. OS and Architecture: Ubuntu 22.04, amd64
  2. Kubernetes Distribution: K3s, K3d, RKE2
  3. Kubernetes Version: v1.30.4
  4. Host Node GPUs: NVIDIA RTX 4090 and 4070

DESCRIPTION

The NVIDIA GPU Operator validator contains hard-coded deployments of the CUDA validation and plugin validation pods within the gpu-operator-validator daemonset's container. There is no way to influence how these pods are deployed via the values file, nor is there an easy way to manipulate the workload pods via post-deploy actions (e.g., kubectl delete) without causing the validation daemonset to fail.

This is a Kubernetes Operator anti-pattern for these reasons:

  1. Declarative Mismatch: Hardcoding breaks Kubernetes’ declarative model, reducing flexibility and forcing redeployments for changes.
  2. Reduced Flexibility: Users can’t easily customize pods without modifying the operator itself.
  3. Operator Role: Operators should automate operational knowledge, not act as static YAML deployment tools.
  4. Maintenance Complexity: Embedded manifests complicate testing, maintenance, and reusability.

PROBLEM STATEMENT

This anti-pattern also led to issues in our secure runtime stack. Our service mesh, Istio, must be used to secure ingress/egress with NetworkPolicies defined via internal CRs, adding another layer of defense to all services within the cluster. There are no exceptions to this rule: all namespaces must have Istio injection enabled, with explicit and justified pod-level exclusions (e.g., certain Jobs).
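
For illustration, this policy amounts to a namespace-level injection label plus explicit pod-level opt-outs, roughly as in the following sketch (namespace, pod, and image names here are hypothetical; the pod-level exclusion label is the same one used in the modified validation manifest further below):

apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
  labels:
    # namespace-wide Istio sidecar injection
    istio-injection: enabled
---
apiVersion: v1
kind: Pod
metadata:
  name: example-excluded-job # hypothetical pod with a justified opt-out
  namespace: gpu-operator
  labels:
    # explicit, justified pod-level exclusion from sidecar injection
    sidecar.istio.io/inject: "false"
spec:
  containers:
    - name: main
      image: registry.example.com/example-job:latest # hypothetical image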

To this end, we had to modify the existing gpu-operator-validator Dockerfile and validation workload pod manifests to explicitly (at the pod level), rather than broadly (at the namespace level), exclude sidecar injection in the validation pods; otherwise, they would hang indefinitely. There was no way to do a post-deployment patch to end the validation pods so that the deployment would continue. Our efforts, e.g., post-deploy CronJobs, led to the gpu-operator-validator daemonset either going into a pseudo-CrashLoopBackOff, where it would re-deploy the validation workload pods every 5 minutes, or going into an actual CrashLoopBackOff, halting the overall deployment altogether.
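
For reference, such a post-deploy CronJob workaround might have looked roughly like the hypothetical sketch below, which periodically deletes the stuck validation workload pods; the name, schedule, image, service account, and the plugin validator's label selector are assumptions, not verified values:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: validator-cleanup # hypothetical name
  namespace: gpu-operator
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            # the cleanup job itself must also skip sidecar injection
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccountName: validator-cleanup # hypothetical SA with pod-delete RBAC
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest # hypothetical cleanup image
              command: ["/bin/sh", "-c"]
              args:
                - >-
                  kubectl delete pod -n gpu-operator -l app=nvidia-cuda-validator --ignore-not-found &&
                  kubectl delete pod -n gpu-operator -l app=nvidia-device-plugin-validator --ignore-not-found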

Ultimately, the modifications described above were successful: the validations run and complete, and the NVIDIA GPU Operator deployment finishes without further issues, all while keeping Istio injection enabled in the gpu-operator namespace.

As a bonus issue, resource requests and limits are applied inconsistently across the initContainers and containers of both validation workload pods.

RECOMMENDATIONS

  1. Use CRDs, ConfigMaps, or plain Helm chart templates to drive resource creation dynamically, for flexibility and separation of concerns.
  2. Modify the existing, hard-coded manifests for the validation pods to allow lower-level templating of the metadata, security context, and resources, among other fields (see the hypothetical values sketch below).
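
As a rough illustration of these recommendations, a hypothetical values.yaml extension for the gpu-operator chart might expose the workload pod templating as follows (none of these keys exist in the current chart; they only sketch the desired shape):

validator:
  workloadPods:
    cuda:
      metadata:
        labels:
          sidecar.istio.io/inject: "false"
      resources:
        requests:
          cpu: 50m
          memory: 32Mi
        limits:
          cpu: 100m
          memory: 64Mi
    plugin:
      metadata:
        labels:
          sidecar.istio.io/inject: "false"

The operator (or validator container) would then render the workload pod manifests from these values instead of shipping them baked into the image.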

ADDITIONAL CONTEXT

Please note that it may also be possible to mount a ConfigMap containing these particular hard-coded manifests into the daemonset via a post-deployment patch.
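
A minimal sketch of that approach, assuming the daemonset and its main container are both named nvidia-operator-validator (adjust to the actual names in your deployment), would be a ConfigMap holding the customized manifests plus a strategic merge patch that mounts it over /var/nvidia/manifests:

apiVersion: v1
kind: ConfigMap
metadata:
  name: validator-workload-manifests # hypothetical name
  namespace: gpu-operator
data:
  cuda-workload-validation.yaml: |
    # ... customized CUDA validation pod manifest (see the example below) ...
  plugin-workload-validation.yaml: |
    # ... customized plugin validation pod manifest ...

# validator-patch.yaml, applied with e.g.:
#   kubectl -n gpu-operator patch daemonset nvidia-operator-validator --patch-file validator-patch.yaml
spec:
  template:
    spec:
      volumes:
        - name: workload-manifests
          configMap:
            name: validator-workload-manifests
      containers:
        - name: nvidia-operator-validator
          volumeMounts:
            - name: workload-manifests
              mountPath: /var/nvidia/manifests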

Modified Dockerfile

ARG OPERATOR_VALIDATOR_IMAGE="nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2"

FROM $OPERATOR_VALIDATOR_IMAGE

# remove the hard-coded validation workload manifests shipped in the upstream image
RUN rm -rf /var/nvidia/manifests/cuda-workload-validation.yaml /var/nvidia/manifests/plugin-workload-validation.yaml

# replace them with manifests that add the pod-level Istio exclusion and explicit resources
COPY ./src/validator-image/manifests/cuda-workload-validation.yaml /var/nvidia/manifests
COPY ./src/validator-image/manifests/plugin-workload-validation.yaml /var/nvidia/manifests

ENTRYPOINT ["/bin/bash"]
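
After building and pushing the modified image, the operator can be pointed at it through the chart's validator image values (key names per the gpu-operator Helm chart; the registry and tag below are hypothetical):

# docker build -t registry.example.com/gpu-operator-validator:v24.6.2-istio .
# docker push registry.example.com/gpu-operator-validator:v24.6.2-istio

validator:
  repository: registry.example.com
  image: gpu-operator-validator
  version: v24.6.2-istio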

Example CUDA validation workload manifest

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: nvidia-cuda-validator
    sidecar.istio.io/inject: "false" # added line of most importance, other additions are secondary
  generateName: nvidia-cuda-validator-
  namespace: "FILLED_BY_THE_VALIDATOR"
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  restartPolicy: OnFailure
  serviceAccountName: nvidia-operator-validator
  initContainers:
    - name: cuda-validation
      image: "FILLED_BY_THE_VALIDATOR"
      imagePullPolicy: IfNotPresent
      command: ["sh", "-c"]
      args: ["vectorAdd"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      securityContext:
        privileged: true
      # add resources and limits
      resources:
        requests:
          cpu: 50m
          memory: 32Mi
        limits:
          cpu: 100m
          memory: 64Mi
  containers:
    - name: nvidia-cuda-validator
      image: "FILLED_BY_THE_VALIDATOR"
      imagePullPolicy: IfNotPresent
      # override command and args as validation is already done by initContainer
      command: ["sh", "-c"]
      args: ["echo cuda workload validation is successful"]
      securityContext:
        privileged: true
        readOnlyRootFilesystem: true
      # add resources and limits
      resources:
        requests:
          cpu: 50m
          memory: 32Mi
        limits:
          cpu: 100m
          memory: 64Mi

justinthelaw, Nov 13 '24 15:11

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot], Nov 04 '25 22:11

Hi @justinthelaw! Thanks for the detailed report. We'll look into the hard-coded validation pod manifests and explore options for making them more configurable to support service mesh integration scenarios like yours.

karthikvetrivel, Nov 24 '25 20:11