ClusterPolicy fails with "missing required field defaultRuntime" when installed via ArgoCD

Open taejune opened this issue 2 weeks ago • 3 comments

Describe the bug

When installing GPU Operator via ArgoCD, the ClusterPolicy creation fails with the following error:

ClusterPolicy.spec.operator missing required field "defaultRuntime"

However, the same chart installs successfully when using helm install directly.

Environment

  • GPU Operator Version: [e.g., v25.3.4]
  • Kubernetes Version: [e.g., v1.33.5]
  • Installation Method: ArgoCD (Helm chart)
  • Container Runtime: containerd

Root Cause Analysis

1. Missing Template Rendering

The Helm chart template (templates/clusterpolicy.yaml) does not render the defaultRuntime field from values:

# Current template
spec:
  operator:
    {{- if .Values.operator.runtimeClass }}
    runtimeClass: {{ .Values.operator.runtimeClass }}
    {{- end }}
    {{- if .Values.operator.defaultGPUMode }}
    defaultGPUMode: {{ .Values.operator.defaultGPUMode }}
    {{- end }}
    # ❌ No defaultRuntime rendering!

Verification:

helm template gpu-operator nvidia/gpu-operator --version v24.9.0 | grep -A 20 "kind: ClusterPolicy"
# Result: No defaultRuntime field in the rendered manifest
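
A somewhat stricter variant of the check above looks only inside the ClusterPolicy document instead of a fixed number of lines after it. This is a rough sketch: the helper name check_default_runtime is made up for illustration, and it assumes Helm separates rendered documents with "---" lines.

```shell
# Hypothetical helper: reads a multi-document YAML stream on stdin and
# exits 0 only if a ClusterPolicy document contains a defaultRuntime key.
check_default_runtime() {
  awk '
    # A "---" line closes the current document: record the result, reset flags.
    /^---/ { if (is_cp && found) ok = 1; is_cp = 0; found = 0; next }
    /^kind:[[:space:]]*ClusterPolicy[[:space:]]*$/ { is_cp = 1 }
    /^[[:space:]]+defaultRuntime:/ { found = 1 }
    # The last document has no trailing separator, so check it here.
    END { if (is_cp && found) ok = 1; exit (ok ? 0 : 1) }
  '
}
```

Usage, with the same chart reference as above: helm template gpu-operator nvidia/gpu-operator --version v24.9.0 | check_default_runtime && echo "rendered" || echo "missing"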

2. CRD Schema

The CRD defines defaultRuntime with a default value:

defaultRuntime:
  type: string
  default: docker
  enum:
    - docker
    - crio
    - containerd

And it appears to be required (either explicitly or implicitly through schema validation).
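
Whether the field is really marked as required can be checked against the installed CRD rather than guessed. The sketch below relies on kubectl explain tagging required fields with "-required-" in its field list, which is my understanding of its output format; the helper name field_is_required and the KUBECTL override are made up for illustration.

```shell
# Hypothetical helper: exits 0 if kubectl explain marks the field as required.
# $1 = parent path (e.g. clusterpolicy.spec.operator), $2 = field name.
# KUBECTL can be overridden to point at a different binary or a stub.
field_is_required() {
  path=$1
  field=$2
  "${KUBECTL:-kubectl}" explain "$path" \
    | grep -Eq "^[[:space:]]*${field}[[:space:]].*-required-"
}
```

Usage: field_is_required clusterpolicy.spec.operator defaultRuntime && echo "required" || echo "not marked required"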

3. Why Helm Install Works But ArgoCD Fails

Helm Direct Install (Client-Side Apply):

  1. Helm renders manifest without defaultRuntime
  2. kubectl applies using client-side apply
  3. API Server performs defaulting before/during required validation
  4. Default value docker is applied automatically
  5. ✅ Success

ArgoCD Install (Server-Side Apply):

  1. ArgoCD renders manifest without defaultRuntime
  2. ArgoCD applies using server-side apply (default behavior)
  3. Server-side apply performs stricter validation
  4. Required field check happens before defaulting can occur
  5. ❌ Fails with "missing required field"

This is a known Kubernetes behavior where server-side apply is stricter about required fields than client-side apply.
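
The difference between the two apply modes can be tested directly, independent of Helm and ArgoCD, by sending the same rendered ClusterPolicy through a server-side dry run in each mode. This is only a sketch for confirming or ruling out the explanation above: the helper name compare_apply_modes and the KUBECTL override are made up for illustration, and the manifest file is assumed to contain just the rendered ClusterPolicy.

```shell
# Hypothetical helper: dry-runs one manifest against the API server with
# client-side and server-side apply and reports which mode is accepted.
# --dry-run=server sends the request to the API server without persisting it.
compare_apply_modes() {
  manifest=$1
  kubectl_bin=${KUBECTL:-kubectl}

  if "$kubectl_bin" apply --dry-run=server -f "$manifest" >/dev/null; then
    echo "client-side apply: accepted"
  else
    echo "client-side apply: rejected"
  fi

  if "$kubectl_bin" apply --server-side --dry-run=server -f "$manifest" >/dev/null; then
    echo "server-side apply: accepted"
  else
    echo "server-side apply: rejected"
  fi
}
```

Usage: compare_apply_modes clusterpolicy.yaml (any validation error from the API server is left on stderr so it can be attached to the issue).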

Steps to Reproduce

  1. Install ArgoCD in a cluster
  2. Create an ArgoCD Application:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
spec:
  project: default
  source:
    chart: gpu-operator
    repoURL: https://helm.ngc.nvidia.com/nvidia
    targetRevision: v24.9.0
    helm:
      values: |
        driver:
          enabled: true
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
  3. Sync the application
  4. Observe the error: ClusterPolicy.spec.operator missing required field "defaultRuntime"

Current Workaround

Users must explicitly set the value in ArgoCD Application:

helm:
  values: |
    operator:
      defaultRuntime: containerd

However, this doesn't actually work because the template doesn't render it!

Alternative workaround - disable server-side apply:

syncPolicy:
  syncOptions:
    - ServerSideApply=false
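
Before relying on this workaround, it may be worth confirming whether server-side apply is actually turned on for the Application. The sketch below reads the Application's sync options and looks for ServerSideApply=true; the helper name ssa_enabled and the KUBECTL override are made up for illustration, and it does not cover the per-resource argocd.argoproj.io/sync-options annotation.

```shell
# Hypothetical helper: exits 0 if the Application's syncOptions contain
# ServerSideApply=true. $1 = Application name, $2 = namespace (default argocd).
ssa_enabled() {
  app=$1
  ns=${2:-argocd}
  # jsonpath prints the syncOptions list, e.g. ["CreateNamespace=true"]
  opts=$("${KUBECTL:-kubectl}" -n "$ns" get applications.argoproj.io "$app" \
    -o jsonpath='{.spec.syncPolicy.syncOptions}')
  case "$opts" in
    *ServerSideApply=true*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: ssa_enabled gpu-operator && echo "server-side apply is enabled" || echo "server-side apply is not set on the Application"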

Proposed Solution

Fix 1: Add defaultRuntime to Template (Recommended)

Update templates/clusterpolicy.yaml:

spec:
  operator:
    {{- if .Values.operator.defaultRuntime }}
    defaultRuntime: {{ .Values.operator.defaultRuntime }}
    {{- end }}
    {{- if .Values.operator.runtimeClass }}
    runtimeClass: {{ .Values.operator.runtimeClass }}
    {{- end }}

And ensure values.yaml has a default:

operator:
  defaultRuntime: docker  # or detect from cluster

Fix 2: Remove Required Constraint from CRD

If defaulting should handle this, consider making the field optional in the CRD and relying on the default value.

Fix 3: Add Mutating Webhook

Implement a mutating admission webhook to inject the default value before validation occurs.

Expected Behavior

GPU Operator should install successfully via ArgoCD without requiring users to:

  1. Explicitly set defaultRuntime in values (when template doesn't render it)
  2. Disable server-side apply
  3. Use workarounds

Additional Context

This issue affects all GitOps tools that use server-side apply by default (ArgoCD, Flux, etc.).

The combination of the following creates an incompatibility that only manifests in GitOps scenarios:

  • CRD with required + default fields
  • Helm template not rendering the field
  • Server-side apply's strict validation

Related Issues

  • Similar issues have been reported in the Kubernetes community regarding server-side apply strictness with required+default fields
  • kubernetes/kubernetes#108008
  • kubernetes/kubernetes#99003

Suggested Priority

High - This breaks GPU Operator installation for all ArgoCD/GitOps users, which is a common deployment pattern in production environments.

taejune • Dec 12 '25 08:12

Thanks @taejune for reporting this. We'll have a look into it.

rahulait • Dec 12 '25 15:12

@taejune Can you share the Helm and ArgoCD versions used?

tariq1890 • Dec 17 '25 19:12

@taejune we tried reproducing it with v25.10.1 and the example you shared above, but it's working fine on our end.

We are also seeing that defaultRuntime is set correctly to its default value.

$ k get clusterpolicy cluster-policy -o yaml | grep -B1 defaultRuntime
  operator:
    defaultRuntime: docker

We tested this on ArgoCD v3.2.1.

Can you share more details of your environment where you are hitting this issue?

rahulait • Dec 17 '25 21:12