
Harden the image to support GKE Autopilot by default

Open: Dentrax opened this issue 1 year ago • 1 comment

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Container-Optimized OS from Google
  • Kernel Version: 5.15.109+
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd://1.7.0
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): GKE Autopilot
  • GPU Operator Version: latest

2. Issue or feature description

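gpu-operator fails to deploy on GKE Autopilot clusters. The Helm install is rejected by the GKE Warden admission webhook because several containers mount hostPath volumes outside the allowed /var/log/ prefix: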

W0914 22:23:58.347610    4539 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated DaemonSet gpu-operator/gpu-operator-node-feature-discovery-worker: defaulted unspecified resources for containers [worker] (see http://g.co/gke/autopilot-defaults)
W0914 22:23:58.458291    4539 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment gpu-operator/gpu-operator: adjusted resources to meet requirements for containers [gpu-operator] (see http://g.co/gke/autopilot-resources)
W0914 22:23:58.474912    4539 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment gpu-operator/gpu-operator-node-feature-discovery-master: defaulted unspecified resources for containers [master] (see http://g.co/gke/autopilot-defaults)
Error: INSTALLATION FAILED: 2 errors occurred:
        * admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-no-write-mode-hostpath]":["hostPath volume host-boot used in container worker uses path /boot which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume host-os-release used in container worker uses path /etc/os-release which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume host-sys used in container worker uses path /sys which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume host-usr-lib used in container worker uses path /usr/lib which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume host-lib used in container worker uses path /lib which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume source-d used in container worker uses path /etc/kubernetes/node-feature-discovery/source.d which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume features-d used in container worker uses path /etc/kubernetes/node-feature-discovery/features.d which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/]."]}
Requested by user: 'REDACTED', groups: 'system:authenticated'.
        * admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-no-write-mode-hostpath]":["hostPath volume host-os-release used in container gpu-operator uses path /etc/os-release which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/]."]}
Requested by user: 'REDACTED', groups: 'system:authenticated'.
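
To make the rejection easier to read, here is a minimal sketch of the prefix rule that the Warden message describes. It is not Warden's actual implementation. The function name is_allowed_hostpath is made up for illustration, and the single allowed prefix is taken from the log above.

```shell
#!/usr/bin/env bash
# Illustration only: mimics the prefix rule quoted in the Warden message,
# not Warden's real implementation.

# Assumption: the only allowed prefix is the one named in the log above.
ALLOWED_PREFIX="/var/log/"

# Hypothetical helper: exits 0 when the path starts with the allowed prefix.
is_allowed_hostpath() {
  case "$1" in
    "$ALLOWED_PREFIX"*) return 0 ;;
    *) return 1 ;;
  esac
}

# The hostPath paths named in the violation details above.
for path in /boot /etc/os-release /sys /usr/lib /lib \
    /etc/kubernetes/node-feature-discovery/source.d \
    /etc/kubernetes/node-feature-discovery/features.d; do
  if is_allowed_hostpath "$path"; then
    echo "allowed: $path"
  else
    echo "denied:  $path"
  fi
done
```

Every path from the log falls outside /var/log/, so under this reading each one is reported as denied.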

3. Steps to reproduce the issue

gcloud container clusters create-auto autopilot
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator
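
The hostPath volumes can probably be inspected without touching the cluster by rendering the chart locally. The sketch below is one way to do that. list_hostpaths is a hypothetical helper, and it assumes the rendered YAML puts path: on the line directly after hostPath:, which may not hold for every manifest.

```shell
# Hypothetical helper: reads rendered Kubernetes YAML on stdin and prints
# each hostPath path once, sorted.
# Assumption: "path:" sits on the line right after "hostPath:".
list_hostpaths() {
  awk '
    # Remember that the previous line opened a hostPath block.
    /^[[:space:]]*(- )?hostPath:[[:space:]]*$/ { want = 1; next }
    # Print the path value, stripping the key and any double quotes.
    want && /^[[:space:]]*path:/ {
      sub(/^[[:space:]]*path:[[:space:]]*/, "")
      gsub(/"/, "")
      print
    }
    # Any other line ends the lookahead.
    { want = 0 }
  ' | sort -u
}
```

With the repo added as above, it could be used as: helm template gpu-operator nvidia/gpu-operator -n gpu-operator | list_hostpaths. This only covers manifests that the chart itself renders.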

4. Information to attach (optional if deemed irrelevant)

No information is available because the installation fails before any Pods are created.

Collecting full debug bundle (optional):

-

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

Dentrax, Sep 14 '23 19:09

Any news on this?

bhack, Dec 30 '23 14:12