
Integrate ResourceQuota auto creation within GPU Operator


1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): COS/Ubuntu
  • Kernel Version:
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): GKE
  • GPU Operator Version: Any version

2. Issue or feature description

In practice, we create a new namespace to run the GPU Operator, and the GPU Operator pods run with either the system-node-critical or the system-cluster-critical priority class. On GKE, we need to manually apply the following quota YAML config so that pods in those priority classes can be scheduled in that namespace:

> gpu-operator-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
        - system-node-critical
        - system-cluster-critical

> kubectl apply -n gpu-operator -f gpu-operator-quota.yaml
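
The quota can then be confirmed with:

> kubectl get resourcequota gpu-operator-quota -n gpu-operator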

The goal is to explore auto-creating this config when running on GKE, applying it to the namespace created for the GPU Operator; one possible shape is sketched below.
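
Purely as an illustration of what auto-creation could look like (not an existing chart option): the quota could ship as a conditional template in the gpu-operator Helm chart, rendered into the release namespace and gated behind a hypothetical resourceQuota.enabled flag, with resourceQuota.pods as a hypothetical override for the count.

> templates/resourcequota.yaml (hypothetical)
{{- if .Values.resourceQuota.enabled }}
# Hypothetical sketch: created alongside the operator so that pods using
# the system-node-critical / system-cluster-critical priority classes are
# admitted to the release namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
  namespace: {{ .Release.Namespace }}
spec:
  hard:
    # The count only needs to be large enough for all operator-related pods.
    pods: {{ .Values.resourceQuota.pods | default 100 }}
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
        - system-node-critical
        - system-cluster-critical
{{- end }}

The chart's values.yaml would default resourceQuota.enabled to false, so non-GKE installs would be unaffected.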

Dragoncell avatar Jan 20 '24 00:01 Dragoncell

@Dragoncell what is the rationale behind the count of 100? Is this a general restriction for GKE?

cdesiniotis avatar Jan 24 '24 19:01 cdesiniotis

I think 100 is mentioned in https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html, but I don't think the actual number matters, as long as it is sufficient to run all of the operator-related pods. However, I do think we need to make sure the resource quota is applied, because by default, without a resource quota, pods in those priority classes would fail to start.

/cc

bobbypage avatar Jan 24 '24 21:01 bobbypage