Integrate automatic ResourceQuota creation into the GPU Operator
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): COS/Ubuntu
- Kernel Version:
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): GKE
- GPU Operator Version: Any version
2. Issue or feature description
In practice, we create a new namespace to run the GPU Operator. The GPU Operator pods run under either the system-node-critical or system-cluster-critical PriorityClass. On GKE, we need to manually apply the following quota YAML config:
> gpu-operator-quota.yaml

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
```

```shell
kubectl apply -n gpu-operator -f gpu-operator-quota.yaml
```
The goal is to explore automatic creation of this config when running on GKE, applied to the namespace created for the GPU Operator.
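A minimal sketch of what such auto-creation might look like inside the operator, using client-go. The namespace name ("gpu-operator"), the pod count of 100, and the idea of running this at operator startup are assumptions for illustration, not the operator's actual implementation:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// ensureGPUOperatorQuota creates a ResourceQuota scoped to the system-critical
// priority classes in the given namespace, if it does not already exist.
func ensureGPUOperatorQuota(ctx context.Context, client kubernetes.Interface, namespace string) error {
	quota := &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "gpu-operator-quota", // same name as the manual YAML above
			Namespace: namespace,
		},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				corev1.ResourcePods: resource.MustParse("100"),
			},
			ScopeSelector: &corev1.ScopeSelector{
				MatchExpressions: []corev1.ScopedResourceSelectorRequirement{
					{
						ScopeName: corev1.ResourceQuotaScopePriorityClass,
						Operator:  corev1.ScopeSelectorOpIn,
						Values:    []string{"system-node-critical", "system-cluster-critical"},
					},
				},
			},
		},
	}

	_, err := client.CoreV1().ResourceQuotas(namespace).Create(ctx, quota, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil // a previous run (or the manual step) already created it
	}
	return err
}

func main() {
	// In-cluster config, assuming this runs inside the operator pod.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "gpu-operator" is an assumption; the operator would use its own namespace.
	if err := ensureGPUOperatorQuota(context.Background(), client, "gpu-operator"); err != nil {
		fmt.Println("failed to create ResourceQuota:", err)
	}
}
```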
@Dragoncell what is the rationale behind the count of 100? Is this a general restriction for GKE?
I think 100 is mentioned in https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html, but I don't think the actual number matters (as long as it is sufficient to run all of the operator-related pods). However, I do think we need to make sure the resource quota is applied, because by default, if no resource quota exists, pods in those priority classes would fail to start.
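Since the key requirement is that some quota covering those priority classes exists, a small check like the sketch below (same client-go types as above; the function name and where it would be called from are hypothetical) could let the operator or a preflight step detect whether the namespace is already covered before creating one:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// hasCriticalPriorityQuota reports whether any ResourceQuota in the namespace
// already has a scopeSelector covering the system-critical priority classes,
// i.e. whether the manual step above (or an auto-created quota) is in place.
func hasCriticalPriorityQuota(ctx context.Context, client kubernetes.Interface, namespace string) (bool, error) {
	quotas, err := client.CoreV1().ResourceQuotas(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, q := range quotas.Items {
		if q.Spec.ScopeSelector == nil {
			continue
		}
		for _, expr := range q.Spec.ScopeSelector.MatchExpressions {
			if expr.ScopeName != corev1.ResourceQuotaScopePriorityClass {
				continue
			}
			for _, v := range expr.Values {
				if v == "system-node-critical" || v == "system-cluster-critical" {
					return true, nil
				}
			}
		}
	}
	return false, nil
}
```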
/cc