
Integrate ResourceQuota auto creation within GPU Operator


1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): COS/Ubuntu
  • Kernel Version:
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): GKE
  • GPU Operator Version: Any version

2. Issue or feature description

In practice, we create a new namespace to run the GPU Operator, and the GPU Operator pods run with either the system-node-critical or the system-cluster-critical priority class. On GKE, we need to manually apply the following quota YAML config so that pods in those priority classes can be scheduled in that namespace:

> gpu-operator-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
        - system-node-critical
        - system-cluster-critical

> kubectl apply -n gpu-operator -f gpu-operator-quota.yaml
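
The quota can then be confirmed with:

> kubectl get resourcequota gpu-operator-quota -n gpu-operator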

The goal is to explore auto-creating this config when running on GKE, applying it to the namespace created for the GPU Operator; one possible shape is sketched below.
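
Purely as an illustration of what auto-creation could look like (not an existing chart option): the quota could ship as a conditional template in the gpu-operator Helm chart, rendered into the release namespace and gated behind a hypothetical resourceQuota.enabled flag, with resourceQuota.pods as a hypothetical override for the count.

> templates/resourcequota.yaml (hypothetical)
{{- if .Values.resourceQuota.enabled }}
# Hypothetical sketch: created alongside the operator so that pods using
# the system-node-critical / system-cluster-critical priority classes are
# admitted to the release namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
  namespace: {{ .Release.Namespace }}
spec:
  hard:
    # The count only needs to be large enough for all operator-related pods.
    pods: {{ .Values.resourceQuota.pods | default 100 }}
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
        - system-node-critical
        - system-cluster-critical
{{- end }}

The chart's values.yaml would default resourceQuota.enabled to false, so non-GKE installs would be unaffected.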

Dragoncell avatar Jan 20 '24 00:01 Dragoncell

@Dragoncell what is the rationale behind the count of 100? Is this a general restriction for GKE?

cdesiniotis avatar Jan 24 '24 19:01 cdesiniotis

I think 100 is mentioned in https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html, but I don't think the actual number matters, as long as it is sufficient to run all of the operator-related pods. However, I do think we need to make sure the resource quota is applied, because by default, without a resource quota, pods in those priority classes would fail to start.

/cc

bobbypage avatar Jan 24 '24 21:01 bobbypage