
Allocatable GPU value not correct after configuring time slicing

Open shashiranjan84 opened this issue 1 year ago • 5 comments

Allocatable GPU values are not correct after configuring time slicing. Time-slicing config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 12
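
For reference, the device plugin picks up this config through the ClusterPolicy; a minimal sketch of that step, assuming the default gpu-operator namespace and the default cluster-policy name:

# Point the device plugin at the "any" key of the time-slicing-config ConfigMap
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'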

Capacity:
  cpu:                8
  ephemeral-storage:  209702892Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32386524Ki
  nvidia.com/gpu:     12
  pods:               29
Allocatable:
  cpu:                7910m
  ephemeral-storage:  192188443124
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             31696348Ki
  nvidia.com/gpu:     12
  pods:               29


Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                250m (3%)      0 (0%)
  memory             10310Mi (33%)  10410Mi (33%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     4     
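
As a quick cross-check, the capacity and allocatable counts for the time-sliced resource can be read straight off the node object; a sketch, assuming jq is available and using the hostname from the node labels below:

NODE=ip-172-17-12-22.eu-west-1.compute.internal
kubectl get node "$NODE" -o json \
  | jq '{capacity: .status.capacity["nvidia.com/gpu"], allocatable: .status.allocatable["nvidia.com/gpu"]}'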

Relevant node labels

kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-172-17-12-22.eu-west-1.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=g4dn.2xlarge
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.driver.minor=161
nvidia.com/cuda.driver.rev=07
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=2
nvidia.com/gfd.timestamp=1711077403
nvidia.com/gpu.compute.major=7
nvidia.com/gpu.compute.minor=5
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=true
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=turing
nvidia.com/gpu.machine=g4dn.2xlarge
nvidia.com/gpu.memory=15360
nvidia.com/gpu.present=true
nvidia.com/gpu.product=Tesla-T4-SHARED
nvidia.com/gpu.replicas=12
nvidia.com/mig.capable=false
nvidia.com/mig.strategy=single

Pod status

NAME                                                              READY   STATUS      RESTARTS   AGE
eu-west-1-dd-datadog-9fjwr                                    3/3     Running     0          4h2m
eu-west-1-dd-datadog-cluster-agent-79dbdcdd75-mt97s           1/1     Running     0          4h2m
eu-west-1-dd-datadog-s7527                                    3/3     Running     0          4h1m
eu-west-1-dd-kube-state-metrics-5b7b7bb44-mz4zm               1/1     Running     0          7d5h
eu-west-1-prod-gpuo-node-feature-discovery-gc-5b8848fhhpfw   1/1     Running     0          5h7m
eu-west-1-prod-gpuo-node-feature-discovery-master-58dt88ct   1/1     Running     0          5h7m
eu-west-1-prod-gpuo-node-feature-discovery-worker-r5pb2      1/1     Running     0          5h9m
eu-west-1-prod-gpuo-node-feature-discovery-worker-xgtst      1/1     Running     0          5h9m
gpu-feature-discovery-j72bc                                       2/2     Running     0          21m
gpu-feature-discovery-sqlc6                                       2/2     Running     0          20m
gpu-operator-675d95bdb9-zdhgw                                     1/1     Running     0          5h9m
nvidia-cuda-validator-cxr72                                       0/1     Completed   0          5h7m
nvidia-cuda-validator-pfnvx                                       0/1     Completed   0          5h7m
nvidia-dcgm-exporter-842sq                                        1/1     Running     0          5h7m
nvidia-dcgm-exporter-nvzn8                                        1/1     Running     0          5h7m
nvidia-device-plugin-daemonset-dlj4s                              2/2     Running     0          59m
nvidia-device-plugin-daemonset-xkpqp                              2/2     Running     0          59m
nvidia-operator-validator-ncgkr                                   1/1     Running     0          5h7m
nvidia-operator-validator-zrqd9                                   1/1     Running     0          5h7m

1. Quick Debug Information

Kernel Version: 5.10.210-201.852.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.11
Kubelet Version: v1.29.0-eks-5e0fdde
Kube-Proxy Version: v1.29.0-eks-5e0fdde
GPU Operator Version: v23.9.2

2. Issue or feature description

Allocatable GPU should be 8, since 4 of the 12 time-sliced replicas are already allocated to a running pod.
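
For reference, a rough way to compare the reported allocatable count against the replicas actually requested on this node; a sketch only, assuming jq is available (init containers and terminated pods are ignored):

NODE=ip-172-17-12-22.eu-west-1.compute.internal
ALLOC=$(kubectl get node "$NODE" -o json | jq -r '.status.allocatable["nvidia.com/gpu"]')
USED=$(kubectl get pods -A --field-selector spec.nodeName="$NODE" -o json \
  | jq '[.items[].spec.containers[].resources.requests["nvidia.com/gpu"] // "0" | tonumber] | add // 0')
echo "allocatable=$ALLOC requested=$USED remaining=$((ALLOC - USED))"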

shashiranjan84, Mar 22 '24

@shashiranjan84 in your time-slicing config you have set replicas: 12, hence that many GPUs are reported as allocatable. Not sure why you think it should be 8?

shivamerla, Mar 25 '24

@shashiranjan84 in your time-slicing config you have set replicas: 12, hence that many GPUs are reported as allocatable. Not sure why you think it should be 8?

Shouldn't it be showing how many GPUs are left? Then what is the difference between capacity and allocatable? As 4 GPUs are already allocated, I thought allocatable should be 8, no?

shashiranjan84, Mar 26 '24

@shashiranjan84 sorry, I missed that you are running pods using GPUs. Yes, it should have been reflected. @klueska any thoughts?

Assuming the pods below are the ones using GPUs?

eu-west-1-dd-datadog-9fjwr                                    3/3     Running     0          4h2m
eu-west-1-dd-datadog-cluster-agent-79dbdcdd75-mt97s           1/1     Running     0          4h2m
eu-west-1-dd-datadog-s7527                                    3/3     Running     0          4h1m
eu-west-1-dd-kube-state-metrics-5b7b7bb44-mz4zm               1/1     Running     0          7d5h

shivamerla, Mar 26 '24

Yes, there is one pod running with GPUs, which can be seen in the allocated resources:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                250m (3%)      0 (0%)
  memory             10310Mi (33%)  10410Mi (33%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     4  
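
In case it helps, a small sketch (again assuming jq) that lists which pods on this node actually carry nvidia.com/gpu requests:

NODE=ip-172-17-12-22.eu-west-1.compute.internal
kubectl get pods -A --field-selector spec.nodeName="$NODE" -o json \
  | jq -r '.items[]
      | select([.spec.containers[].resources.requests["nvidia.com/gpu"] // empty] | length > 0)
      | "\(.metadata.namespace)/\(.metadata.name)"'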

shashiranjan84, Mar 27 '24