Allocatable GPU value not correct after configuring time slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 12
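For context, a ConfigMap like this is typically created in the operator namespace and then referenced from the ClusterPolicy so the device plugin picks it up. A minimal sketch, assuming a default install (gpu-operator namespace, ClusterPolicy named cluster-policy) and a hypothetical file name time-slicing-config.yaml:

# create the config in the operator namespace (file name is hypothetical)
kubectl create -n gpu-operator -f time-slicing-config.yaml

# point the ClusterPolicy's device plugin at the "any" key defined above
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'

Once the device plugin pods restart, the node advertises the replicated resource, as in the describe output below: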
Capacity:
cpu: 8
ephemeral-storage: 209702892Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32386524Ki
nvidia.com/gpu: 12
pods: 29
Allocatable:
cpu: 7910m
ephemeral-storage: 192188443124
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 31696348Ki
nvidia.com/gpu: 12
pods: 29
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 250m (3%) 0 (0%)
memory 10310Mi (33%) 10410Mi (33%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 4
Relevant node labels
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-172-17-12-22.eu-west-1.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=g4dn.2xlarge
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.driver.minor=161
nvidia.com/cuda.driver.rev=07
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=2
nvidia.com/gfd.timestamp=1711077403
nvidia.com/gpu.compute.major=7
nvidia.com/gpu.compute.minor=5
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=true
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=turing
nvidia.com/gpu.machine=g4dn.2xlarge
nvidia.com/gpu.memory=15360
nvidia.com/gpu.present=true
nvidia.com/gpu.product=Tesla-T4-SHARED
nvidia.com/gpu.replicas=12
nvidia.com/mig.capable=false
nvidia.com/mig.strategy=single
Pod status
NAME READY STATUS RESTARTS AGE
eu-west-1-dd-datadog-9fjwr 3/3 Running 0 4h2m
eu-west-1-dd-datadog-cluster-agent-79dbdcdd75-mt97s 1/1 Running 0 4h2m
eu-west-1-dd-datadog-s7527 3/3 Running 0 4h1m
eu-west-1-dd-kube-state-metrics-5b7b7bb44-mz4zm 1/1 Running 0 7d5h
eu-west-1-prod-gpuo-node-feature-discovery-gc-5b8848fhhpfw 1/1 Running 0 5h7m
eu-west-1-prod-gpuo-node-feature-discovery-master-58dt88ct 1/1 Running 0 5h7m
eu-west-1-prod-gpuo-node-feature-discovery-worker-r5pb2 1/1 Running 0 5h9m
eu-west-1-prod-gpuo-node-feature-discovery-worker-xgtst 1/1 Running 0 5h9m
gpu-feature-discovery-j72bc 2/2 Running 0 21m
gpu-feature-discovery-sqlc6 2/2 Running 0 20m
gpu-operator-675d95bdb9-zdhgw 1/1 Running 0 5h9m
nvidia-cuda-validator-cxr72 0/1 Completed 0 5h7m
nvidia-cuda-validator-pfnvx 0/1 Completed 0 5h7m
nvidia-dcgm-exporter-842sq 1/1 Running 0 5h7m
nvidia-dcgm-exporter-nvzn8 1/1 Running 0 5h7m
nvidia-device-plugin-daemonset-dlj4s 2/2 Running 0 59m
nvidia-device-plugin-daemonset-xkpqp 2/2 Running 0 59m
nvidia-operator-validator-ncgkr 1/1 Running 0 5h7m
nvidia-operator-validator-zrqd9 1/1 Running 0 5h7m
1. Quick Debug Information
Kernel Version: 5.10.210-201.852.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.11
Kubelet Version: v1.29.0-eks-5e0fdde
Kube-Proxy Version: v1.29.0-eks-5e0fdde
gpu-operator v23.9.2
2. Issue or feature description
Allocatable GPU should be 8
@shashiranjan84 in your time-slicing config you have set replicas: 12, hence that many GPUs are reported as allocatable. Not sure why you think it should be 8?
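With time slicing, the advertised count is the physical GPU count multiplied by the configured replicas (1 x 12 here), which matches the nvidia.com/gpu.count and nvidia.com/gpu.replicas labels shown earlier. Those labels can be read back with something like (node name taken from the labels above):

# show the GFD labels that feed the advertised count
kubectl get node ip-172-17-12-22.eu-west-1.compute.internal \
  -L nvidia.com/gpu.count -L nvidia.com/gpu.replicas -L nvidia.com/gpu.product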
Shouldn't it be showing how many GPUs are left? Then what is the difference between capacity and allocatable? As 4 GPUs are already allocated, I thought allocatable should be 8, no?
@shashiranjan84 sorry, I missed that you are running pods using GPUs. Yes, it should have been reflected. @klueska any thoughts?
Assuming the pods below are the ones using GPUs?
eu-west-1-dd-datadog-9fjwr 3/3 Running 0 4h2m
eu-west-1-dd-datadog-cluster-agent-79dbdcdd75-mt97s 1/1 Running 0 4h2m
eu-west-1-dd-datadog-s7527 3/3 Running 0 4h1m
eu-west-1-dd-kube-state-metrics-5b7b7bb44-mz4zm 1/1 Running 0 7d5h
Yes, there is one pod running with GPUs, which can be seen under allocated resources:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 250m (3%) 0 (0%)
memory 10310Mi (33%) 10410Mi (33%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 4
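A quick way to compare the advertised allocatable count with the GPU requests currently scheduled on the node, assuming the node name from the labels above:

# advertised allocatable count for the time-sliced resource
kubectl get node ip-172-17-12-22.eu-west-1.compute.internal \
  -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"

# GPU requests currently scheduled on the node (the 4 shown under Allocated resources)
kubectl describe node ip-172-17-12-22.eu-west-1.compute.internal \
  | grep -A 12 "Allocated resources" | grep "nvidia.com/gpu"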