gpu-manager
Nvidia node mismatch for pod, pick up:/dev/nvidia6 predicate: /dev/nvidia1, which is unexpected.
I got a similar problem when I created a pod, similar to issue 18. Please help analyze.
Warning UnexpectedAdmissionError 16m kubelet, ai-1080ti-62 Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6 predicate: /dev/nvidia1, which is unexpected.
- test3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test3
  namespace: danlu-efficiency
spec:
  restartPolicy: Never
  schedulerName: gpu-admission
  containers:
    - image: danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
      name: test3
      command:
        - /bin/bash
        - -c
        - sleep 100000000
      resources:
        requests:
          tencent.com/vcuda-core: 10
          tencent.com/vcuda-memory: 40
        limits:
          tencent.com/vcuda-core: 10
          tencent.com/vcuda-memory: 40
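For reference: based on the node capacity shown further down (800 tencent.com/vcuda-core and 349 tencent.com/vcuda-memory for 8 cards of ~11178 MiB each), vcuda-core appears to be in units of 1/100 of a card and vcuda-memory in units of 256 MiB. A tiny sketch of that conversion (my own arithmetic for illustration, not project code):

```go
package main

import "fmt"

// Rough conversion of the tencent.com/vcuda-* request in test3.yaml into a
// physical GPU share, assuming vcuda-core is 1/100 of a card and vcuda-memory
// is 256 MiB per unit (consistent with the node capacity shown below).
func main() {
	vcudaCore := 10   // from test3.yaml
	vcudaMemory := 40 // from test3.yaml

	gpuFraction := float64(vcudaCore) / 100.0
	memoryMiB := vcudaMemory * 256

	fmt.Printf("requests %.0f%% of one card and %d MiB of GPU memory\n",
		gpuFraction*100, memoryMiB)
}
```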
- kubectl describe pods test3 -n danlu-efficiency
Name: test3
Namespace: danlu-efficiency
Priority: 0
PriorityClassName: <none>
Node: ai-1080ti-62/
Start Time: Wed, 15 Jul 2020 14:54:42 +0800
Labels: <none>
Annotations: tencent.com/gpu-assigned: false
tencent.com/predicate-gpu-idx-0: 1
tencent.com/predicate-node: ai-1080ti-62
tencent.com/predicate-time: 1594796082180123795
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6 predicate: /dev/nvidia1, which is unexpected.
IP:
Containers:
test3:
Image: danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
sleep 100000000
Limits:
tencent.com/vcuda-core: 10
tencent.com/vcuda-memory: 40
Requests:
tencent.com/vcuda-core: 10
tencent.com/vcuda-memory: 40
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-p6lfp (ro)
Volumes:
default-token-p6lfp:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-p6lfp
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 17m gpu-admission Successfully assigned danlu-efficiency/test3 to ai-1080ti-62
Warning FailedScheduling 17m gpu-admission pod test3 had been predicated!
Warning UnexpectedAdmissionError 17m kubelet, ai-1080ti-62 Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6 predicate: /dev/nvidia1, which is unexpected.
- The information of the ai-1080ti-62 node
Name: ai-1080ti-62
Roles: nvidia418
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
hardware=NVIDIAGPU
hardware-type=NVIDIAGPU
kubernetes.io/hostname=ai-1080ti-62
node-role.kubernetes.io/nvidia418=nvidia418
nvidia-device-enable=enable
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.90.1.131/24
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 29 May 2019 18:02:54 +0800
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 15 Jul 2020 15:14:58 +0800 Wed, 15 Jul 2020 11:30:46 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 15 Jul 2020 15:14:58 +0800 Wed, 15 Jul 2020 11:30:46 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 15 Jul 2020 15:14:58 +0800 Wed, 15 Jul 2020 11:30:46 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 15 Jul 2020 15:14:58 +0800 Wed, 15 Jul 2020 11:30:46 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.90.1.131
Hostname: ai-1080ti-62
Capacity:
cpu: 56
ephemeral-storage: 1152148172Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 264029984Ki
nvidia.com/gpu: 8
pods: 110
tencent.com/vcuda-core: 800
tencent.com/vcuda-memory: 349
Allocatable:
cpu: 53
ephemeral-storage: 1040344917078
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 251344672Ki
nvidia.com/gpu: 8
pods: 110
tencent.com/vcuda-core: 800
tencent.com/vcuda-memory: 349
System Info:
Machine ID: bf90cb25500346cb8178be49909651e4
System UUID: 00000000-0000-0000-0000-ac1f6b93483c
Boot ID: 97927469-0e92-4816-880c-243a64ef293a
Kernel Version: 4.19.0-0.bpo.8-amd64
OS Image: Debian GNU/Linux 9 (stretch)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.2
Kubelet Version: v1.13.5
Kube-Proxy Version: v1.13.5
PodCIDR: 192.168.20.0/24
Non-terminated Pods: (58 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
......
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 51210m (96%) 97100m (183%)
memory 105732569856 (41%) 250822036Ki (99%)
ephemeral-storage 0 (0%) 0 (0%)
nvidia.com/gpu 8 8
tencent.com/vcuda-core 60 60
tencent.com/vcuda-memory 30 30
Events: <none>
It's a defensive mechanism in gpu-manager. gpu-admission tries to assign a pod to a single card to avoid fragmentation, but for some reason (pod terminated, pod failed, etc.) the gpu-admission scheduling information may be staler than what gpu-manager knows. gpu-manager validates whether its pick is the same card that gpu-admission predicated; if not, gpu-manager rejects the pod to keep a consistent allocation view.
Besides, your situation may be a different scenario. We're working on a fix.
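A rough sketch of that defensive check (illustrative only, not gpu-manager's actual code; `validatePick` is a made-up name). gpu-admission writes its choice into the `tencent.com/predicate-gpu-idx-<i>` annotation, and gpu-manager compares the card it would pick against it:

```go
package main

import (
	"fmt"
	"strconv"
)

// validatePick rejects an allocation whose picked card does not match the
// card index that gpu-admission recorded in the pod annotations.
func validatePick(annotations map[string]string, pickedIdx int) error {
	predicated, ok := annotations["tencent.com/predicate-gpu-idx-0"]
	if !ok {
		return fmt.Errorf("pod has no predicate annotation")
	}
	predIdx, err := strconv.Atoi(predicated)
	if err != nil {
		return err
	}
	if predIdx != pickedIdx {
		return fmt.Errorf("Nvidia node mismatch, pick up:/dev/nvidia%d predicate: /dev/nvidia%d",
			pickedIdx, predIdx)
	}
	return nil
}

func main() {
	// Values taken from the describe output above: predicate-gpu-idx-0 is 1,
	// while gpu-manager picked /dev/nvidia6.
	ann := map[string]string{"tencent.com/predicate-gpu-idx-0": "1"}
	fmt.Println(validatePick(ann, 6))
}
```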
Today I tried to reproduce the problem. First I created 7 NVIDIA GPU pods, each occupying 1 GPU.
- The NVIDIA GPU Pod description
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvidia-gpu-test-app-time-cost
  namespace: danlu-efficiency
spec:
  replicas: 7
  selector:
    matchLabels:
      app: nvidia-gpu-test-app-time-cost
  template:
    metadata:
      labels:
        app: nvidia-gpu-test-app-time-cost
    spec:
      schedulerName: gpu-admission
      restartPolicy: Always
      containers:
        - name: nvidia-gpu-test-app-time-cost
          image: xxx:gpu-test-app-time-cost
          resources:
            #requests:
            #  tencent.com/vcuda-core: "20"
            #  tencent.com/vcuda-memory: "10"
            limits:
              nvidia.com/gpu: 1
              #tencent.com/vcuda-core: "20"
              #tencent.com/vcuda-memory: "10"
      imagePullSecrets:
        - name: gpu
- GPU information. We can see that there is an idle GPU#4.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 39C P2 54W / 250W | 10661MiB / 11178MiB | 11% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 42C P2 56W / 250W | 10661MiB / 11178MiB | 7% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 24% 43C P2 56W / 250W | 10661MiB / 11178MiB | 11% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 24% 44C P2 55W / 250W | 10661MiB / 11178MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 26C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:86:00.0 Off | N/A |
| 23% 41C P2 54W / 250W | 10661MiB / 11178MiB | 8% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 38C P2 54W / 250W | 10661MiB / 11178MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:8A:00.0 Off | N/A |
| 23% 39C P2 54W / 250W | 10661MiB / 11178MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 28425 C python 10651MiB |
| 1 32035 C python 10651MiB |
| 2 27356 C python 10651MiB |
| 3 30741 C python 10651MiB |
| 5 26997 C python 10651MiB |
| 6 27601 C python 10651MiB |
| 7 31145 C python 10651MiB |
+-----------------------------------------------------------------------------+
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity
GPU0 X PIX PHB PHB SYS SYS SYS SYS 0-13,28-41
GPU1 PIX X PHB PHB SYS SYS SYS SYS 0-13,28-41
GPU2 PHB PHB X PIX SYS SYS SYS SYS 0-13,28-41
GPU3 PHB PHB PIX X SYS SYS SYS SYS 0-13,28-41
GPU4 SYS SYS SYS SYS X PIX PHB PHB 14-27,42-55
GPU5 SYS SYS SYS SYS PIX X PHB PHB 14-27,42-55
GPU6 SYS SYS SYS SYS PHB PHB X PIX 14-27,42-55
GPU7 SYS SYS SYS SYS PHB PHB PIX X 14-27,42-55
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
Then I created 1 Tencent GPU pod, occupying 1/5 of a GPU's cores and 1/4 of its memory. I got the problem again, and this pod keeps cycling between the Pending and UnexpectedAdmissionError states.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tencent-gpu-test-app-time-cost
  namespace: danlu-efficiency
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tencent-gpu-test-app-time-cost
  template:
    metadata:
      labels:
        app: tencent-gpu-test-app-time-cost
    spec:
      schedulerName: gpu-admission
      restartPolicy: Always
      containers:
        - name: tencent-gpu-test-app-time-cost
          image: xxx:gpu-test-app-time-cost
          resources:
            requests:
              tencent.com/vcuda-core: "20"
              tencent.com/vcuda-memory: "10"
            limits:
              tencent.com/vcuda-core: "20"
              tencent.com/vcuda-memory: "10"
      imagePullSecrets:
        - name: gpu
- kubectl describe po ... output
Name: tencent-gpu-test-app-time-cost-7fc956cd5f-2f98c
Namespace: danlu-efficiency
Priority: 0
PriorityClassName: <none>
Node: ai-1080ti-25/
Start Time: Wed, 26 Aug 2020 21:10:56 +0800
Labels: app=tencent-gpu-test-app-time-cost
pod-template-hash=7fc956cd5f
Annotations: tencent.com/gpu-assigned: false
tencent.com/predicate-gpu-idx-0: 0
tencent.com/predicate-node: ai-1080ti-25
tencent.com/predicate-time: 1598447456624461570
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod tencent-gpu-test-app-time-cost-7fc956cd5f-2f98c(tencent-gpu-test-app-time-cost), pick up:/dev/nvidia4 predicate: /dev/nvidia0, which is unexpected.
IP:
Controlled By: ReplicaSet/tencent-gpu-test-app-time-cost-7fc956cd5f
Containers:
tencent-gpu-test-app-time-cost:
Image: xxx:gpu-test-app-time-cost
Port: <none>
Host Port: <none>
Limits:
tencent.com/vcuda-core: 20
tencent.com/vcuda-memory: 10
Requests:
tencent.com/vcuda-core: 20
tencent.com/vcuda-memory: 10
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b78wm (ro)
Volumes:
default-token-b78wm:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-b78wm
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 28s gpu-admission pod tencent-gpu-test-app-time-cost-7fc956cd5f-2f98c had been predicated!
Normal Scheduled 28s gpu-admission Successfully assigned danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-2f98c to ai-1080ti-25
Warning UnexpectedAdmissionError 28s kubelet, ai-1080ti-25 Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod tencent-gpu-test-app-time-cost-7fc956cd5f-2f98c(tencent-gpu-test-app-time-cost), pick up:/dev/nvidia4 predicate: /dev/nvidia0, which is unexpected.
So I am confused about why gpu-manager chooses GPU#4. Shouldn't it choose GPU#0 in terms of resource utilization? Is GPU topology considered here? But why consider topology? This latter test program has nothing to do with the other programs.
gpu-manager only considers a pod's resource specification, even if your GPU card is occupied by some programs. Topology is considered because some programs do peer-to-peer data transfer between GPU cards using NCCL, and the link between two cards affects the data transfer speed.
PS: can you provide the log of the chosen result in your situation?
Hi @mYmNeo, thanks for your quick answer. Actually, in order not to affect the current k8s environment, we created a new scheduler and run gpu-admission as its scheduler extender. Here is its description.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-admission
  namespace: danlu-efficiency
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-cluster-admin
  namespace: danlu-efficiency
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    namespace: danlu-efficiency
    name: gpu-admission
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-admission-config
  namespace: danlu-efficiency
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1alpha1
    kind: KubeSchedulerConfiguration
    schedulerName: gpu-admission
    algorithmSource:
      policy:
        configMap:
          namespace: danlu-efficiency
          name: gpu-admission-policy
    leaderElection:
      leaderElect: true
      lockObjectName: gpu-admission
      lockObjectNamespace: danlu-efficiency
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-admission-policy
  namespace: danlu-efficiency
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "predicates": [
        {"name": "CheckNodeUnschedulable"},
        {"name": "GeneralPredicates"},
        {"name": "HostName"},
        {"name": "PodFitsHostPorts"},
        {"name": "MatchNodeSelector"},
        {"name": "PodFitsResources"},
        {"name": "NoDiskConflict"},
        {"name": "PodToleratesNodeTaints"},
        {"name": "MaxEBSVolumeCount"},
        {"name": "MaxGCEPDVolumeCount"},
        {"name": "MaxAzureDiskVolumeCount"},
        {"name": "CheckVolumeBinding"},
        {"name": "NoVolumeZoneConflict"},
        {"name": "MatchInterPodAffinity"}
      ],
      "priorities": [
        {"name": "EqualPriority", "weight": 1},
        {"name": "MostRequestedPriority", "weight": 1},
        {"name": "RequestedToCapacityRatioPriority", "weight": 1},
        {"name": "SelectorSpreadPriority", "weight": 1},
        {"name": "ServiceSpreadingPriority", "weight": 1},
        {"name": "InterPodAffinityPriority", "weight": 1},
        {"name": "LeastRequestedPriority", "weight": 1},
        {"name": "BalancedResourceAllocation", "weight": 1},
        {"name": "NodePreferAvoidPodsPriority", "weight": 1},
        {"name": "NodeAffinityPriority", "weight": 1},
        {"name": "TaintTolerationPriority", "weight": 1},
        {"name": "ImageLocalityPriority", "weight": 1}
      ],
      "extenders": [
        {
          "urlPrefix": "http://localhost:3456/scheduler",
          "filterVerb": "predicates",
          "enableHttps": false,
          "nodeCacheCapable": false
        }
      ],
      "hardPodAffinitySymmetricWeight": 10
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-admission
  namespace: danlu-efficiency
  labels:
    app: gpu-admission
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-admission
  template:
    metadata:
      labels:
        app: gpu-admission
    spec:
      serviceAccountName: gpu-admission
      volumes:
        - name: gpu-admission-config
          configMap:
            name: gpu-admission-config
      containers:
        - name: gpu-admission-ctr
          image: gcr.io/google_containers/hyperkube:v1.13.4
          imagePullPolicy: IfNotPresent
          args:
            - kube-scheduler
            - --config=/gpu-admission/config.yaml
            - -v=4
          volumeMounts:
            - name: gpu-admission-config
              mountPath: /gpu-admission
        - name: gpu-admission-extender-ctr
          image: xxx:gpu-admission-v0.1
          imagePullPolicy: Always
          livenessProbe:
            httpGet:
              path: /version
              port: 3456
          readinessProbe:
            httpGet:
              path: /version
              port: 3456
          ports:
            - containerPort: 3456
      imagePullSecrets:
        - name: regcred
This is the gpu-admission-ctr container log:
I0827 09:18:30.335120 1 scheduler.go:525] Attempting to schedule pod: danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-9gttf
E0827 09:18:30.338224 1 factory.go:1519] Error scheduling danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-9gttf: pod tencent-gpu-test-app-time-cost-7fc956cd5f-9gttf had been predicated!; retrying
I0827 09:18:30.338289 1 factory.go:1613] Updating pod condition for danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-9gttf to (PodScheduled==False)
E0827 09:18:30.339964 1 scheduler.go:546] error selecting node for pod: pod tencent-gpu-test-app-time-cost-7fc956cd5f-9gttf had been predicated!
I0827 09:18:31.465811 1 reflector.go:357] k8s.io/client-go/informers/factory.go:132: Watch close - *v1.ReplicaSet total 114 items received
I0827 09:18:39.729290 1 factory.go:1392] About to try and schedule pod danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92
I0827 09:18:39.729311 1 scheduler.go:525] Attempting to schedule pod: danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92
I0827 09:18:39.736815 1 scheduler_binder.go:207] AssumePodVolumes for pod "danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92", node "ai-1080ti-25"
I0827 09:18:39.736841 1 scheduler_binder.go:217] AssumePodVolumes for pod "danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92", node "ai-1080ti-25": all PVCs bound and nothing to do
I0827 09:18:39.736901 1 factory.go:1604] Attempting to bind tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92 to ai-1080ti-25
I0827 09:18:39.736909 1 factory.go:1392] About to try and schedule pod danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92
I0827 09:18:39.736923 1 scheduler.go:525] Attempting to schedule pod: danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92
E0827 09:18:39.739886 1 factory.go:1519] Error scheduling danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92: pod tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92 had been predicated!; retrying
I0827 09:18:39.739927 1 factory.go:1613] Updating pod condition for danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92 to (PodScheduled==False)
E0827 09:18:39.742164 1 scheduler.go:546] error selecting node for pod: pod tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92 had been predicated!
This is the gpu-admission-extender-ctr container log:
W0827 08:49:06.183834 1 reflector.go:302] k8s.io/[email protected]/tools/cache/reflector.go:98: watch of *v1.ConfigMap ended with: too old resource version: 472556199 (472557230)
W0827 09:02:18.188971 1 reflector.go:302] k8s.io/[email protected]/tools/cache/reflector.go:98: watch of *v1.ConfigMap ended with: too old resource version: 472562446 (472562769)
W0827 09:18:30.195060 1 reflector.go:302] k8s.io/[email protected]/tools/cache/reflector.go:98: watch of *v1.ConfigMap ended with: too old resource version: 472567868 (472569885)
I am eager to know whether this problem is related to mixing in NVIDIA GPU pods, because it will affect whether we can use this in our production environment. What puzzles me is that some mixed-use scenarios run correctly. Looking forward to your reply.
gpu-admission doesn't have a view of your nvidia.com/gpu pods, so it thinks card 0 is the best fit for your pod, but gpu-manager has the real card usage and recommends using card 4. A cluster should have only one controller controlling the GPU cards.
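A rough sketch of why the two views diverge in this scenario (assumed logic for illustration, not either project's actual code; `pickByAdmissionView` and `pickByManagerView` are made-up names):

```go
package main

import "fmt"

// gpu-admission's view: it only accounts for cards consumed by
// tencent.com/vcuda-* pods, so it takes the first card with free vcuda-core.
func pickByAdmissionView(vcudaUsedCores []int) int {
	for idx, used := range vcudaUsedCores {
		if used < 100 {
			return idx
		}
	}
	return -1
}

// gpu-manager's view: it also knows which cards are held by nvidia.com/gpu
// pods (i.e. busy on the node) and skips them.
func pickByManagerView(busyWithWholeCardPods []bool, vcudaUsedCores []int) int {
	for idx := range vcudaUsedCores {
		if !busyWithWholeCardPods[idx] && vcudaUsedCores[idx] < 100 {
			return idx
		}
	}
	return -1
}

func main() {
	vcudaUsed := make([]int, 8) // no vcuda pods yet
	// GPUs 0-3 and 5-7 run the nvidia.com/gpu test pods; GPU 4 is idle.
	busy := []bool{true, true, true, true, false, true, true, true}

	fmt.Println("gpu-admission predicates GPU", pickByAdmissionView(vcudaUsed)) // 0
	fmt.Println("gpu-manager picks GPU", pickByManagerView(busy, vcudaUsed))    // 4
}
```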
I don't have NVIDIA/k8s-device-plugin installed, but I also got this error.
I met this issue too. I added some debug logs, as below:
I0122 04:41:31.480076 13774 tree.go:119] Update device information
I0122 04:41:31.486222 13774 tree.go:135] node 0, pid: [], memory: 0, utilization: 0, pendingReset: false
I0122 04:41:31.492388 13774 tree.go:135] node 1, pid: [], memory: 0, utilization: 0, pendingReset: false
I0122 04:41:31.492898 13774 allocator.go:375] Tree graph: ROOT:2
|---PHB (aval: 2, pids: [], usedMemory: 0, totalMemory: 12456230912, allocatableCores: 0, allocatableMemory: 0)
|   |---GPU0 (pids: [], usedMemory: 0, totalMemory: 6233391104, allocatableCores: 100, allocatableMemory: 6233391104)
|   |---GPU1 (pids: [], usedMemory: 0, totalMemory: 6222839808, allocatableCores: 100, allocatableMemory: 6222839808)
I0122 04:41:31.492918 13774 allocator.go:386] Try allocate for 15465015-367d-4f84-9610-0d220b917f99(nvidia), vcore 50, vmemory 3221225472
I0122 04:41:31.492943 13774 share.go:58] Pick up 1 mask 10, cores: 100, memory: 6222839808
I0122 04:41:31.493003 13774 allocator.go:445] devStr: /dev/nvidia0
I0122 04:41:31.493019 13774 allocator.go:447] predicateNode: GPU0
I0122 04:41:31.493043 13774 allocator.go:448] nodes[0]: GPU1
E0122 04:41:31.493056 13774 allocator.go:736] Nvidia node mismatch for pod vcuda(nvidia), pick up:/dev/nvidia1 predicate: /dev/nvidia0
I wonder why nodes[0] in particular is used when there are many cards? https://github.com/tkestack/gpu-manager/blob/808ff8c29a361f04499ff62242cd56e4f93089f6/pkg/services/allocator/nvidia/allocator.go#L452
After I removed those four lines, it works normally!
@qifengz deleting that error-checking code is not recommended. It looks like it worked because gpu-manager could actually use nodes[0], but gpu-admission would mistakenly think the pod was using the predicateNode.
You can check this with nvidia-smi and the pod annotations.
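For the annotation side of that check, a small client-go snippet like the following can dump the relevant annotations for comparison with nvidia-smi on the node (illustrative only; it assumes a kubeconfig at the default path, and older client-go versions use a Get(name, options) signature without a context):

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Pod name and namespace are placeholders; use the failing pod here.
	pod, err := client.CoreV1().Pods("danlu-efficiency").Get(
		context.TODO(), "test3", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for k, v := range pod.Annotations {
		// e.g. tencent.com/gpu-assigned, tencent.com/predicate-gpu-idx-0
		fmt.Printf("%s: %s\n", k, v)
	}
}
```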
Hi @qifengz, @zwpaper has pointed out the reason. If you read the code, you may find that gpu-manager actually sorts GPUs according to topology and the number of processes running on the GPU, but gpu-admission DOES NOT KNOW this information.
https://github.com/tkestack/gpu-manager/blob/1d0955c69551018c04763675fad407f8243fdf4d/pkg/algorithm/nvidia/link.go#L42
https://github.com/tkestack/gpu-manager/blob/1d0955c69551018c04763675fad407f8243fdf4d/pkg/algorithm/nvidia/share.go#L47
- nvidia.ByType: sort by GPU topology (see nvidia-smi topo --matrix)
- nvidia.ByPids: sort by the number of processes running on the GPU
https://github.com/tkestack/gpu-admission/blob/47d56ae99ef7f24f2c9c4d33d17567e2e52f3ba2/pkg/algorithm/exclusive.go#L48-L51
https://github.com/tkestack/gpu-admission/blob/47d56ae99ef7f24f2c9c4d33d17567e2e52f3ba2/pkg/algorithm/share.go#L47
Therefore, my approach is to delete these two algorithms so as to stay consistent with the gpu-admission sorting algorithm. Although this loses a key characteristic of gpu-manager, it minimizes the probability of conflict. Hope it helps. :)
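A simplified sketch of the two orderings being contrasted here (the struct and comparators are illustrative, not the linked code):

```go
package main

import (
	"fmt"
	"sort"
)

type card struct {
	id        int
	pids      int // processes already running on the card
	linkScore int // lower = better topology placement
	freeCores int // free vcuda-core as gpu-admission sees it
}

// managerOrder: roughly nvidia.ByType then nvidia.ByPids.
func managerOrder(cards []card) []card {
	out := append([]card(nil), cards...)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].linkScore != out[j].linkScore {
			return out[i].linkScore < out[j].linkScore
		}
		return out[i].pids < out[j].pids
	})
	return out
}

// admissionOrder: without topology/pid data, rank by free vcuda-core and
// fall back to the device id.
func admissionOrder(cards []card) []card {
	out := append([]card(nil), cards...)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].freeCores != out[j].freeCores {
			return out[i].freeCores > out[j].freeCores
		}
		return out[i].id < out[j].id
	})
	return out
}

func main() {
	cards := []card{
		{id: 0, pids: 1, linkScore: 2, freeCores: 100},
		{id: 4, pids: 0, linkScore: 1, freeCores: 100},
	}
	fmt.Println("gpu-manager order:  ", managerOrder(cards))
	fmt.Println("gpu-admission order:", admissionOrder(cards))
}
```

With every card reporting full free vcuda-core, gpu-admission falls back to the lowest device id (GPU 0), while a topology/pid-aware ordering prefers the idle GPU 4, which matches the mismatch reported above.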
After deleting these two algorithms, does it cause any other problems?
Performance was not very satisfactory in my tests.
@qifengz Your case is the same as mine.
|---PHB (aval: 2, pids: [], usedMemory: 0, totalMemory: 12456230912, allocatableCores: 0, allocatableMemory: 0)
| |---GPU0 (pids: [], usedMemory: 0, totalMemory: 6233391104, allocatableCores: 100, allocatableMemory: 6233391104)
| |---GPU1 (pids: [], usedMemory: 0, totalMemory: 6222839808, allocatableCores: 100, allocatableMemory: 6222839808)
If you look carefully, you will find that the total memory of GPU1 (6222839808) is less than that of GPU0 (6233391104).
So even if neither GPU is allocated, gpu-manager will pick GPU1 while gpu-admission predicates GPU0, which leads to the mismatch.
The code that caused this issue is in https://github.com/tkestack/gpu-manager/blob/808ff8c29a361f04499ff62242cd56e4f93089f6/pkg/device/nvidia/sort.go#L59-L61
#74 fixed it.
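A minimal sketch of the kind of total-memory ordering being described (field and variable names are illustrative, not the exact sort.go code):

```go
package main

import (
	"fmt"
	"sort"
)

type node struct {
	name        string
	totalMemory uint64
}

func main() {
	// Values from the debug log above: the two cards are both idle but have
	// slightly different total memory.
	nodes := []node{
		{name: "GPU0", totalMemory: 6233391104},
		{name: "GPU1", totalMemory: 6222839808},
	}
	// An ordering that puts the card with less total memory first makes
	// gpu-manager try GPU1 before GPU0.
	sort.SliceStable(nodes, func(i, j int) bool {
		return nodes[i].totalMemory < nodes[j].totalMemory
	})
	fmt.Println("gpu-manager would try", nodes[0].name, "first") // GPU1
	fmt.Println("gpu-admission predicated GPU0")
}
```

Since GPU1's total memory is a few MiB smaller, it sorts first even though both cards are idle, while gpu-admission still predicates GPU0, hence the mismatch.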
@fighterhit @HeroBcat Got it, that's helpful!
I fixed the problem by deleting that code, and it works fine. Because gpu-admission gets usage from the k8s info, even without this check it still picks up the final scheduling info from k8s every 30s. If you want to use the fixed code, you can fork it from my GitHub (I also fixed some build errors):
https://github.com/lynnfi/gpu-manager